Integrating Multiple Microarray Data for Cancer Pathway Analysis Using Bootstrapping K-S Test

Previous applications of microarray technology for cancer research have mostly focused on identifying genes that are differentially expressed between a particular cancer and normal cells. In a biological system, genes perform different molecular functions and regulate various biological processes via interactions with other genes thus forming a variety of complex networks. Therefore, it is critical to understand the relationship (e.g., interactions) between genes across different types of cancer in order to gain insights into the molecular mechanisms of cancer. Here we propose an integrative method based on the bootstrapping Kolmogorov-Smirnov test and a large set of microarray data produced with various types of cancer to discover common molecular changes in cells from normal state to cancerous state. We evaluate our method using three key pathways related to cancer and demonstrate that it is capable of finding meaningful alterations in gene relations.


Introduction
Microarray technology, monitoring mRNA abundance of tens of thousands of genes simultaneously, provides an efficient tool to characterize a cell at the molecular level. It has been applied to a variety of research areas, ranging from biomarker detection [1,2] to gene regulatory networks [3][4][5] and cancer classification [6][7][8]. When applied to cancer research, microarray technology typically measures gene expressions of cancer and normal tissues or different types of cancer. One important area in microarray-based cancer research is to identify genes that are differentially expressed between cancerous and normal cells and to discover diagnostic and prognostic signatures in order to predict therapeutic responses. Over the years, many statistical methods for the identification of differentially expressed genes have been developed, and most of them focused on the expression analysis of individual genes [9][10][11][12][13][14][15]. However, the simple list of individual differentially expressed genes can only tell us which genes are altered by biological differences between different cell types and/or states. It cannot explain the reasons for the significant alterations in gene expression levels and the effects of such changes on other genes' activities. It is well known that in a biological system genes interact with each other forming various biological pathways in order to carry out a multitude of biological processes. To better understand the roles of these differentially expressed genes and their interactions in a complex biological system, a comprehensive pathway analysis is needed. Since the identification of biological pathways is significantly influenced by those differentially expressed genes from different datasets or different statistical methods [16,17], we reason here that an integration of multiple cancer microarray datasets and identification of the most common pathways from these data would reveal key relationships between crucial genes in carcinogenesis. Our focus on the interactions and pathways of cancer-related genes is important since changes in gene relations and key pathways are more relevant to carcinogenesis than individual genes alone.
Several statistical methods have been proposed for the analysis of differential gene coexpression patterns. Li [18] observed differences of gene coexpression patterns in different cellular states and attributed these changes in gene coexpression patterns to some third set of influential genes.
Lai et al. [19] proposed a similar method to identify differential gene-gene coexpression patterns in cells from normal state to cancerous state. However, these methods often perform the analyses on one single microarray dataset and typically generate unreliable results; the results from different microarray datasets and various statistical methods could hardly overlap using these methods [20,21]. Therefore, the confidence level for discoveries based on these methods is low. Furthermore, these methods fail to grasp the common molecular changes in cells transitioning from a normal state to the cancerous state. Choi et al. [22] introduced a model to find differential gene coexpression patterns related to cancer by combining independent datasets for different cancers. They used a model similar to the t-test, which only considered the mean and variance of two groups of samples. It is well known that traditional t-test has two disadvantages for microarray data analysis: first, it assumes that the datasets under analysis have a normal distribution, which is usually violated in microarray datasets; second, if the number of genes is large and the number of samples is small, some of the standard deviations will be extremely small, and therefore the test statistics will be very high, which may lead to a significant bias. Nonparametric statistical test methods, such as the K-S test, require fewer assumptions for the data and may be preferred, especially, when the number of samples is small.
In this paper, we propose a novel method to detect the differentially changed gene relations in cancer versus normal tissues. We collect 36 datasets across different microarray platforms and from various types of cancer. These 36 datasets contain both normal and tumor samples, which can subsequently yield two Pearson correlation coefficient vectors for every gene pair, one for normal samples and the other for tumor samples. We then perform a bootstrapping K-S test to identify some differentially changed gene relations. Finally we verify our results with three key pathways related to cancer and demonstrate that our method can find some meaningful alterations of gene relations.

Microarray Datasets.
We collected 36 microarray datasets from NCBI (Gene Expression Omnibus GEO) [23]. As shown in Table 1, these microarray datasets contain both normal and tumor samples across 21 different types of cancer, and their platforms come from one of the three platforms: GPL570 (Affymetrix GeneChip Human Genome U133 Plus 2.0 Array), GPL96 (Affymetrix GeneChip Human Genome U133 Array Set HG-U133A), and GPL91 (Affymetrix GeneChip Human Genome U95 Version Set HG-U95A). We divided every dataset into two expression data matrices: one matrix includes all normal samples, and the other includes all tumor samples. To integrate multiple microarray datasets across different platforms, we mapped each probe in different platforms to a unique Entrez Gene ID or a unique UniGene symbol. For genes with more than one probe in one platform, we chose the probe with the highest mean expression value.

Cancer-Associated Pathways and Extended Gene Networks.
We applied our method to analyze three cancerassociated pathways. These pathways are related to three common traits in most and perhaps all types of human cancer: self-sufficiency in growth signals, insensitivity to antigrowth signals, and evading programmed cell death (apoptosis) [24]. In fact, Hanahan and Weinberg have already identified some signaling pathways to demonstrate the capabilities cancer cells acquire during tumor development in [24]. We extended these signaling pathways to three relatively complete and larger cancer-associated pathways (antigrowth signaling, apoptosis, and growth signaling pathways) from the cell cycle pathway, the apoptosis pathway and the MAPK pathway in KEGG [25]. We used these three pathways (i.e., cell cycle, apoptosis, and MAPK pathways) as our seeds and the genes in these pathways as our seed genes. Next we constructed three gene networks corresponding to the three cancer-associated pathways from HPRD (Human Proteins Reference Database, http://www.hprd.org/) and TRANSFAC [26] based on seed genes and their interacting partners. We downloaded the protein-protein interaction (PPI) data released by HPRD on September 1, 2007. This PPI dataset contains 37107 human binary protein-protein interactions whose supporting experiments are indicated as in vivo, in vitro, or yeast two-hybrid. We also collected 1042 transcription factor-target gene relations on human species from TRANSFAC. So our gene networks included seed genes, protein interaction partners, and transcription factors (TFs) of seed genes or target genes for which seed genes served as their TFs.

Detecting Differential Relations by Bootstrapping K-S Test.
We used the Kolmogorov-Smirnov test (K-S test) to determine whether the distributions of values in two datasets differed significantly. The two-sample K-S test is the most useful for comparing two samples because it is nonparametric and distribution-free [27]. The null hypothesis for this test is that two datasets are drawn from the same distribution. The alternative hypothesis is that they are drawn from different distributions.
For n i.i.d samples X 1 , . . . , X n with some unknown distribution, we can define an empirical distribution function by where X 1 , . . . , X n are ordered from the smallest to the largest value. The Kolmogorov-Smirnov statistic for a given function S(x) is D n will converge to 0 if the sample comes from distribution S(x) [27]. Moreover, the cumulative distribution function of Kolmogorov distribution is It is easy to prove that √ nD n = √ n max x |S n (x) − S(x)| will converge to the Kolmogorov distribution [27]. Therefore if √ nD n > K α = Pr(K ≤ K α ) = 1 − α, the null hypothesis for the Kolmogorov-Smirnov test will be rejected at level α.
For the case of determining whether the distributions of two data vectors differ significantly, the Kolmogorov-Smirnov statistic is Journal of Biomedicine and Biotechnology and the null hypothesis will be rejected at level α if The P-value from the K-S test can measure the confidence of the comparison results against the null hypothesis. Obviously, the smaller the P-value, the more confident we are of rejecting the null hypothesis. Assume that we have n microarray datasets and a list of m genes, we denote the expression data matrix for normal samples as and the expression data matrix for tumor samples as where p(k) is the number of normal samples in the kth dataset, and q(l) is the number of tumor samples in the lth dataset.
For these two types of expression data matrices, each row represents one gene, and each column represents one sample. The correlation coefficient for gene i and gene j from the kth normal sample can be calculated by where X k i is the average value of expression levels for gene i. The correlation coefficient for every gene pair from tumor samples can be calculated similarly.
We use the bootstrapping K-S test to detect some gene relations with different PC (Pearson coefficient) distributions. The bootstrapping method generates N bootstrapping samples NPC and TPC by repeatedly sampling with replacement from the original NPC i j and TPC i j (e.g., Step 4), respectively. It can give us an empirical distribution of Pvalue θ, with which, we can estimate the probability that the distribution of two PC vectors are different. In our computational experiment, for a gene pair, if its value of Pr(θ < 0.05) was larger than 0.8, we considered it as a pair of genes with the correlation relation significantly different between normal and cancer cells.
Our method can be described as follows.
Step 1. Compute n correlation coefficient Matrices NPC 1 -NPC n from the normal samples in n datasets for every gene pairs. For example, NPC 1 is an m × m Matrix from normal samples in the first dataset, and NPC 1 i j represents the correlation coefficient between gene i , and gene j.
Step 2. Compute n correlation coefficient Matrices TPC 1 -TPC n from the tumor samples in the n datasets for every gene pair.
Step 3. For every gene pair (gene i and gene j), let Step 4. Perform the following (N is the number of samples we will generate using bootstrapping). for k = 1 to N Do generate bootstrap samples NPC and TPC from NPC i j and TPC i j , respectively. θ k = P-value of K-S test on NPC and TPC. End-for Output Pr(θ < 0.05) = (θ < 0.05)/N.

Experimental Results
In this section, we applied the bootstrapping K-S test method to analyze three cancer related pathways.

Antigrowth Signaling Pathway.
Antigrowth signals can control proliferation in normal samples. Cancer cells have the ability to evade these antiproliferation signals. In the antigrowth signaling pathway, transforming growth factor beta (TGFβ) initiates this pathway by binding to two TGFβ receptors, Tgfbr1 and Tgfbr2. These two activated Tgfβ receptors can phosphorylate Smad2, Smad3, and Smad4 [28]. The SMAD family proteins then transduce antigrowth signals to the cell cycle inhibitors p21, p16, p27, and p15, which can inhibit the action of cyclin-CDK complex. The cyclin-CDK complex can phosphorylate RB and make RB dissociate from the E2F/RB complex to liberate E2F to activate the cell cycle procession from G1 to S phase (Figure 1(a)). There are 19 genes in the antigrowth signaling pathway (Figure 1(a)). We found 689 unique genes related to these 19 genes from TRANSFAC and HPRD. Among these 708 genes, there were 4215 paired gene interactions, among which the correlation relations of 47 gene pairs were identified as significantly changed between normal and cancer cells. Among these 47 relations, we detected a cluster around SMAD family proteins which contained 15 relations with different distributions between normal samples and tumor samples (Figure 1(b)). Most of them came from largescale protein-protein interaction experiments without the associated molecular function. For example, (Smad1-Arl4d), (RHOD-Smad2), and (WEE1-Smad3) in [29], (PAPOLA-Smad2), (SNRP70-Smad5), (GPNMB-Smad4), (PSMD11-Smad3), and (Smad9-MBD1) in [30], and (EWSR1-Smad4) in [31], all of them were detected based on large-scale protein-protein interaction experiments without annotation  of molecular function. Our results indicate that although their associated functions and internal mechanisms are still unclear, these gene pairs are related to the TGFβ-SMAD signaling pathway, and the relation between the two genes in each pair is significantly different in cancer and normal cells. Additionally, we identified some differentially changed relations with known molecular functions as follows: (1) MAGI2 (a.k.a. ARIP1)-Smad3. MAGI2 (ARIP1) can interact with Smad3, and overexpression of ARIP1 can significantly suppress Smad3-induced transcriptional activity [32]. We validated this from the boxplot for MAGI2 (ARIP1)-Smad3 (Figure 2(a)). In normal samples, MAGI2 (ARIP1) and Smad3 showed a high positive correlation, while they had a high negative correlation in tumor samples.
(2) EWSR1-Smad4. Although the experiment type of the interaction between EWSR1 and Smad4 is yeast two-hybrid [31], mutations in EWSR1 are known to cause Ewing sarcoma and other members of the Ewing family of tumors [33]. From the boxplot for EWSR1-Smad4, we found that the third quartile is the densest part of the whole distribution for both normal and tumor samples. The third quartile for normal samples showed a positive correlation whereas that for tumor samples showed a negative correlation (Figure 2(b)). Therefore, we suspect that EWSR1 can suppress the activity of Smad4 in tumor samples.
(3) TRAP1-Tgfbr2. TRAP1 has been shown to bind to TGFβ receptors and play a role in TGFβ signaling pathway. TRAP1 can interact with Smad4 and affect the SMAD-mediated signal transduction pathway. Mutant TRAP1 can prevent the formation of the Smad2-Smad4 complex to inhibit the TGFβ Signaling pathway [34]. In the boxplot for TRAP1-Tgfbr2 (Figure 2(c)), the densest quartile for tumor samples showed a high negative correlation.

Apoptosis Pathway.
Cancer cells have the ability to evade programmed cell death or apoptosis. TNFα, FASL, TRAIL, and other genes can initiate apoptosis by binding to their receptors such as TNFR1, FAS, and TRAIL-R. Many apoptosis signals induce mitochondrial changes.
Mitochondria can help transduce the apoptosis signals by releasing cytochrome C (Cytc), a potent catalyst of apoptosis. There are two different Bcl-2 family members: proapoptotic members (Bid, BAD) and antiapoptotic members (Bcl-2, Bcl-xl), which activate and inhibit, respectively, the release of Cytc. Finally, two key caspases (Casp8 and Casp9) activate other downstream caspases that perform the cascading events of cell death (Figure 3(a)).
In our results, we detected 33 relations with different distributions in the apoptosis pathway, and some are supported by existing experimental evidence. Examples include (Figure 3(b)) the following: (1) PUMA-Bcl-XL (BCL2L1). PUMA can interact with Bcl-XL and meanwhile PUMA can also neutralize and antagonize all the Bcl-2-like proteins [35]. From the boxplot for PUMA-Bcl-XL, we can find that Bcl-XL, and PUMA showed a higher negative correlation in normal samples than in tumor samples (Figure 4(a)).
(2) AKT1-BAD. Active forms of Akt can phosphorylate BAD in vivo and in vitro to prevent it from promoting cell death [36]. In the boxplot for AKT1-BAD, the first quartile, the densest quartile for normal samples, showed a higher positive correlation than the second quartile, the densest for tumor samples (Figure 4(b)). So we speculated that Akt can suppress BAD's activity in tumor samples.
(4) TNFR1-RIPK1 (RIP). The interaction between the death domain of TNFα receptor-1 (TNFR1) and TRADD can trigger distinct signaling pathways leading to apoptosis. TRADD also interacts strongly with another death domain protein; RIP and RIP plays an important role in the TNF signaling cascades leading to apoptosis [38]. In the boxplot for TNFR1-RIPK1, TNFR1 and RIPK1 exhibited high positive correlation in normal samples (Figure 4(d)).
(5) TNFR1-RASSF1. RASSF1A is a tumor suppressor gene. Apoptosis initiation by TNFα or TRAIL recruits RASSF1A and MAP-1 to form complexes. RASSF1A and MAP-1 are the key links between death receptors and the apoptotic machinery [39]. This was verified by the Boxplot for TNFR1-RASSF1. In most normal samples, these genes showed a high positive correlation. In most tumor samples, they showed a zero or negative correlation (Figure 4(e)).
(6) IAP-CASP9. Inhibitor of apoptosis (IAP) suppresses the activities of caspases and inhibits different apoptotic pathways [40]. IAP and CASP9 showed a high negative correlation in tumor samples (Figure 4(f)). Among the eight differential gene relations in Figure 3(b), three of them were in the seed pathway: TRAIL-R → FADD, IAP → CASP9, and AKT → BAD, which demonstrates the effectiveness of the proposed method.

Growth Signaling Pathway.
Cancer cells have the ability to produce their own growth promoting signals. EGF, TGFα, and PDGF are activated and then bind to their receptors to transduce the growth signals. The activated growth factor receptors can in turn activate the SOS-Ras Raf Mapk cascade. In the growth signal pathway ( Figure 5), Ras, JUN, and Fos are oncogenes.
We could find 68 relations with different distributions in the growth signal pathway, and we discuss three relations as follows: (1) RASSF2-KRAS. Although different forms of Ras are frequently thought of as oncogenes, they also have the ability to produce antigrowth effects such as cell cycle arrest, differentiation, and apoptosis. RASSF2 can bind directly to K-Ras. Moreover, RASSF2 can inhibit the growth of tumor cells, and the activated K-Ras can enhance this ability [41]. This might be why RASSF2 and RAS showed a high positive correlation in normal samples in the boxplot (Figure 6(a)).
(2) MAZ-MYC. The MAZ family can increase the oncogene MYC's transcriptional activity [42]. As expected, MAZ and MYC demonstrated a higher positive correlation in tumor samples (Figure 6(b)).
(3) PLSCR1-EGFR. Activated epidermal growth factor receptors (EGFRs) can both physically and functionally interact with PLSCR1. In turn, PLSCR1 can interact with Shc and thus accelerate the activation of Src kinase through the EGF receptor, while Src can initiate some activating pathway for the oncogene JUN [43]. In the boxplot for PLSCR1-EGFR, the densest quartile for normal samples showed a low negative correlation, whereas the densest quartile for tumor samples showed a low positive correlation ( Figure 6(c)).

Conclusion and Discussion
After several decades of cancer research, some details of the underlying mechanisms of cancer at the gene level are still unclear. In this paper, we propose an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets generated from 21 different types of cancer in order to identify gene pairs that have different relationships in normal versus cancer tissues. The significant alteration of gene relations can greatly extend our understanding of the molecular mechanisms of human cancer. In our method, we obviate the disadvantage of the traditional t-test, which only considers the mean and variance of samples and fails in the analysis of microarray data with small numbers of samples. Instead of the t-test, we propose the use of the bootstrapping K-S test method to detect gene pairs with different distributions of Pearson correlation coefficient values in normal and tumor samples. The experimental results demonstrated that our method could find meaningful alterations in gene relations and opened a potential door for further cancer research.