Inflammation, Adenoma and Cancer: Objective Classification of Colon Biopsy Specimens with Gene Expression Signature

Gene expression analysis of colon biopsies using high-density oligonucleotide microarrays can contribute to the understanding of local pathophysiological alterations and to the functional classification of adenomas (15 samples), colorectal carcinomas (CRC) (15) and inflammatory bowel diseases (IBD) (14). Total RNA was extracted from frozen colonic biopsies, amplified and biotinylated. Genome-wide gene expression profiles were evaluated by HGU133plus2 microarrays and verified by RT-PCR. We applied two independent methods for data normalization and used PAM for feature selection. Leave-one-out stepwise discriminant analysis was performed. Top validated genes included collagenIVα1, lipocalin-2, calumenin and aquaporin-8 genes in CRC; CD44, met proto-oncogene, chemokine ligand-12, ADAM-like decysin-1 and ATP-binding cassette-A8 genes in adenoma; and lipocalin-2, ubiquitin D and IFITM2 genes in IBD. The best differentiating markers between ulcerative colitis and Crohn's disease were cyclin-G2, tripartite motif-containing-31, TNFR shedding aminopeptidase regulator-1 and AMICA. The discriminant analysis classified 96.2% of the samples correctly using 7 discriminatory genes (indoleamine-pyrrole-2,3-dioxygenase, ectodermal-neural cortex, TIMP3, fucosyltransferase-8, collectin sub-family member 12, carboxypeptidase D, and transglutaminase-2). Using routine biopsy samples we successfully performed whole genomic microarray analysis to identify discriminative signatures. Our results provide further insight into the pathophysiological background of colonic diseases and establish a data warehouse that can be mined further.


Introduction
Colorectal cancer (CRC) is one of the most frequent causes of cancer death in Western countries. CRC frequently follows various high-risk conditions such as adenomatous polyps and inflammatory bowel disease (IBD). The exact diagnosis of IBD types is still often difficult by conventional histology. High-density oligonucleotide microarray analysis offers an opportunity to study the genetic and gene expression background of these diseases, to understand local pathophysiological alterations and to classify diseases functionally.
To date, microarray analyses reported in the literature have been performed predominantly on surgically removed CRC samples [16], while microarray gene expression profiling of adenomas and IBDs as colorectal diseases predisposing to CRC has been studied less. Besides the publications comparing the gene expression profiles (GEP) of CRC and normal colonic mucosa [10,21,29], a growing number of studies focus on the gene expression background of CRC progression and metastasis development [1-4,22,25,26,42,43], the characterization of CRC subtypes according to mRNA expression [3,14,44], the correlation of GEP with clinicopathological parameters [3,5,7,44], and the generation of mRNA expression-based prognosis [40]. Microarray-based molecular diagnostics of malignancy in colon adenoma and CRC samples were described using 10 [1], 9 [25,26] and 4 [29] adenoma samples compared to adenocarcinoma and normal colonic tissues. Expression microarray analyses of IBD samples were performed to determine the global GEP of mucosal samples from patients with IBD compared to normal controls [23], to identify novel candidates for ulcerative colitis (UC) and/or Crohn's disease (CD) genetic susceptibility [24,30], to find marker genes involved in IBD-related carcinogenesis [38], to compare the expression of the entire chemokine family between IBD patients and normal controls [31], and to examine changes in GEP in peripheral blood mononuclear cells in IBD patients [27].
mRNA expression array analysis is usually performed on high-volume, surgically removed tissues. In the gastrointestinal tract, biopsy samples are routinely taken during endoscopic examination with minimal intervention. The mRNA expression study of these samples could allow further insight into the development of inflammatory, preneoplastic and neoplastic diseases, and these biopsy specimens could be suitable samples for identifying early diagnostic target molecules. Colonic biopsies have previously been used for expression array analysis in only a few cases [15,23,30,31], because, even today, array technology needs significantly more RNA than can be isolated from tiny biopsy specimens. However, new techniques and commercial kits have recently become available for reliable mRNA amplification without any effect on the original gene expression pattern [39].
Genome-wide gene expression profiling studies using microarrays have the potential to improve the diagnosis and treatment of human cancers and other disorders. However, the recently introduced whole genomic oligonucleotide microarrays, which represent more than 47,000 transcripts, have not yet been used in any type of gastrointestinal disorder.
In the present study we aimed to find discriminatory genes between the main diagnostic groups, to develop and test a validation assay system, and to confirm the applicability of biopsy samples for microarray analysis-based classification. A further aim was to search for altered biological pathways that could explain the pathomechanism of these colonic diseases based on whole genomic mRNA microarray results.

Patients and samples
After informed consent, biopsy samples were taken from the colon during the endoscopic intervention before treatment, and stored in RNALater Reagent (Qiagen Inc., US) at −80 °C. Total RNA was extracted from biopsies of 15 patients with tubulovillous/villous adenomas, 15 with colorectal adenocarcinoma (all microsatellite negative), 9 with active ulcerative colitis (UC), 5 with active Crohn's colitis (CD) and of 8 healthy normal controls (Table 1). Detailed patient specification is described in Supplemental Table 1.

Microarray analysis
Total RNA was extracted using the RNeasy Mini Kit (Qiagen, US) according to the manufacturer's instructions. The quantity and quality of the isolated RNA were tested by absorbance measurement and agarose gel electrophoresis. Biotinylated cRNA probes were synthesized from 5-8 µg total RNA and fragmented according to the Affymetrix description using GeneChip cDNA synthesis reagents and sample cleanup module and the Enzo BioArray HighYield RNA Transcript Labeling Kit (https://www.affymetrix.com/support/downloads/manuals/expression_s2_manual.pdf, first version). 10 µg of each fragmented cRNA sample was hybridized to an HGU133 Plus2.0 array (Affymetrix Inc.) at 45 °C for 16 hours. The slides were washed and stained using Fluidics Station 450 and the antibody amplification staining method according to the manufacturer's instructions. The fluorescent signals were detected by a GeneChip Scanner 3000.

Statistical analysis
Pre-processing and quality control
Quality control analyses were performed according to the suggestions of The Tumor Analysis Best Practices Working Group [37]. Scanned images were inspected for artifacts, and the percentage of present calls (> 25%) and the degree of RNA degradation were evaluated. Based on the evaluation criteria all biopsy measurements fulfilled the minimal quality requirements. According to the above recommendations we applied two different normalization methods: MAS 5.0 and RMA [18]. MAS 5.0 applies normalization on an individual chip; it has excellent specificity and good sensitivity. RMA applies cross-project normalization; it has good specificity and excellent sensitivity [37]. Further data analysis and interpretation were carried out with both of these pre-processing methods in order to yield the best comparison and normalization properties across all measurements.
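The present-call criterion above can be illustrated with a minimal Python sketch. It assumes that each array is summarized as a string of per-probe-set detection calls ("P" present, "M" marginal, "A" absent, as produced by MAS 5.0); the array names, the call strings and the helper names `percent_present` and `passes_qc` are hypothetical and serve illustration only.

```python
def percent_present(calls):
    """Percentage of probe sets with a 'P' (present) detection call."""
    return 100.0 * sum(1 for c in calls if c == "P") / len(calls)


def passes_qc(calls, min_present=25.0):
    """True if the array exceeds the minimal present-call percentage."""
    return percent_present(calls) > min_present


# Hypothetical detection calls for two arrays (illustrative values only).
arrays = {"biopsy_01": "PPAPAMAP", "biopsy_02": "AAAAMAAP"}
report = {name: passes_qc(calls) for name, calls in arrays.items()}
print(report)
```

In this toy input the first array has 50% present calls and passes, while the second has 12.5% and would be flagged.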

Feature selection and cluster analysis
We arranged the complete dataset of 52 expression measurements into classes according to the histological properties of the samples. This procedure resulted in six new datasets (CRC/adenoma/IBD vs. normal, CRC DukesB vs. CRC DukesC-D, non-dysplastic adenoma vs. dysplastic adenoma, UC vs. CD), which were treated as autonomous classification problems. To obtain characteristic signal profiles with high predictive power we applied "Prediction Analysis for Microarrays" (PAM) [35]. PAM uses soft thresholding to produce a shrunken centroid, which allows the selection of genes with high predictive potential. In our experimental setup the search for a minimum number of genes with maximum predictive accuracy is not promising, as two different groups could be distinguished with a very short gene list. Therefore we decided to set the PAM threshold lower and to pick the top 100 genes for each condition. Due to the nature of PAM, the resulting gene list is longer at a lower threshold, but all genes significant at a higher threshold are included in any selected set. Finally, the overlap of the two lists, based on the two different normalizations, was taken for further analysis. The functional classification of the most differentially expressed genes was performed according to the analysis of the RMA top 100 genes in each main disease group compared to normal controls.
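The soft-thresholding step and the overlap of the two gene lists can be sketched in a few lines of Python. This is a simplified illustration, not the PAM software itself: it assumes the standardized differences between each class centroid and the overall centroid have already been computed, and the gene names, scores and thresholds are hypothetical.

```python
import math


def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become 0."""
    return math.copysign(max(abs(d) - delta, 0.0), d)


def surviving_genes(scores, delta):
    """Genes whose shrunken score is nonzero for at least one class."""
    return {
        gene
        for gene, per_class in scores.items()
        if any(soft_threshold(d, delta) != 0.0 for d in per_class)
    }


# Hypothetical standardized centroid differences for two classes,
# computed after two different normalizations (illustrative values only).
scores_mas5 = {"geneA": (2.9, -2.9), "geneB": (1.2, -1.2), "geneC": (0.3, -0.3)}
scores_rma = {"geneA": (3.1, -3.1), "geneB": (0.4, -0.4), "geneC": (1.5, -1.5)}

high = surviving_genes(scores_rma, delta=2.0)  # strict threshold, short list
low = surviving_genes(scores_rma, delta=1.0)   # lower threshold, longer list
overlap = surviving_genes(scores_mas5, 1.0) & surviving_genes(scores_rma, 1.0)
print(sorted(high), sorted(low), sorted(overlap))
```

The toy data reproduce the nesting property described above: every gene that survives the higher threshold also survives the lower one, and only genes selected under both normalizations remain in the overlap.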
To visualize the discriminative patterns, hierarchical clustering was performed using the Genesis software [34]. Annotation was performed using the Affymetrix Netaffx analysis centre (http://www.affymetrix.com/analysis/index.affx). The combined datasets for further analysis are available in the Gene Expression Omnibus databank (http://www.ncbi.nlm.nih.gov/geo/), series accession number: GSE4183.

Discriminant analysis
Further data analysis was performed in the SAS v.8.2 (SAS Institute Inc. Cary, NC, USA) program package and the discriminant analysis in the SPSS v.15.0 program (SPSS Inc., USA). We performed a stepwise discriminant analysis among the groups by forward selection of quantitative variables. The set of variables that make up each class was assumed to be multivariate normal with a common covariance matrix. Finally, we retained the significant discrimination model (significant Wilks' lambda) and fixed the most important discriminatory genes. At the end of the analysis we generated the leave-one-out classification table.
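How a leave-one-out classification table is built can be sketched as follows. This is not the SPSS stepwise procedure: for brevity the sketch swaps in a nearest-centroid classifier, which coincides with linear discriminant analysis only under the additional assumptions of an identity common covariance matrix and equal class priors. The two-gene expression values and the labels are hypothetical.

```python
from collections import defaultdict


def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def nearest_centroid(train, sample):
    """Label of the class whose centroid is closest to the sample."""
    by_class = defaultdict(list)
    for features, label in train:
        by_class[label].append(features)

    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(sample, c))

    return min(by_class, key=lambda lab: sq_dist(centroid(by_class[lab])))


def leave_one_out(data):
    """Return (accuracy, table); table[(true, predicted)] counts samples."""
    table = defaultdict(int)
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]  # hold out sample i
        table[(label, nearest_centroid(train, features))] += 1
    correct = sum(n for (true, pred), n in table.items() if true == pred)
    return correct / len(data), dict(table)


# Hypothetical expression values of two genes (illustrative only).
data = [
    ([1.0, 5.0], "normal"), ([1.2, 4.8], "normal"), ([0.9, 5.1], "normal"),
    ([4.0, 1.0], "CRC"), ([4.2, 1.1], "CRC"), ([3.9, 0.8], "CRC"),
]
accuracy, table = leave_one_out(data)
print(accuracy, table)
```

Each sample is classified by a model that never saw it, so the resulting table gives a less optimistic estimate than resubstitution on the full dataset.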

Taqman RT-PCR
TaqMan real-time PCR was used to measure the expression of 52 selected genes using an Applied Biosystems Micro Fluidic Card System in 36 samples, where sufficient RNA could be extracted (Table 1). The measurements were performed using an ABI PRISM 7900HT Sequence Detection System as described in the product's User Guide (http://www.appliedbiosystems.com, CA, USA). For data analysis the SDS 2.2 software was used. The extracted delta Ct values (which represent the expression normalized to the ribosomal 18S expression) were grouped according to the histological groups. Then the Student's t-test was performed to compare the expression values between groups.
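The delta Ct normalization and the group comparison can be illustrated with a short Python sketch using only the standard library. The Ct values are hypothetical, and the sketch returns only the pooled-variance t statistic and its degrees of freedom; obtaining a p-value additionally requires the t distribution (available, for example, through `scipy.stats.ttest_ind`).

```python
import math
from statistics import mean, variance


def delta_ct(ct_target, ct_reference):
    """Ct of the target gene minus Ct of the reference (here 18S).

    A lower delta Ct corresponds to higher relative expression.
    """
    return ct_target - ct_reference


def student_t(a, b):
    """Two-sample Student's t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


# Hypothetical (Ct_target, Ct_18S) pairs for two histological groups.
normal = [delta_ct(t, r) for t, r in [(28.0, 12.0), (29.0, 12.0), (27.0, 12.0)]]
crc = [delta_ct(t, r) for t, r in [(24.0, 12.0), (25.0, 12.0), (23.0, 12.0)]]

t_stat, df = student_t(normal, crc)
print(round(t_stat, 3), df)
```

In this toy example the CRC group has the lower delta Ct, which means higher expression of the hypothetical target gene relative to 18S.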

Identification of discriminatory genes among the main diagnostic groups
CRC cases are characterized by differentially expressed genes involved in cell proliferation (pleiotrophin, insulin-like growth factor binding protein 5, REG1A, WNT1 inducible signaling pathway protein 1), adhesion (MCAM, collagens, entactin, laminin gamma 1), and transport (such as aquaporin 8, lipocalin 2 and collagen 4 A1). Adenoma cases showed altered gene expression in the transport (like ABCA8, TRPM6), adhesion (such as CXCL12, CD44, ADAM-like decysin 1), metabolism (like carbonic anhydrase I) and proliferation (such as MET proto-oncogene) functional groups. IBD cases are characterized by gene expression changes in immune regulation (such as IFITM3, IFITM1 and proteasome subunit, beta type, 9), cell proliferation (REG1A, tryptophanyl-tRNA synthetase, interferon stimulated gene 20kDa) and metabolism (like chitinase 3-like 1, carbonic anhydrase I, zinc finger protein 91). The top 100 genes for each condition were picked and the overlaps of the two lists, based on the two different normalizations, were determined. Discriminatory genes between CRC/adenoma/IBD and normal biopsy samples are listed in Table 2 and are considered the disease type-specific mRNA expression marker groups. The list of the top 100 genes for each analysis setting, including complete annotation, and the complete microarray dataset are shown as supplementary information in Supplemental Table 2. The average within-PAM cross-validation misclassification error was 7.3%.
To visualize the expression changes, we have clustered the selected top genes of all biopsy samples to detect similarities across the sample groups (Fig. 1). The dendrogram of the 52 colonic cases shows the discrimination potential of the selected genes. The transcripts representing the same gene are found to fit into the same cluster.

Discrimination of colonic disease subtypes
The subclassifying genes within the main disease groups were also identified. The metabolic and transport processes (represented by genes like DnaJ homolog subfamily C member 10, coatomer protein complex subunit beta) mainly differ between CRC subgroups. In advanced stages of CRC, downregulation of apoptosis (such as TIA1 cytotoxic granule-associated RNA binding protein, forkhead box O3A) and immune response (like immunoglobulin heavy constant mu, 2'-5'-oligoadenylate synthetase 2) was observed, while carbohydrate and fatty acid metabolism (like glutamine-fructose-6-phosphate transaminase 1, GDP-mannose 4,6-dehydratase, sterol-C4-methyl oxidase-like, lanosterol synthase) and energy metabolism related genes (such as ATPase inhibitory factor 1) showed higher mRNA expression levels in parallel with CRC progression. In adenoma samples, upregulation of proliferation (such as interferon gamma-inducible protein 16, aminopeptidase A, tumor protein D52-like 2) and DNA replication and transcription regulation (like IGF1, nuclear factorlike 3, zinc finger protein 452) and downregulation of immune and defense response (such as immunoglobulin heavy constant mu, T cell receptor gamma variable 9, interferon regulatory factor 4 and tryptase alpha) were found during the development of dysplastic alterations. CD cases are mainly characterized by increased expression of carbohydrate metabolism genes (like galactosidase alpha, fructose-2,6-biphosphatase 4, maltase-glucoamylase), while certain cell proliferation (such as septin 10, platelet derived growth factor D, cyclin G2), apoptosis (such as BCL10, BIRC4, egl nine homolog 3), immune regulation (like decay accelerating factor, CD24, CEACAM1), transport (like dual oxidase 2, P450 (cytochrome) oxidoreductase and lipocalin 2), and ubiquitin-dependent protein catabolism (such as tripartite motif-containing 31, ubiquitin-conjugating enzyme E2, J1, and ubiquitin specific protease 53) genes were found to be overexpressed in UC compared with CD cases. However, the function of many subtype discriminatory genes has not been identified yet. The best differentiating markers between UC and CD were cyclin-G2, tripartite motif-containing-31, TNFR shedding aminopeptidase regulator-1 and AMICA. The functional classification of the most differentially expressed genes was performed according to the analysis of the RMA top 100 genes in each disease type subgroup. The list of the top 100 genes for each analysis setting, including complete annotation, and the complete microarray dataset are shown as supplementary information (Supplemental Table 2). Hierarchical cluster diagrams of the subgroups, based on the RMA top 100 differentially expressed genes, can be seen in Fig. 3.

Gene ontology of selected features
The representative gene ontology categories are shown in Tables 2 and 3. We have also mapped the selected features to chromosomes (Supplemental Fig. 1).

Taqman validation
Selection criteria for genes were differential expression in the microarray analysis and the availability of validated TaqMan probes. Ten "literature" genes that have been described as CRC-related were also selected. The complete results of the TaqMan measurements are presented in Supplemental Table 3. Forty-six of the 52 measured genes correlated with the results obtained using Affymetrix microarrays at a significance of p < 0.05. The expression changes of the selected genes are summarized in Table 4. The mRNA expression levels of selected discriminatory genes measured by Taqman RT-PCR are presented in Supplemental Fig. 2. Global clustering of all samples using the Taqman validated genes was also performed (Supplemental Fig. 3). Normal and UC cases belong to two distinct clusters, but the clusters of CRC and adenoma cases are not clearly separated, demonstrating the expressional heterogeneity of CRC.

Discussion
Gene expression profiling of 52 colonic biopsy samples was done by whole genomic HGU133 Plus 2.0 microarrays in order to identify disease-specific gene expression marker groups for objective classification. We aimed to develop diagnostic mRNA expression patterns for the identification of adenoma and differently staged CRC, and to find the minimal list of genes suitable for the discrimination of different types of IBD. Examination of adenoma with/without dysplasia, CRC, IBD and normal samples in parallel can help to find condition-specific gene expression alterations with a lower risk of unspecificity due to methodical reasons. Comparative microarray analysis of biopsies from all of these kinds of colonic diseases has not been reported so far in the scientific literature. Oligonucleotide whole genomic microarray analyses of biopsy samples were found to be highly standardized and reproducible, and provided high quality array results regarding the array sensitivity, present percentage and GAPDH 3'/5' ratio. This type of analysis results in discriminative signatures, gives an insight into the pathophysiological background of colonic diseases, and additionally provides a data warehouse which can be further mined for in-depth pathway analyses. As recently described, joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies [33]. Therefore we used a one-stage genome-wide analysis to identify relevant gene expression signatures. For a classification problem comparable to our study a previous estimation suggested a required sample size of 51 subjects to detect a 2-fold change of expression level at alpha = 0.001 at the 90th percentile [41]. The main disease groups were individually compared to healthy controls. CRC samples were unequivocally distinguished according to the expression level of 13 genes. Six of them were validated by Taqman RT-PCR.
Among the discriminatory genes, lipocalin 2 (LCN2), collagen 4 alpha 1 (COL4A1) and aquaporin 8 (AQP8) were mentioned earlier as CRC-associated genes. The LCN2 transport molecule acts as a chemotactic agent and also regulates matrix metalloproteinase-9 activity. The AQP8 water channel protein is a marker for non-proliferative colonic epithelial cells, but it is not expressed by adenoma and CRC at the protein level. In agreement with the findings of Fischer et al., a lower AQP8 mRNA level was also found in CRC and adenoma samples in our study [13].
Adenoma cases were characterized and distinguished according to the expression changes of 27 overlapping genes. Seven of them (overexpressed CD44 and MET, and underexpressed GUCA2A, CXCL12, TRPM6, ABCA8, ADAMDEC1) were confirmed by Taqman RT-PCR. CD44 cell surface glycoprotein antigen is a receptor for hyaluronic acid, which can also bind osteopontin, EGFR, matrix metalloproteases and IGFBP3. The expression changes of CD44 can affect several different cellular pathways including EGFR-related proliferation and tumorigenesis, tumor tissue remodeling and immune processes. Hepatocyte growth factor receptor (MET), which was found to be overexpressed both in colon adenoma and CRC, may play an important role in colorectal tumorigenesis. Similarly to the results of Trovato et al., in our study an elevated c-met mRNA level was observed both in adenoma and CRC biopsies, but CRC samples showed lower c-met expression than adenomas. Reduced expression of c-met can be associated with the progression of adenoma into carcinoma [36]. GUCA2A (guanylin) plays a role in the regulation of ion transport in the colon. Expression of guanylin is downregulated in human intestinal adenomas; moreover, recent results suggest that loss of guanylin activity leads to or is a result of adenocarcinoma [8]. The chemokine ligand 12 (CXCL12), which was found to be underexpressed both in different carcinomas earlier [32] and in adenoma samples in our study, regulates cAMP production and ion transport in intestinal epithelial cells [12]. These data support the view that alterations in ion transport of the colon are involved in colorectal carcinogenesis. The exact cellular function of the TRPM6, ABCA8 and ADAMDEC1 gene products and their role in adenoma and tumor development have not yet been determined.
Several genes were found to show elevated mRNA levels along the adenoma-CRC sequence. TCF4 is a transcription factor involved in the Wnt signalling pathway, which is altered in over 90% of CRCs. TCF4 participates in the transcriptional regulation of genes associated with colon carcinogenesis including c-myc, cyclin D1, TCF1, PPARδ, MMP7 and MDR1. Tensin 1 (TNS1), which was also overexpressed in CRC compared to adenoma samples, can induce JNK and p38 activation leading to increased cell survival. RBMS1 is another gene that was detected as being upregulated in CRC compared to adenoma; it is also involved in the malignant transformation process. RBMS1 is a modulator of c-myc, deregulates cell cycle controls and leads cells towards transforming pathways [28]. SPARC (osteonectin) was detected as an overexpressed gene in CRC in several microarray-based studies [10,29,42]. It is thought to play an important role in tissue remodeling, angiogenesis, and tumorigenesis. Conflicting data have been published about the expression and function of insulin-like growth factor binding protein 3 and SPARCL1 in different types of cancers including CRC [19,20].
Ten discriminatory transcripts distinguish between IBD samples and normal tissue. Overexpressed interferon-induced genes are highly represented among the discriminatory genes. Interferon induced transmembrane protein 3 (IFITM3) was strongly expressed in severely inflamed colonic mucosa of UC both in our microarray analysis and in other studies [17]. Moreover, IFITM3 showed a high mRNA level in sporadic cancers and UC-associated cancers, therefore it may be a marker for the identification of a high cancer-risk group within UC. The interferon-induced discriminatory genes PSMB9 and UBE2L6 are connected with enhanced antigen processing and presentation. LCN2, which has been mentioned above as an upregulated CRC-associated gene, is also overexpressed in colonocytes and neutrophils in inflamed lesions of UC. Similarly to our findings, highly increased LCN2 mRNA levels were measured in UC samples in other microarray studies [11,24]. Alteration of epithelial magnesium absorption was also observed in our IBD samples, as TRPM6 (transient receptor potential cation channel, subfamily M, member 6) showed a lower mRNA level. CD cases are mainly characterized by increased expression of carbohydrate metabolism genes, while certain cell proliferation, apoptosis, immune regulation, transport, and ubiquitin-dependent protein catabolism genes were found to be overexpressed in UC compared with CD cases. However, the function of many discriminatory genes has not been identified yet. Significant overexpression of cancer-related genes (CEACAM1, -7, CD24, PDGFD) in UC is potentially important, considering reports of an increased risk of developing CRC in this disease [6,9]. To validate the marker properties of a given gene, homogeneous sample groups with high case numbers should be used for microarray analysis, and further experiments, or at least RT-PCR confirmation, are needed. We focused on individual markers and independent validation of markers.
Ninety-four percent of the 52 selected genes that were found to be over- or underexpressed were confirmed by Taqman RT-PCR.
In conclusion, in our study we were able to distinguish not just between normal, adenoma, CRC and IBD samples, but also among the different stages of CRC using only easily taken biopsy specimens. With a large number of samples one can establish principal gene lists that characterize distinct conditions, and if miniarrays become commercially available, the daily routine of diagnosis may become quicker and easier.

Figure 2. mRNA expression levels of selected discriminatory genes measured by Taqman RT-PCR. dCt is the expression value normalized to 18S ribosomal RNA.
Supplement Figure 3. Global clustering of all samples measured by Taqman RT-PCR according to the Taqman validated genes. Probably due to the heterogeneity of the CRC samples, they do not cluster together during global clustering using Taqman-generated results.