Towards a Holistic, Yet Gene-Centered Analysis of Gene Expression Profiles: A Case Study of Human Lung Cancers

Genome-wide gene expression profile studies encompass an increasingly large number of samples, posing a challenge to their presentation and interpretation without losing the notion that each transcriptome constitutes a complex biological entity. Much like pathologists who visually analyze information-rich histological sections as a whole, we propose here an integrative approach. We use a self-organizing map-based software, the gene expression dynamics inspector (GEDI), to analyze gene expression profiles of various lung tumors. GEDI allows the comparison of tumor profiles based on direct visual detection of transcriptome patterns. Such intuitive “gestalt” perception promotes the discovery of interesting relationships in the absence of an existing hypothesis. We uncovered qualitative relationships between squamous cell tumors, small-cell tumors, and carcinoid tumors that would have escaped existing algorithmic classifications. These results suggest that GEDI may be a valuable explorative tool that combines global and gene-centered analyses of molecular profiles from large-scale microarray experiments.


INTRODUCTION
The simultaneous measurement of expression levels of tens of thousands of genes in a biological sample enabled by DNA microarray technology has provided a new and powerful way to characterize the molecular basis of diseases such as cancer [1,2]. In the past decade, mRNA expression profiles of tumor tissues have been successfully used to distinguish tumor types or subtypes [3][4][5]. They also appear to hold great promise as a method for predicting clinical outcomes [6][7][8]. For example, gene expression profiles have been used to classify lung adenocarcinoma into subgroups that correlated with the degree of tumor differentiation as well as patient survival [9].
Gene expression profile analysis initially emphasized the identification of groups of genes that are differentially regulated in different experimental conditions or patient samples. Coexpression across a variety of samples implied coregulation or similar function [10,11]. An approach complementary to this "gene-centered" view is to take a "sample-centered" perspective in which one treats the genome-wide profiles of each sample as the entities to be classified with respect to their gene expression patterns. The goal here is to assign samples (rather than genes) to groups based on the high-dimensional molecular signature determined by the thousands of individual gene expression values. While the gene-centered perspective is useful for understanding the molecular pathways in which individual genes are involved, the sample-centered view is more relevant for biological and clinical questions, such as in the study of the developmental and pathogenetic relationship between tissues as a whole [12,13] or the identification of prognostic or diagnostic signatures of tumors based on entire gene expression profile portraits [4,14,15,16,17,18,19].
The notion of "molecular portraits" has gained importance as gene expression profiles for increasingly large numbers of samples or conditions (eg, experimental variables, patients, treatment groups, etc) have become available [18,20,21]. However, the analysis of large numbers of gene expression profiles as integrated entities poses a challenge in terms of how to best organize and graphically present the high-dimensional data without loss of the notion of an individual profile as an independent entity. It would be desirable to capture the global picture of sample clusters within one visual representation while simultaneously presenting the specific expression pattern within each individual sample, thereby also allowing gene-specific analysis.
Current representations, such as the widely used heat maps in two-way hierarchical clustering [22,23] or coordinate systems in principal component analysis (PCA), multidimensional scaling (MDS) and their variants [24][25][26], compress the expression profile information of a sample into a single quantity, such as a scalar value for the distance (dissimilarity) between samples, a branch in a dendrogram, a narrow column in a heat map, or a point in reduced-dimensional space. Such aggregate displays discard possibly relevant information immanent in the complex, higher-order (system-level) genome-wide expression pattern. This intrinsic but hidden information reflects the collective behavior of genes orchestrated by genome-scale gene regulatory networks that govern cell behavior [27]. As pathology and radiology teach us, the implicit visual cues present within a complex image (eg, histological section, radiograph) cannot be reduced to a set of numerical variables without loss of system-level information content. Thus, it is possible that some "irreducible" information contained within high-dimensional gene profiles of patient or experimental samples may be lost in current clustering and representation methods.
In the absence of specific questions or hypotheses, it would therefore be desirable to be able to directly compare microarray results of individual tumor samples with their complete feature-richness in the same "holistic" way as pathologists compare histological tumor samples, namely, based on human gestalt perception [28]. In contrast to histological patterns, the thousands of expression values in a microarray measurement are too dense and irregular to be directly interpreted in a holistic manner. Hence, they must be presented in a form appropriate for human pattern recognition without discarding the global, higher-order information.
Self-organizing maps (SOMs) have the capacity to display information-rich diagrams. In the case of microarray data they can present individual samples as an entity and, at the same time, display high-resolution patterns within the transcriptome. A self-organizing map is a neural network algorithm for unsupervised machine learning with a strong visualization capability [29]. In brief, it assigns a set of N input objects (eg, genes) to a number K (K < N) of rectangular or hexagonal "tiles" (SOM nodes), each of which represents a cluster of objects (genes), arranged so as to form a coherent pattern within a two-dimensional "mosaic" (SOM grid). The patterns arise because the distances between the tiles on the mosaic are a function of the similarity between the gene clusters that the tiles represent, with the most similar clusters being adjacent to each other in the mosaic.
Early applications of SOMs for visualization of gene expression profiles emphasized the gene-centered perspective (clustering of genes) and used each tile to represent one cluster of genes in order to identify gene clusters with interesting expression patterns or to link them to gene functions [30]. As in k-means clustering [31], the number of clusters K is chosen in this approach to approximate the expected number of gene clusters, for example, K = 12 on a 3 × 4 node grid. Other studies used SOM in the sample-centered mode to map individual tumor samples onto the SOM grid and thereby classify tumor samples into a small number of diagnostic or prognostic groups [32,33]. In both cases, an entire experiment consisting of multiple samples (expression profiles) was represented by one single SOM grid, and the sample-specific visualization capabilities of SOM were not explored. In another study, SOMs were used in the gene-centered mode to analyze lymphoma samples, but the number of clusters (K = 22 × 14) was much larger than the expected number of biological clusters. This use of SOM generated "high-resolution" mosaics, one for each sample in an experiment. The characteristic SOM mosaics contained coherent patterns generated by the colored tiles ordered so as to reflect the clustered gene expression profile of the individual samples [34]. But while this approach used the visual representation of SOM, it still focused on finding subsets of genes for classifying tumors. In these cases the SOM maps were used as graphical representations mostly to illustrate a particular algorithm of analysis, much as dendrograms serve to give evidence of hierarchical clustering, but were not actually read by the human eye to obtain specific information. Instead, we propose that SOM displays can be specifically treated as new, complex objects for a next level of analysis, namely, visual gestalt recognition. Thus, we do not use computer algorithms in the sense of "artificial intelligence," but more as "intelligence enhancement" for the human brain in the holistic comparison of the transcriptomes.
To enable such an integrative analysis based on visualization of each tumor sample as a unique and complex molecular portrait, we adapted GEDI, a SOM-based tool developed for visualizing the dynamics of genome-wide gene expression profiles [35], to represent "static" microarray samples as two-dimensional high-resolution SOM mosaics. Using published gene expression profiles from a large set of lung tumor samples [5], we offer a first assessment of the usefulness of this type of holistic visual analysis of tumor gene expression profiles. These studies reveal that human gestalt perception can lead to discovery of novel biological features without a preconceived hypothesis, and uncover new relationships between lung tumor subtypes that had previously escaped analysis using conventional algorithmic classification techniques [5,36].

GEDI analysis software
GEDI is a bioinformatics software package that was originally developed to visualize multiple parallel time courses of gene expression profiles (or other high-dimensional molecular portraits) [35]. In the currently available version it uses unsupervised machine learning algorithms based on SOM [29] to cluster N genes into K "miniclusters" (or metagenes, see below) and map the results onto two-dimensional displays, one for each profile.
The SOMs assign similarly behaving genes to the same cluster k (k = 1, 2, . . . , K) and place similarly behaving clusters in close vicinity to each other on a two-dimensional, rectangular SOM grid of a × b nodes, where a × b = K. Thus, the objects that are being clustered are gene vectors $[x_i^1, x_i^2, \ldots, x_i^M]$ (i = 1, 2, . . . , N), where $x_i^j$ represents the expression value of gene i in sample j (with j = 1, 2, . . . , M). Once the SOM has assigned all the genes to K miniclusters, each minicluster is represented by a metagene vector $c_k = [y_k^1, y_k^2, \ldots, y_k^M]$ (k = 1, 2, . . . , K), where $y_k^j$ is the centroid value of minicluster k in sample j. To visualize one microarray sample j as one mosaic, GEDI slices the clustered data volume consisting of the bundle of the K metagene vectors $c_k$ into slices j across all the K metagenes to create individual SOM mosaics for each sample j. Each mosaic j displays in each of its K tiles k the jth component $y_k^j$ of each metagene vector $c_k$ (see Figure S1 in supplementary information; available online at DOI 10.1155/JBB/2006/69141). The value of $y_k^j$ is reflected by the color of the tile. Since the SOMs arrange the metagenes on the grid based on similarity of behavior in the various samples, the K tiles collectively create a coherent visual pattern for each microarray j. In accordance with previous usage [12,35], such SOM mosaics that display a characteristic visual pattern for each individual expression profile representing a sample are referred to as "GEDI maps." Each corresponding tile k on each GEDI map represents the same metagene, and hence, the same minicluster of genes.
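The clustering and slicing steps described above can be illustrated with a minimal sketch. It is not the GEDI implementation: the training schedule (linear decay of learning rate and neighborhood radius), the random initialization, the function names `train_som`, `best_node`, and `gedi_like_maps`, the toy data, and the choice to leave empty miniclusters as `None` are all assumptions made for illustration.

```python
import math
import random


def best_node(weights, x):
    """Index of the SOM node whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(weights[k], x)))


def train_som(genes, rows, cols, iterations=2000, seed=0):
    """Train a rectangular rows x cols SOM on gene vectors of length M."""
    rng = random.Random(seed)
    m = len(genes[0])
    # Assumption: nodes are initialized with randomly drawn gene vectors.
    weights = [list(rng.choice(genes)) for _ in range(rows * cols)]
    coords = [(k // cols, k % cols) for k in range(rows * cols)]
    for t in range(iterations):
        frac = t / iterations
        rate = 0.5 * (1.0 - frac)                            # learning rate decays
        radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5  # neighborhood shrinks
        x = rng.choice(genes)
        br, bc = coords[best_node(weights, x)]
        for k, (r, c) in enumerate(coords):
            grid_d2 = (r - br) ** 2 + (c - bc) ** 2
            h = math.exp(-grid_d2 / (2.0 * radius ** 2))     # Gaussian neighborhood
            for j in range(m):
                weights[k][j] += rate * h * (x[j] - weights[k][j])
    return weights


def gedi_like_maps(genes, rows, cols):
    """Cluster genes into K = rows*cols miniclusters and slice by sample."""
    weights = train_som(genes, rows, cols)
    members = [[] for _ in range(rows * cols)]
    for i, g in enumerate(genes):
        members[best_node(weights, g)].append(i)
    m = len(genes[0])
    # Metagene k = centroid of the genes in minicluster k (None if empty).
    metagenes = [
        [sum(genes[i][j] for i in idx) / len(idx) for j in range(m)] if idx else None
        for idx in members
    ]
    # Mosaic j shows component j of every metagene at its fixed grid position.
    mosaics = [
        [[None if metagenes[r * cols + c] is None else metagenes[r * cols + c][j]
          for c in range(cols)] for r in range(rows)]
        for j in range(m)
    ]
    return members, metagenes, mosaics


# Toy data (hypothetical): 60 genes x 4 samples with two opposite behaviors.
rng = random.Random(1)
patterns = ([1, 1, -1, -1],) * 30 + ([-1, -1, 1, 1],) * 30
genes = [[base + rng.gauss(0, 0.2) for base in p] for p in patterns]
members, metagenes, mosaics = gedi_like_maps(genes, rows=3, cols=2)
```

Because every mosaic uses the same tile for the same metagene, the four toy mosaics can be compared tile by tile, which is the property the GEDI maps rely on.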
The original Matlab-based prototype program [35] was redesigned and rewritten in Java to generate a user-friendly, platform-independent program with improved performance, stability, and a convenient user interface. Results presented here were obtained with GEDI version 3.0 that is freely available to members of the academic community for noncommercial use and can be downloaded via the web (http://web1.tch.harvard.edu/research/ingber/GEDI/gedihome.htm). This new program version contains a series of added functionalities that facilitate comparison of samples and retrieval of gene-specific information for individual genes that exhibit interesting patterns. These functions include real-time navigation through both the sample and gene dimension to view either a sample or gene as an individual object. With one mouse click, the name, functional annotations, and behavior in sample space of every individual gene can be retrieved directly. The new version also allows multiple result output formats and exposes the internal parameters for expert users to optimize the SOM.

Dataset
Gene expression profile data from normal lung and pulmonary tumors from the previous work of Bhattacharjee and coworkers [5] (http://www.broad.mit.edu/mpr/lung) were used. The data were obtained as Affymetrix array raw image (DAT) files, then analyzed and scaled to a target intensity of 1500 using the microarray suite (MAS) 5.0 program (Affymetrix). A total of M = 25 samples were used in this analysis, comprising 4 different tissues: squamous cell lung carcinoma (Sq, n = 6), pulmonary carcinoid (Car, n = 6), small-cell lung carcinoma (SmC, n = 6), and normal lung (Lung, n = 7). Thus the input data matrix for clustering was (N = 12562) × (M = 25).

Preprocessing of data
The N × M data matrix was log2-transformed to obtain a normal distribution of the originally log-normally distributed "signal" values and thereby prevent bias by outlier genes in the clustering. Each sample was standardized to the z-score to further minimize global sample-to-sample variability due to external factors. The resulting value $x_i^j$ for gene i in sample j was used for further calculations. To avoid bias by explicit selection of genes that can differentiate between the tissues, we present here an analysis based on the unfiltered list of 12562 genes. Although GEDI performed well without filtering, a prefiltering step (eg, removing genes that never change significantly in any sample) in general improved the performance of sample clustering, as is the case with other clustering algorithms.
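The two preprocessing steps can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name `preprocess` and the toy values are hypothetical, all signal values are assumed to be strictly positive, and the population standard deviation is used for the z-score (the text does not say which estimator was applied).

```python
import math
from statistics import mean, pstdev


def preprocess(matrix):
    """log2-transform an N x M matrix (rows = genes, columns = samples),
    then standardize each sample (column) to z-scores."""
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    logged = [[math.log2(v) for v in row] for row in matrix]
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        column = [logged[i][j] for i in range(n_genes)]
        mu = mean(column)
        sigma = pstdev(column)  # assumption: population SD
        for i in range(n_genes):
            result[i][j] = (column[i] - mu) / sigma
    return result


# Toy signal values (hypothetical): 4 genes x 3 samples.
# Samples 2 and 3 are sample 1 scaled globally by 2 and by 0.5.
raw = [
    [1500.0, 3000.0, 750.0],
    [200.0, 400.0, 100.0],
    [6000.0, 12000.0, 3000.0],
    [50.0, 100.0, 25.0],
]
z = preprocess(raw)
```

In this toy case the three samples differ only by a global scaling factor, so their standardized profiles coincide, which illustrates how the per-sample z-score removes global sample-to-sample variability.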

Analysis by GEDI and hierarchical clustering
The data were analyzed using the program GEDI [35] (http://web1.tch.harvard.edu/research/ingber/GEDI/gedihome.htm) (Version 3.0) and by hierarchical clustering. In the GEDI analysis, a 31 × 30 grid configuration of the SOM was used, giving rise to 930 miniclusters. For specific parameters, see supplementary information. Hierarchical clustering was performed with the program ClustanGraphics 6.0 (Clustan Ltd, Edinburgh, Scotland; http://www.clustan.com) [37]. The clustering was performed in the "sample dimension," using Euclidean distance as a (dis)similarity measure between globally normalized samples and the "average linkage" method to build the dendrogram [37].
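For readers unfamiliar with average linkage, the following minimal sketch clusters sample profiles using Euclidean distance and the average of all pairwise distances between cluster members. It is a plain illustration of the technique, not the ClustanGraphics implementation; the function names and the toy profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sample profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering in the sample dimension.

    Returns the list of merges as (members_a, members_b, distance),
    in the order in which they were performed.
    """
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average of all pairwise distances between the two clusters
                total = sum(euclidean(profiles[p], profiles[q])
                            for p in clusters[a] for q in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# Toy profiles (hypothetical): 4 samples, each measured on 3 genes.
samples = [
    [0.0, 0.1, 2.0],    # sample 0
    [0.1, 0.0, 2.1],    # sample 1
    [3.0, 3.1, -1.0],   # sample 2
    [3.3, 3.0, -1.3],   # sample 3
]
merges = average_linkage(samples)
```

The merge distances returned here correspond to the branch heights of the dendrogram; the left-to-right ordering of the leaves is not determined by the algorithm.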
The correlation matrix $C_{\text{gene}}$ of size M × M = 25 × 25 was calculated from the original gene expression data matrix of size N × M = 12562 × 25. The correlation coefficient $r_{jk}$ between the samples j and k was calculated as

$$r_{jk} = \frac{\sum_{i=1}^{N} (x_i^j - \bar{x}^j)(x_i^k - \bar{x}^k)}{\sqrt{\sum_{i=1}^{N} (x_i^j - \bar{x}^j)^2 \, \sum_{i=1}^{N} (x_i^k - \bar{x}^k)^2}},$$

where i is the index of the N gene vectors, and $\bar{x}^j$ and $\bar{x}^k$ are the mean gene expression values of samples j and k, respectively. Similarly, the correlation matrix $C_{\text{metagene}}$ of size M × M = 25 × 25 was calculated from the metagene data matrix of size K × M = 930 × 25 exported from the GEDI program.
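A minimal sketch of this calculation is given below, assuming the standard Pearson correlation coefficient between sample columns. The function names and toy matrix are hypothetical; the same function applies to a genes × samples matrix or to a metagenes × samples matrix.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equally long vectors."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def sample_correlation_matrix(matrix):
    """matrix: rows = genes (or metagenes), columns = samples.
    Returns the M x M matrix of correlations between samples."""
    columns = list(zip(*matrix))
    m = len(columns)
    return [[pearson(columns[j], columns[k]) for k in range(m)] for j in range(m)]


# Toy matrix (hypothetical): 4 genes x 3 samples.
data = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 0.0],
]
C = sample_correlation_matrix(data)
```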

GEDI analysis of static gene expression profiles
Starting from an N × M matrix of data from the analysis of N genes across M samples (mRNA expression profiles), GEDI transforms each sample's expression profile into a map that contains a visually recognizable color pattern, referred to as a "GEDI map" [35]. These maps are mosaics generated by self-organizing maps (SOMs) (see Materials and Methods). In brief, this is achieved by (i) a moderate reduction of dimensionality with respect to the genes, from N genes into K gene clusters, which are represented by "metagenes," and (ii) a spatial reordering of these metagenes onto a two-dimensional space represented by an a-by-b grid (with a × b = K) using SOM [29]. Each mosaic represents the gene expression profile of a sample [34]. The "expression values" of the metagenes are the centroids of the corresponding clusters [30], and each is displayed as one of the K colored "tiles" in the mosaic. Since the SOMs assign the same metagene to the same tile in each mosaic, the mosaics can be compared to each other. Moreover, since metagenes that exhibit a similar behavior with respect to the M samples are placed next or close to each other on the mosaic, the tiles collectively create a coherent pattern on each mosaic that is characteristic for each sample [29,38]. Importantly, in contrast to conventional cluster analysis using k-means or SOM, where typically K < 30 clusters [30], here K is many-fold higher than the expected or desired number of biologically significant clusters, and hence, each of the K metagenes can be viewed as representing a "minicluster" of just a few genes (with a typical median of around 10 genes). A minicluster is thus not meant to represent some biologically relevant gene cluster. Instead, the SOM algorithm is used to "pixelate" the expression profile into K pixels and rearrange them, which is why K is required to be high: typically, K is in the hundreds to thousands [12,13]. These miniclusters consequently contain an order of magnitude fewer genes than in conventional gene clustering [30] and are hence more homogeneous, warranting the representation as a metagene. Accordingly, the patterns formed by the metagenes on a GEDI map will be referred to as "metapatterns." Based on the characteristic visual metapatterns, GEDI maps allow the direct comparison of the biological samples, as well as immediate identification of biologically interesting groups of genes.

Visual identification of lung cancer types
The genome-wide gene expression profiles of 25 samples of normal lung tissue (Lung) and different pulmonary tumors, carcinoid (Car), squamous cell carcinoma (Sq), and small-cell cancer (SmC) [5], were visualized as 25 GEDI maps, each consisting of a 31-by-30 mosaic, representing 930 miniclusters. Discrete differences in patterns of gene expression between normal lung and tumor samples are immediately detected upon visual inspection of the GEDI maps (Figure 1). Each sample exhibits characteristic spatial and color patterns, reflecting genome-wide transcriptional behavior of the respective tissue sample. The visual patterns of the GEDI maps of these different tissue samples remained distinct when the analysis was performed with a wide range of SOM parameters and the SOMs were run to convergence (not shown).
Inspection of GEDI maps allows a straightforward classification of the samples into subgroups without the aid of a clustering algorithm, simply based on the visual differences in the metapatterns. Samples grouped together with members of the same category, with the exception of one outlier, the small-cell lung cancer sample SmC6, whose GEDI map looked different (Figure 1). As previously demonstrated [5], hierarchical cluster analysis reliably arranged these lung tumor samples into distinct clusters, which corresponded well to the different clusters identified using GEDI (Figure 1).
Figure 2: A hierarchical dendrogram computed from the same data as in Figure 1, but with randomly permuted genes. Mixed tissue types can be seen in the same branches of the tree. On the right, random patterns in GEDI maps from three representative samples of each tissue type.
A known drawback of hierarchical clustering is that the linear arrangement of the clustered objects (samples) at the terminal branches of the dendrogram can be presented in multiple ways (orderings). This can make the unbiased global assessment of intersample similarity across all the samples difficult. Although this arbitrariness can be eliminated by using a one-dimensional SOM, k-means clustering, or another optimization algorithm to achieve some objective branch ordering [39,40], this method is not often used. By contrast, because there is no a priori clustering structure in GEDI, sample clustering is directly obvious and robust and avoids a biased suggestion of relatedness, a known problem with hierarchical clustering. Another shortcoming of hierarchical clustering is that the hierarchical relationship displayed in the dendrograms does not necessarily have a biological meaning. For example, hierarchical clustering forces the randomly permuted data into a tree with a similar overall structure (albeit with a higher distance score between the branches) even though the "samples" now have random attributes and have no meaningful relation (Figure 2). In contrast, in this case the GEDI maps immediately reveal the poor quality of clustering: the samples that were clustered together by hierarchical clustering do not exhibit any consistent global pattern (Figure 2). Therefore, GEDI also provides a first-line sample-centered quality control for traditional clustering methods.
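The permutation control can be sketched as follows. The text does not specify how the genes were permuted; this sketch assumes that expression values were shuffled independently within each sample, which preserves each sample's value distribution while destroying gene identity. Function names and the toy data are hypothetical.

```python
import math
import random


def pearson(u, v):
    """Pearson correlation coefficient between two equally long vectors."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def permute_within_samples(matrix, seed=0):
    """Shuffle the gene order independently in every sample (column).
    Assumption: this is one plausible reading of 'randomly permuted genes'."""
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*matrix)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# Toy data (hypothetical): 200 genes x 2 highly similar samples.
n_genes = 200
matrix = [[float(i), float(i + i % 3)] for i in range(n_genes)]
shuffled = permute_within_samples(matrix)

before = pearson(*zip(*matrix))    # similarity of the original samples
after = pearson(*zip(*shuffled))   # similarity after permutation
```

A hierarchical clustering algorithm will still return a tree for the shuffled matrix, because it always produces one; the collapse of the sample-to-sample correlation is what indicates that the tree no longer reflects a meaningful relation.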
Because GEDI maps provide a global view of the gene expression profiles of each sample, they immediately present an explanation for why a particular sample behaves as an outlier (when sample diagnosis is known) and which genes account for that behavior. For example, the dramatic difference between the GEDI map of an outlier, SmC6 (Figure 1), relative to samples within the cluster of nominal small-cell lung cancers immediately reveals that SmC6 deviates from the other small-cell carcinomas and the different pattern of tiles explains why.

Fidelity of GEDI maps in representing tissue transcriptomes
In addition to visually comparing GEDI maps as individual entities, one can extract the numerical centroid values $y_k^j$ in sample j of each metagene k to analyze GEDI maps quantitatively. By utilizing the metagenes instead of the "real" genes to characterize a transcriptome, the complexity is reduced, in our case from the original data matrix N × M = 12562 × 25 to 930 × 25.
To evaluate the "fidelity" of GEDI mosaic patterns in representing the expression profiles established by all the genes, we calculated the correlation coefficients $r_{jk}$ for every pair of samples (j, k) using either (1) the expression data for all of the individual genes or (2) the metagenes. If the GEDI mosaic patterns of metagenes faithfully represent the genome-wide gene expression profiles, the correlation coefficients for all sample pairs calculated in these two ways will be similar. In fact, the GEDI patterns preserved the correlation between samples obtained from the real gene expression data (Figures 3(a), 3(b)). The correlations of the values and of the ranks of $r_{jk}$ between the two methods were 0.909 and 0.960, respectively.
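One way to carry out this comparison is sketched below. It assumes that each distinct sample pair is counted once (upper triangle of the correlation matrix), that the "correlation of the values" is a Pearson correlation, and that the "correlation of the ranks" is a Spearman-type correlation computed on average ranks. These choices, the function names, and the toy 4 × 4 matrices are assumptions for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equally long vectors."""
    mu_u = sum(u) / len(u)
    mu_v = sum(v) / len(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for q in range(pos, end + 1):
            result[order[q]] = avg
        pos = end + 1
    return result


def upper_triangle(c):
    """Correlations of all distinct sample pairs (j < k)."""
    return [c[j][k] for j in range(len(c)) for k in range(j + 1, len(c))]


def fidelity(c_gene, c_metagene):
    """Agreement between gene-based and metagene-based sample correlations."""
    a = upper_triangle(c_gene)
    b = upper_triangle(c_metagene)
    return pearson(a, b), pearson(ranks(a), ranks(b))


# Toy correlation matrices (hypothetical values) for 4 samples.
c_gene = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.15],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.15, 0.8, 1.0],
]
c_metagene = [
    [1.0, 0.95, 0.1, -0.1],
    [0.95, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.85],
    [-0.1, 0.0, 0.85, 1.0],
]
value_r, rank_r = fidelity(c_gene, c_metagene)
```

In the toy matrices the metagene-based correlations span a wider range than the gene-based ones but preserve their ordering, so the rank correlation is perfect while the value correlation is slightly lower.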
Interestingly, the values of the correlation coefficients (profile similarity between samples) calculated from metagenes spanned a considerably broader range than those from the "real" gene expression dataset, as apparent in the histograms of the correlation values (Figures 3(c), 3(d)). This is also manifested in the better "color contrast" of the correlation matrix color map (Figure 3(b) versus 3(a)). Thus, it appears that the discriminating power of this technique using metagenes may be increased relative to standard microarray analysis. The differences in the average correlation between sample pairs within the same tissue groups ("intratissue pairs") and across tissue groups ("intertissue pairs") were considerably larger when metagenes (0.127, 95% confidence interval: 0.109 to 0.145) were used for calculating the correlation than when real genes (0.069, 95% confidence interval: 0.058 to 0.080) were used (Figure 3(e)). It remains to be determined, using larger test sets of tissue samples from patient groups with established diagnoses, whether metagene-based analysis consistently has greater discriminating power.
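The difference between intratissue and intertissue correlations can be computed as in the following sketch. The function name, the labels, and the toy correlation values are hypothetical, and the confidence intervals reported above are not reproduced here.

```python
def intra_inter_gap(corr, labels):
    """Mean correlation of same-tissue pairs minus mean correlation of
    different-tissue pairs, counting each sample pair once."""
    intra, inter = [], []
    for j in range(len(labels)):
        for k in range(j + 1, len(labels)):
            if labels[j] == labels[k]:
                intra.append(corr[j][k])
            else:
                inter.append(corr[j][k])
    return sum(intra) / len(intra) - sum(inter) / len(inter)


# Toy correlation matrix (hypothetical values) for 4 samples of 2 tissues.
labels = ["Sq", "Sq", "Car", "Car"]
corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.15],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.15, 0.8, 1.0],
]
gap = intra_inter_gap(corr, labels)
```

A larger gap means that samples of the same tissue are more clearly separated from samples of other tissues, which is the sense in which the text speaks of discriminating power.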
In summary, the GEDI maps based on metagenes faithfully recapitulate gene expression profiles of the entire gene dataset despite dimension reduction. Thus, the visual patterns capture the real similarity relationships among samples with a high fidelity.

Second-level GEDI maps
To further validate how well metapatterns can represent the transcriptome, we applied a "second-level" GEDI analysis to categorize GEDI maps automatically using the (N = 930 metagenes × M = 25 samples) matrix as input data. For comparison, we also performed a PCA on the original gene data matrix (with sample columns as the "objects" and gene rows as the "attributes"). The second-level GEDI analysis differed from the first-level GEDI analysis performed on the (N = 12562 real genes × M = 25 samples) matrix in that the objects of clustering were the samples rather than the genes, and thus a smaller SOM grid was used. Given the discriminatory power of the metagenes, using them as input variables may improve the quality of sample clustering.
The 25 samples were assigned to a 5 × 5 SOM grid according to their metagene expression profiles. In the resulting second-level GEDI map, the tissue samples (the first-level GEDI maps) of the same diagnosis were grouped within the same neighborhood of the map (Figure 4(a)). The map distances from each tumor-specific sample cluster to that of normal lung (Lung) were roughly similar, while among the tumors, the carcinoid (Car) and squamous cell carcinoma (Sq) samples were most distant from each other, with small-cell lung cancer (SmC) in between.
Interestingly, the spatial distribution of these samples in the two-dimensional second-level map was very closely mirrored in the PCA in which the samples were projected onto the plane spanned by the first two eigenvectors (Figure 4(b)). There was good agreement even with respect to the relative position of the individual samples within each tumor and tissue type (Figure 4(a) versus 4(b)).
Importantly, such information revealed by the 2D sample plane, be it the SOM grid of the second-level GEDI or the PCA plane, can be directly read from the metapatterns of the GEDI maps. Visual inspection of the GEDI maps readily confirms the notion that Sq2 displays significant feature similarity to the SmC samples based on the fine structure of the patterns of upregulated genes. Specifically, the GEDI metapattern showed that Sq2 lacked the extension of the red areas (highly expressed genes) from the right half into the upper-left quadrant of the GEDI map that is characteristic for the other Sq samples (Figure 4). Interestingly, this group of metagenes that was not expressed in Sq2 contained multiple keratin-related genes, consistent with the squamous cell origin of these tumors. Without the GEDI maps, the samples would be represented by dots in the PCA which would be identified solely by their position in the abstract eigenvector space. Thus, GEDI allows the rapid toggling between gene-centered and sample-centered perspectives, which is an important feature for an integrative yet gene-specific analysis.

Qualitative differences between gene profiles
Like small-cell carcinoma, lung carcinoid tumors are classified as (low-grade) neuroendocrine tumors [41], while squamous carcinoma appears to be unrelated to this group. However, in both hierarchical clustering and PCA, SmC was closer to Sq than to Car (Figures 1 and 4), which is consistent with the idea that small-cell lung carcinoma may have an epithelial origin [42], but competes with the notion of the common neuroendocrine property of SmC and Car.
To examine this dualism we used GEDI to analyze the relationship between these three pulmonary tumors and normal lung to compare not only by how much but also how each of these tumors qualitatively differed from normal lung tissue and from each other. The GEDI software environment allows the user to easily perform algebraic operations on whole mosaic patterns based on metagene expression values, and for instance to calculate "average mosaics" from a group of samples with the same diagnosis or "difference mosaics" to reveal differential expression patterns between two samples (or averages of two groups). Here we obtained "GEDI difference maps" (Figure 5(a)) by subtracting the averaged GEDI maps of normal lung samples from that of SmC, Car, or Sq, respectively. The red areas in the difference maps indicate genes that were upregulated in these tumors compared to normal lung tissue. The outlined areas on the maps represent four islands (labeled a, b, c, d in Figure 5(a)) that contain the top 5% differentially expressed genes in SmC versus Lung.
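The algebra on mosaics described above amounts to tile-wise averaging and subtraction. The sketch below illustrates it on toy 2 × 2 mosaics; the function names and values are hypothetical. Note two simplifying assumptions: the selection ranks tiles (metagenes) rather than individual genes, and it considers only upregulation, whereas the text refers to the top 5% differentially expressed genes.

```python
def average_map(maps):
    """Tile-wise mean of several mosaics of identical shape."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]


def difference_map(map_a, map_b):
    """Tile-wise difference map_a - map_b."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(map_a, map_b)]


def top_tiles(diff, fraction=0.05):
    """Grid positions of the tiles with the largest positive difference.
    Assumption: ranking is done on tiles, not on the genes they contain."""
    tiles = sorted(((v, r, c) for r, row in enumerate(diff)
                    for c, v in enumerate(row)), reverse=True)
    n = max(1, round(fraction * len(tiles)))
    return [(r, c) for _, r, c in tiles[:n]]


# Toy mosaics (hypothetical metagene values), two samples per group.
tumor_maps = [
    [[1.0, 0.0], [2.0, -1.0]],
    [[3.0, 0.0], [2.0, -1.0]],
]
normal_maps = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
diff = difference_map(average_map(tumor_maps), average_map(normal_maps))
islands = top_tiles(diff, fraction=0.25)
```

Because tiles keep their grid position across all mosaics, the selected positions can be outlined on every difference map, which is how shared and tumor-specific regions become directly comparable.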
These studies revealed that SmC and Sq share a set of features, representing a number of genes located within regions a-c that are commonly overexpressed in both tumors relative to normal lung. This is consistent with the vicinity of these two tumors in the dendrogram (Figure 1) and in the PCA sample plane (Figure 4); it is also in line with the proposed epithelial origin of small-cell lung cancer [42]. The specific genes represented by metagenes of the islands a-c included growth-related genes (involved in cell proliferation, cell cycle, DNA replication, etc). Such functional enrichment of genes in the "gene islands" underscores the biological meaning of pattern features in GEDI maps. Interestingly, the SmC samples, while globally close to Sq samples, shared with the Car samples the island d, which contained neuroendocrine-related genes (involved in synaptic vesicle, neuromuscular physiological process, etc), consistent with the neuroendocrine nature of small-cell lung carcinoma [41].
This example illustrates how GEDI can extract relationship features that are not revealed by traditional hierarchical clustering or any reduction of sample comparisons to a similarity metric. Specifically, while the three islands (a-c) that represent the regions of metagenes upregulated in SmC compared to normal lung were also found in Sq, they were absent from Car. Conversely, the island d that was enriched for the neuroendocrine genes was overexpressed in Car but not in Sq. Thus, the GEDI analysis exposed a novel facet of the relationship between the samples with respect to these signature gene clusters: SmC appears to be the union of the sets of Car and Sq (Figure 5(b)), sharing the gene cluster d with Car and the clusters a, b, and c with Sq. With respect to these growth-related gene islands, there was essentially no overlap between Car and Sq. Thus, despite the overall higher similarity between SmC and Sq, when considering the subfeature d with the neuroendocrine genes, SmC was closer to Car than to Sq. Such information on a qualitative relationship is lost in conventional clustering dendrograms that reduce relationships to a numerical similarity between two samples [42]. Without an a priori hypothesis, such qualitative relationships are almost impossible to identify in the widely used heat maps, but they immediately spring to the eye in the differential GEDI maps.

DISCUSSION
Genome-scale gene expression profiles are not simply high-dimensional sets of variables that provide an opportunity for multivariate statistical analysis. Instead, they are the biological manifestation of the constrained dynamics of the underlying complex and hierarchical gene regulatory networks that govern developmental potentials of cells and tissues [27]. Tumors arise from mutational rewiring of this molecular network and therefore display specific, coordinated deviations from the normal transcriptome patterns.
To visualize coherent, genome-scale alterations of the transcriptome structure, we used here an integrative visual representation for gene expression profiles. As a case study, we analyzed expression profiles of three lung tumor types. We show that by delegating the actual process of pattern recognition to human gestalt perception in the format of SOM-based GEDI mosaics, interesting features in the relationship between tumor types can be revealed. Specifically, we found that with respect to pathological deviation from normal gene expression, small-cell carcinoma represents the union of squamous cell carcinoma and carcinoids. Such information on higher-order transcriptome changes, which may be useful for understanding developmental relationships and differences in drug responsiveness between tumor types, springs to the eye in the GEDI maps, but would not have been revealed by conventional algorithms without explicitly asking the appropriate question.
Microarray-based molecular profiles are increasingly used to capture characteristic high-dimensional molecular "portraits" to identify diagnostic and prognostic groups in cancer. Most existing methods reduce complex relationships to a numerical value, typically a distance metric or a visual distance between points in a reduced-dimension space. While this is useful for explicitly extracting specific information, these methods may lose potentially useful, unanticipated information inherent in the high-dimensional expression profiles, such as particular higher-order patterns of expression. Similarly, even the search for multigene signatures [15] instead of a single marker gene to improve discrimination between diagnostic groups may miss some of the distributed ("holistic") information in the profiles. In fact, maximal accuracy of multiclass tumor classification may require that the predictor utilize all the genes [43].
The GEDI visualization software was developed to circumvent the problem of discarding implicit, potentially irreducible information inherent in genome-wide expression profiles in the absence of a specific hypothesis. It provides the opportunity for a holistic, yet molecular exploration of a set of gene expression profiles (or other high-dimensional data sets) that can be used to test existing tissue-level biological hypotheses [12] or establish new ones. Although GEDI uses a SOM algorithm at its core, it differs fundamentally from the traditional use of SOM to find biologically meaningful clusters [30,32,33,38]. The metagenes in GEDI are miniclusters that are smaller by an order of magnitude than the explicitly predefined clusters in the conventional cluster analysis, hence they are very tight and of high quality. The identification of biological clusters is not the result of the clustering algorithm per se, but is achieved at a later stage of analysis, namely, by visual inspection and gestalt perception of the metapatterns that emerge from the SOM-generated metagenes. Hence, ambiguities in clustering of samples are not built into the algorithm, but are subject to direct and interactive analysis by the interpreter.
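The idea of SOM-generated miniclusters (metagenes) arranged on a mosaic can be sketched in a few lines. This is a minimal, generic self-organizing map written for illustration and is not the GEDI implementation: the grid size, learning-rate schedule, Gaussian neighborhood, random toy data, and all function names are assumptions. In particular, the sketch colors each tile by its SOM weight vector, which only approximates a per-tile expression value.

```python
import math
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def best_tile(weights, profile):
    """Grid position whose weight vector is closest to the given profile."""
    return min(weights, key=lambda pos: sq_dist(weights[pos], profile))


def train_som(genes, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a rows x cols SOM on gene profiles (one value per sample)."""
    rng = random.Random(seed)
    # Initialise every tile from a randomly chosen gene profile.
    weights = {(r, c): list(rng.choice(genes))
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        decay = 1.0 - t / n_iter
        lr = lr0 * decay                # learning rate shrinks over time
        radius = radius0 * decay + 0.5  # neighborhood shrinks but stays positive
        x = rng.choice(genes)
        br, bc = best_tile(weights, x)
        for (r, c), w in weights.items():
            # Gaussian neighborhood: tiles near the winner move more.
            h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2 * radius ** 2))
            for k in range(len(w)):
                w[k] += lr * h * (x[k] - w[k])
    return weights


def assign_miniclusters(weights, genes):
    """Map each tile to the indices of the genes it best matches."""
    tiles = {pos: [] for pos in weights}
    for i, g in enumerate(genes):
        tiles[best_tile(weights, g)].append(i)
    return tiles


def mosaic(weights, sample, rows, cols):
    """Per-sample 'portrait': one value per tile for the chosen sample."""
    return [[weights[(r, c)][sample] for c in range(cols)] for r in range(rows)]


# Toy data (hypothetical): 60 genes measured in 3 samples.
data_rng = random.Random(1)
genes = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(60)]

weights = train_som(genes, rows=4, cols=5)
tiles = assign_miniclusters(weights, genes)
portrait = mosaic(weights, sample=0, rows=4, cols=5)
```

Because every sample is rendered on the same trained grid, a given tile always holds the same minicluster of genes, which is what makes the per-sample mosaics directly comparable by eye and lets the viewer go from a tile back to its genes via `tiles`.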
GEDI provides several technical benefits relative to existing high-dimensional data analysis methods.
(a) By presenting metapatterns, GEDI maps provide a visual engram of each sample's particular molecular profile, and hence, establish a molecular portrait in the very sense of the word, with a particular visual identity for each sample (eg, tumor type, patient, treatment condition).
(b) Although classification of samples into groups is achieved by human gestalt perception of the metapatterns, it can be supported by an algorithmic approach applied to the metagenes.
(c) The direct visual monitoring of the portrait of a sample allows GEDI to intercept algorithmic idiosyncrasies, such as the dependence of the branching structure of dendrograms on the particular tree-building algorithm used in hierarchical clustering.
(d) Despite a moderate dimension reduction, GEDI preserves most of the information richness of entire molecular portraits, allowing detailed, multivariate explorative comparisons between samples. This in turn can help define qualitative differences (in addition to measuring quantitative dissimilarity between samples) that may provide additional biological information on the relationships between samples.
(e) GEDI allows rapid and seamless switching between an integrative, sample-oriented analysis and the more traditional gene-centered analysis. This is facilitated by the interactive user interface, which permits retrieval of the genes that contribute to metapattern features of interest.
(f) Finally, using GEDI to compare samples and relate them to each other does not require specific knowledge of the underlying algorithm. It is thus an intuitive tool for nonbioinformaticians, such as pathologists and clinicians, who will increasingly confront microarray analysis. This is particularly relevant for explaining anomalies in cluster analysis, such as outlier samples: the reason for a misclassification is usually directly evident in the GEDI map and does not require familiarity with the details of the algorithm.
GEDI does not replace, but complements, existing algorithmic clustering methods. Although biologists have begun to use GEDI maps [12,13,44] to ask biological questions, further systematic elucidation of its application, notably of the choice of the optimal size of miniclusters, is needed. Moreover, GEDI is at present not yet optimized for visual discrimination, and other methods to create the mosaics can be envisioned. Future use of GEDI in studies of the genome-scale molecular signature of both normal and disease samples will ultimately help assess the true value of the "holistic" interpretation of molecular profiles that systems biology is advocating.
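The algorithmic idiosyncrasy mentioned under benefit (c), namely that the branching structure of a dendrogram depends on the tree-building algorithm, can be reproduced with a toy example. The sketch below is a naive agglomerative clustering on five hypothetical one-dimensional values (not data from this study), compared under single and complete linkage; the function name and the data are assumptions for illustration.

```python
def agglomerate(points, linkage, n_clusters):
    """Naive agglomerative clustering on 1-D points.

    `linkage` is min (single linkage) or max (complete linkage), applied to
    all pairwise distances between the members of two clusters.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(abs(points[a] - points[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the closest pair of clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical samples with steadily widening gaps between neighbors.
points = [0, 10, 22, 36, 52]

single = agglomerate(points, min, 2)    # chains neighbors together
complete = agglomerate(points, max, 2)  # prefers compact groups
```

With these values, single linkage isolates the last sample as an apparent outlier, whereas complete linkage splits the same data into a pair and a triplet. The data are identical in both runs; only the tree-building rule differs.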