Som-Based Class Discovery Exploring the ICA-Reduced Features of Microarray Expression Profiles

Gene expression datasets are large and complex, having many variables and unknown internal structure. We apply independent component analysis (ICA) to derive a less redundant representation of the expression data. The decomposition produces components with minimal statistical dependence and reveals biologically relevant information. Consequently, to the transformed data, we apply cluster analysis (an important and popular analysis tool for obtaining an initial understanding of the data, usually employed for class discovery). The proposed self-organizing map (SOM)-based clustering algorithm automatically determines the number of ‘natural’ subgroups of the data, being aided at this task by the available prior knowledge of the functional categories of genes. An entropy criterion allows each gene to be assigned to multiple classes, which is closer to the biological representation. These features, however, are not achieved at the cost of the simplicity of the algorithm, since the map grows on a simple grid structure and the learning algorithm remains equal to Kohonen’s one.


Introduction
High-throughput methods developed in the last decade that allow measuring and recording large amounts of data have offered new directions to biological and medical research. Transcriptional profiling using microarray technology makes possible the extraction of whole-genome expression data, which is in need of suitable analysis tools. Starting from virtually no literature a few years ago, this field has come to dominate many conferences and journals (Kohane et al., 2003).
Genes involved in different biological processes that are being affected as a result of experimental treatment exhibit specific patterns of variation in their expression. Therefore, genes' expression profiles 1 reflect genes responses to a particular set of experiments. Clearly, microarray measurements contain information about many different aspects of gene regulation and function. It is believed, and previous studies confirmed these assumptions (Lieberman, 2001;Lee et al., 2003), that by suitably decomposing the gene expression data, resulting components may correspond to distinct biological functions and the decomposition may help producing a less redundant representation of data.
Methods like Principal Component Analysis (PCA) and Independent component analysis (ICA) are usually employed to represent the original data in a manner that facilitates subsequent analysis SOM-based clustering algorithm for microarray expression profiles 597 by means of a suitable transformation 2 . However, as PCA is only a second-order method (i.e. it finds representations of the data using information only in the co-variance matrix) and constrains the direction vectors to be orthogonal, we can just decorrelate the components. ICA imposes statistical independence on the individual output components, using higher-order statistics (e.g. kurtosis), which means information not contained in the covariance matrix. If the underlying components are Gaussian, it is necessary to use PCA (Gaussian distributions can be completely determined by the information contained in the co-variance matrix), since the ICA model imposes the restriction of non-Gaussianity on all the components (with the possible exception of one). However, for non-Gaussian data, ICA should be employed, since PCA representations may lack important intrinsic information (Karhunen et al., 1997). Recent studies suggest that the distribution of gene expression data deviate from the Gaussian distribution and therefore ICA can be effectively applied in order to reveal biological relevant features (Kreil et al., 2003;Liebermeister, 2001).
In this study we employ independent component analysis (ICA), due to its capability to take into account higher-order dependencies in the data. The linear ICA is a method for the construction of a linear non-orthogonal coordinate system, in which the components are required to be statistically independent. However, it must be noted at this point that many practical implementations of ICA find components with minimal statistical independence (which is the case for our work also). The consideration of higher-order statistics in ICA allows meaningful information within the data to be exploited. Its capability of estimating underlying sources from an observed dataset has been successfully tested in blind-source separation problems, where the source signals are extracted from an observed data consisting of a mixture of the original signals. Since the expression of genes is controlled by a combination of underlying processes that regulate the way cells behave, it is assumed that ICA can be a useful tool for finding a meaningful representation of the gene expression data. The results of Liebermeister et al. (2002) have shown that the retrieved independent components correspond to putative biological processes in yeast cell cycles and may reveal characteristic differences between cell types.
Both PCA and ICA have been previously used in gene expression analysis as either a dimensionality reduction method only (Yeung et al., 2001b) or as a method to obtain a more meaningful representation of the data, and subsequently to analyse the biological significance of its decomposition (Liebermeister, 2001). In addition, simple clustering of the decomposed data was performed, in order to improve the prediction of functional classes for gene expression datasets with incomplete annotation (Lee et al., 2003).
Clustering of gene expression data, i.e. the grouping of genes into biologically meaningful groups according to their expression profile, is a good starting point for the analysis of a genome whose function remains largely unknown at this time. The simple underlying assumption in all gene-clustering techniques is that genes that appear to be expressed in a similar manner (have similar expression profiles) are in fact involved in the same process and share similar biological functions. The corollary to this assumption is that, although genes may distantly affect the function of other gene products, they fall into groups of more tightly regulated biological processes.
Different clustering techniques have previously been applied directly to microarray data (Brazma et al., 2002;Eisen et al., 1998;Mavroudi et al., 2002;Papadimitriou et al., 2001). Although the first clustering methods applied were mostly developed outside biological research, such as hierarchical clustering and K-means, they still revealed biologically relevant information. However, some of their characteristics are not suited for their use in clustering expression data. Perhaps the strongest impact on the final results of some methods is the necessity of specifying the expected number of clusters, as in the case of K-means, as well as the classic self-organizing map (SOM), a number that is almost impossible to predict in advance (Moreau et al., 2002). In the case of greedy hierarchical clustering, major drawbacks are the impossibility of relocating objects that may have been incorrectly clustered at an early stage, as well as solution nonuniqueness (the solution may be dependent on the order of presentation of the data to the algorithm) (Morgan et al., 1995).

A. Dragomir, S. Mavroudi and A. Bezerianos
Recently, many new clustering algorithms have started to tackle some of the limitations of the above-mentioned earlier methods, e.g. the selforganizing tree algorithm (Campos Marcos et al., 2001), but still work needs to be done. Based on our previous work on clustering of gene expression data (Mavroudi et al., 2002), we intended to develop a new method addressing a number of shortcomings of both classical and the newer clustering methods, more specific to microarray data.
We propose to use clustering of the gene expression profiles in the component space resulting from ICA decomposition. After decomposition, the original gene expression data is represented by a new set of variables with respect to a new set of basis vectors. Our method is based on the self-organizing map of Kohonen (2001), since the standard SOM algorithm has a number of properties that render it a candidate of particular interest as a basic framework for building more advanced algorithms for clustering applications. SOMs can be implemented easily, are fast, robust, and scale well to large datasets. They allow one to impose partial structure on the clusters and facilitate visualization and interpretation. In cases where hierarchical information is required, it can be implemented on top of a SOM, as by Vesanto et al. (2000). We modified the standard SOM algorithm in order to account for the following peculiarities of gene expression data, which could not be handled by the basic SOM: 1. The number of gene clusters is unknown, although the SOM requires a predefinition of this number. 2. Closely related to using a predefined number of clusters is the consideration that even genes which are not really co-expressed with other cluster members (and therefore represent noise) are forced to end up in one of the clusters, and thereby hamper the further analysis of the clusters. 3. The outcome of the clustering with the SOM relies completely on the distance between genes and the map nodes. However, it is known that genes with related functions do not always show the same high degree of similarity within their respective clusters, i.e. different functional classes have different degrees of heterogeneity. Some clusters may be more compact, while others may be more spread. Prior information about the functional categories of genes could aid the clustering, but is usually ignored. 4. Genes belong to more than one functional category, although the SOM cannot handle multilabelled data. 5. The number of genes in the different functional categories is unbalanced (e.g. large categories in yeast may contain more than 2200 genes and small categories as few as four genes). This makes it difficult for the basic SOM to map the rare classes into distinct clusters.
It is obvious that the above-mentioned problems are not exclusively SOM-specific, but affect many other clustering algorithms. Thereby their adjustment improves not only over the basic-SOM algorithm (Tamayo et al., 1999) but also over many other existing methods.
There have been several dynamically extendable schemes proposed recently. The dynamic topology representing structures (Si et al., 2000) adaptively grows the number of output nodes by applying a vigilance test; self-organized tree algorithms (Campos et al., 2001;Herrero et al., 2001) are treebuilding algorithms that use unsupervised learning to construct hierarchical representations of the data; the joint entropy maximization approach (Van Hulle et al., 2002) employs kernel methods in a learning algorithm that forms topological maps; Azuaje (2001) uses a two-layer neural network structure that implements fuzzy logic operations in order to achieve a number of pattern-matching and adaptation formations. The growing cell structures algorithm (Fritzke, 1995;Cheng et al., 2001) is based on SOM but replaces the basic twodimensional grid by a network of nodes whose connectivity defines a system of triangles.
Our approach has similarities to the abovedescribed methods, as it is driven by the same principle of expansion based on locally accumulated error. Perhaps the main difference in our approach is that we include prior functional knowledge in order to guide the map expansion and to design the expansion criteria. Of course, we are aware that in molecular biology classes may be defined but are neither complete nor completely trusted. This shouldn't be a reason to completely neglect the knowledge we have acquired so far, but constitutes a reason for avoiding an analysis which would be too tied to the original classes. In contrast to most supervised approaches, which do not give SOM-based clustering algorithm for microarray expression profiles 599 any insight into the internal structure of the data but only classify data into a priori given classes, we are still aiming at class discovery, while taking into account any a priori knowledge.
Comparing with other SOM-based semi-supervised methods in the literature (Sinkkonen et al., 2000;Kohonen, 2001), while these approaches are aimed for the cases where complete and reliable class information exists, we aim to exploit supervised information only when it exists, and thus we confront the problems caused by incomplete and unreliable class labelling of gene expression data. Additionally, we consider the fact that, due to the possible co-expression of genes with different groups of genes in response to the varying demands of the cell, genes do not always belong to a unique class, but may be multi-labelled. Finally, in contrast to the complexity of some of the mentioned schemes, we built simple algorithms that can be implemented easily and the training of the models is very efficient.
There are several aspects in which the current work improves over our previous method (Mavroudi et al., 2002). A substantial enhancement is provided by our choice to make use of the advantages offered by ICA. The linear transform not only produces suitable representations of the gene expression data by filtering out unwanted sources of variation and easing this way further cluster analysis, but also helps to reveal unobserved variables that highlight particular biological factors. Also of great importance is the effective framework provided for reducing data dimensionality, which has a great impact on the computational complexity of the algorithm. Further improvement of our algorithm is related to an efficient use of multifunctional labelling of the data in a more elaborate entropy-based error measure, as well as from the incorporation of a supervised dimension into our map convergence criteria. These provisions are discussed in detail in the following sections, with current results demonstrating the superiority of the present approach and its potential, not only as a clustering but also as a classification device.

Methods
A simple representation of gene expression data would be that of a matrix X ∈ nxm , whose rows represent profiles of n genes and the columns contain the genes' expression measurements over m conditions. Each gene profile has a class label, describing its gene's biological function. Usually the data dimensions are very large (thousands to tens of thousands × tens to hundreds), therefore making it a challenge for the data analysis to find a suitable representation of the data such that the transformed variables give information otherwise hidden in the dataset.
Decomposition refers to a process that transforms the input data space into a space designed in such a way as to allow a meaningful representation of the data, which captures most of the intrinsic information content. Usually, therefore, the input data space undergoes a dimensionality reduction as the projected space is constructed (since redundant information is eliminated). More formally, this means that each m-dimensional data vector x of the original data space can be represented by d numbers, with d < m.

Gene expression with ICA
Using ICA decomposition we can model the gene expression data matrix as a matrix product: where S is the matrix containing the data representation in independent component space. The model represents the original data by new variables (columns of S) or, alternatively, seen with respect to a new basis formed by the rows of the mixing matrix A. The gene expression profiles x i will be represented in component space by the profiles s i (rows of S), which will inherit the class labels from the correspondent x i . The new set of variables, the independent components, have minimal statistical independence between them and may be interpreted as influences of some unobserved variables, as described by matrix A. The task of ICA is to estimate from the observed gene expression data the independent components or, equivalently, to estimate the mixing matrix A. Furthermore, we are specifically interested in extracting most of the information contained in the original gene expression data within just a few independent components. ICA offers an effective framework for dimensionality reduction. The procedure we adopt for truncating the profiles s i in order to keep only independent components that are important in our analysis is presented in Results. Intuitively, similar experiments/conditions from the original gene expression set could be described by a single component in the transformed dataset. After retaining the most informative components, we will use as inputs to our algorithm the truncated profiles s i , which will be referred to from now on as patterns.

Approaches to ICA
One main category of approaches to the ICA computation relies upon batch computations that minimize or maximize some relevant contrast functions. These algorithms suffer from the problem that they require very complex matrix operations. Another category considers adaptive algorithms that are based on stochastic gradient algorithms, implemented with neural networks. These approaches suffer from slow convergence. Moreover, the convergence itself depends critically on the choice of the learning rate parameters and therefore it cannot be guaranteed.
On the other hand, the fixed-point ICA algorithm, presented by Hyvarinen et al. (1997), is an approach that utilizes a highly efficient fixed-point iteration scheme for finding the local extrema of a suitable contrast function that accepts as arguments a linear combination of the observed variables. The detection of the local extrema of this contrast function makes possible the approximation of the non-Gaussian independent components. The fixed-point algorithm works practically for any non-Gaussian distribution of the independent components (Hyvarinen, 1999) and for any choice of the contrast function. The choice affects only the performance optimization. This is very important, since there is little a priori information about the distribution of the gene expression profiles. Additionally, the algorithm in its hierarchical version finds the independent components one at a time, instead of working in parallel like most of the suggested ICA algorithms that solve the entire mixing matrix. This makes it possible to estimate only certain desired independent components .
Before applying the fixed-point algorithm, a preliminary whitening of the data x is performed, a useful preprocessing step in data analysis. The observed vector x is linearly transformed to a vector u = M • x, such that the elements u i are mutually uncorrelated and all have unit variance. Therefore, the correlation matrix of u equals unity: E {u T • u} = I. This transformation can be accomplished effectively with PCA. A benefit of using PCA is that in addition to whitening, it can at the same time perform an effective dimensionality reduction of the data. Specifically, the dimension of the transformed data vector u is reduced to d , where d is the number of independent components that we wish to obtain. Therefore, after the transformation we have: Since the components s i are independent by assumption: It follows that B = M • A is an orthogonal matrix. Therefore, the problem of finding an arbitrary full-rank matrix A is reduced after the PCA whitening to the simpler problem of finding an orthogonal matrix B, from which the independent components are obtained with s = B T • u. The criterion for finding the orthogonal matrix B is to minimize the final objective function J (w) = E {(w T u) 4 } − 3 w 4 , where w T u are linear combinations of the observations u i and with z = B T w, with w = z = 1, as in Hyvarinen et al. (1997).

Algorithm dynamic growing and adaptation
Unlike the standard SOM, which has a fixed architecture, we designed a model initialized with a map that consists of only four nodes, arranged on a rectangular 2 × 2 grid, and then expands automatically in order to properly represent the input data. Each node corresponds to a different cluster, with one or several clusters in the final map configuration representing a class. Weights of the starting nodes are initialized with random patterns from the input dataset. Patterns representing the gene profiles in the independent component space are successively presented to the map, each of them being assigned SOM-based clustering algorithm for microarray expression profiles 601 to the node with closest weight vector, according to a distance measure (named 'winner node') that is consequently adjusted together with its direct neighbours, to represent the patterns mapped. Since the map starts with a much smaller than usual sized SOM, there is no requirement for a large neighbourhood to train the whole map at the first learning steps (e.g. with four nodes initially in the map, a neighbourhood of one only is required). As training proceeds, during subsequent training runs, the area defined by the neighbourhood becomes localized near the winning node, not by shrinking the vicinity radius (as in the standard SOM) but by enlarging the SOM with the dynamic growing.
The learning rule for the adjustment of the weight vector remains the same as in the standard SOM. Specifically, during the adaptation phase the weight vectors of the four nodes in the direct neighbourhood of the winner and for the winner itself according to the following formula: where the learning rate µ ω (t) is an exponentially decreasing sequence of positive parameters, N t is the neighbourhood of the winner node at the tth learning step, (d (i , j )) is the neighbourhood function implementing different adaptation rates even within the same neighbourhood, w j (t) is the distance between the input pattern and its closest weight vector and d 1 2 (the row and column of a node i are denoted by i r , i c , respectively).
The learning rate µ ω (t) typically starts with a value of 0.1 and decreases exponentially to 0.02. These values are chosen as to have a relatively fast convergence without, however, sacrificing the stability of the map (Haykin, 1999). The learning rate should be kept to a small value during the convergence phase, but not allowed to decrease to zero, in order to avoid the network getting stuck to a map configuration with topological defects.
The neighbourhood function (d (i , j )) can be defined with the following simple formula: The expansion of the map is driven by a process that aims to minimize two concepts of error, the quantization error (representing the unsupervised part of the learning algorithm) and the classification error (representing the supervised part). The relative significance of the supervised part is controlled by a parameter r su . More formally, we could state that the learning and the expansion of the map aim to minimize a measure τ E of the form: where N d is the number of nodes and ε i the individual composite error term for node i , defined below: The map expansion procedure is performed by inserting new nodes in the neighbourhood of the node j , with the largest composite error measure after the presentation of all the patterns to the map. Training efficiency and implementation simplicity were motivations for expanding mostly from the boundary nodes (see Figure 1). By these means, depending on the position of the node j , the insertion process follows the guidelines described bellow: 1. If node j is a boundary node, the expansion is performed by acquiring one to three new nodes in the direct neighbourhood of the node j . The weights of the new nodes are initialized so as to preserve the trend of the weights of neighbouring nodes (to respect the weight flow ). Specifically, the weight of the new node is computed as , with w av ,c denoting an average of the weights of nodes in the neighbourhood of the node initiating the expansion. 2. If node j is a near boundary node i.e. a node from which the boundary of the map can be reached by traversing in any direction at most two nodes, a percentage (usually 20-50%) of the weight of the node initiating the expansion is shifted towards the outer nodes. This operation alters locally the Voronoi regions and usually, with a few 'rippling' operations, the respective node's weight is propagated to a boundary node (located nearby). 3. Finally, if the node with the largest error measure is neither placed on the boundary nor near the boundary, the alternative of inserting a whole column of new nodes is used. The rippling of the weights is avoided in these cases, since usually excessive computation times are required before the weights propagate from a node placed deep in the map to a boundary node. The column of new nodes is inserted in the grid respecting the direction of largest total error. Specifically, the sum of the composite errors for the nodes in the grid columns to the right and to the left of the winner node is computed.
The new column of nodes is inserted to the side where the error measure is larger. The initial weight values of the new nodes are assigned with respect to the weight flow of the neighbouring node weights, as described above.
Map expansion, followed by the re-adaptation of the weight vectors, is performed repeatedly until convergence (based on the relative change of the total error measure for all nodes between successive training runs) and expansion criteria are met. These criteria are discussed in the following section. After the final expansion of the map is performed, we proceed to a fine-tuning of the respective map configuration. The procedure is similar to the adaptation phase, with the exception that the learning rate decreases to a smaller value in order to allow fine adjustments to the final structure and to allow a lower convergence threshold to be imposed.

Expansion mechanisms
As introduced above, the map expansion takes into account two kinds of error measures. Minimizing the unsupervised part of the above composite error measure ε corresponds to producing clusters of genes that are similar to each other according to a similarity measure (we prefer to use the Manhattan distance, since it is less sensitive to outliers). This term assures the proper functioning of the algorithm, even for completely unlabelled data. The minimization of the supervised component of the error measure, on the other hand, secures a proper use of the labelling. Minimal supervised error forces patterns with similar labels into the same cluster.

Unsupervised component of the composite error measure
We have modified the local error measure that is commonly used for implementing dynamically growing schemes (Alahakoon et al., 2000;Herrero et al., 2001). The measure we are using is customized in such a way to be less sensitive when a large number of similar patterns are mapped to the same node. The unsupervised term in (5) is, thus, a weighted average local error: where S i denotes the set of patterns x mapped to node i , |S i |represents the number of elements of the set S i , w i is the weight vector of the node i and the Dist operator, the distance metric. The weight factor rc x is given by: and corresponds to the average frequency ratio of the L x classes c l to which pattern x belongs. If a pattern x is unlabelled (i.e. is not assigned to any functional class) rc x is set to rc x = 1. The

SOM-based clustering algorithm for microarray expression profiles 603
reason for choosing the exponent factor −1/2 is to make the errors on the low frequency classes account more (e.g. if class A is 100 times less frequent than B, it is 'amplified' 10 times more). By this means it promotes an additional expansion in the neighbourhood of the node where rare classes are mapped, and consequently it enhances the possibility that rare classes will be mapped on distinct nodes. As a result, the representation of these low-frequency classes is improved. We should note that these classes are usually of most biological significance. Being an average value, ε wale does not increase by the accumulation of a large number of similar patterns to a node.

Supervised component of the composite error measure
Each node i is assigned a classification vector cl with elements cl k , k = 1 . . . N c , where cl k is the ratio of the patterns with functional label k among all the patterns mapped to the node cl k = V k V pattern (N c is the total number of classes, and unlabelled patterns do not contribute to V pattern ). The vector cl is considered as the predicted soft classification and each cl k quantifies the degree of certainty that the node i (and consequently the mapped patterns) is assigned a label k .
We defined an entropy-like parameter, which quantifies the uncertainty of the class label of node i , similarly to an entropy measure (Haykin, 1999). However, since each pattern may have multiple functional labels, for a node i , we may have the situation where the sum N c k =1 cl k > 1. As a result, we cannot interpret cl k as a probability (that the node i is assigned a label k ) distribution and therefore the entropy parameter we defined using this ratios is not a proper entropy. Additionally, we compute for each cl k value a corresponding q k value, such as q k = 1 − cl k , which may be interpreted as a measure indicating the degree in which node i does not belong to class k , so that obviously cl k + q k = 1. The additional parameter q k helps in giving a better account of the class uncertainty within nodes. Then, the entropy-like parameter can be computed as the sum of the 'entropies' of all individual classes: This entropy-like quantity retains properties similar to the entropy. Hl(i) is zero for unambiguous nodes (nodes that contain patterns with the same functional label) and increases as the uncertainty about the class label of the respective node increases. The upper bound of Hl(i) is N c log(2) and corresponds to the situation where all the classes are equiprobable (i.e. the labelling mechanism does not favour a particular functional class).
The effective handling of the multi-labelled data can be explained by a simple example. Let N c = 2 and suppose that 50 patterns are assigned to node i , all of them having as a label both classes, while another 100 patterns are assigned to node j , half of them belonging to one class and the other half to the other. Although in each case there are 50 patterns voting for each class, the quantity Hl(i) will be high for node j (Hl (j ) = 2 log(2)) and zero for node i (Hl (i ) = 0). Thus, it would correctly indicate that node i should be kept as it is, representing patterns which are labelled simultaneously by both classes, and that node j does not unambiguously represent any of the two classes and the map should be expanded in its neighbourhood.
Existing class information is explored in order to resolve better near class boundaries. In other words, since nodes with high entropy correspond to nodes where patterns of different classes are equally mapped, it is intuitively clear that these nodes correspond to regions that lie near class boundaries. This means that by achieving a small entropy term (by adding new nodes near to these regions) proper class boundaries are created.

Balance between the supervised and unsupervised part
As stated before, the r su parameter controls the relative significance of the supervised information in the analysis of the respective dataset. The parameter could be set manually in advance; however, this affects the performance of the algorithm crucially. With r su = 0 we have pure unsupervised learning. The higher the r su value, the more the composite error ε is minimized for configurations that fit better the a priori classification. If the goal of the clustering analysis is to derive class labels for gene patterns with unknown functionality based on already annotated data, the higher the confidence in the annotation of the expression data is, the higher the r su parameter should be set. For sufficiently large values of r su the a priori component dominates completely. Clearly, since in this case the information within the dataset is suppressed, care should be taken when dealing with such situations, especially if the goal of the analysis is exploratory insight of the dataset.
Therefore we prefer to produce multiple models and to introduce a model selection step, in order to select a well-performing parameter automatically. Specifically, we start with a zero-valued supervision parameter to produce the first purely unsupervised model, and then the whole algorithm is repeated each time for an increased r su parameter until the classification performance is near optimal. The classification performance for each r su value is measured with a specialized soft classification performance measure, ClassifPerf(r su ), which is able to handle the multiple labelling of patterns (Sable et al., 1999) and is evaluated as follows: as already mentioned, each node i is assigned a classification vector cl with elements cl k . Each pattern has also an a priori (binary) classification vector, cl pattern . Then, for each class label k of each pattern j mapped to node i , a score is assigned. This score equals cl k if the corresponding label is included in the original class assignment of the pattern j (i.e.cl pattern k = 1) and equals q k = 1 − cl k in the other case (i.e. cl pattern k = 0). Intuitively, a small cl k ≈ 0 for a class that does not appear as a functional label (i.e.cl pattern k = 0) of an input pattern j , is much more a success than a failure, therefore it is considered by a score q k ≈ 0. In this way a total score for each pattern j is computed as: The performance Perf j for each pattern j is then obtained by dividing the TotalScore j with the total number of functional class labels N c : The global measure of the performance Class − ifPerf (r su ) for a given ratio r su is obtained by averaging the Perf j values for all the patterns of the dataset DS, i.e: with |DS| denoting the number of elements in the dataset.
Finally, a well performing ratio r su is selected by using the following criteria: on the curve of classification performance over r su we select the value of r su for which a significant increase of the classification performance is obtained, followed by a relatively levelled part in the plot. The increase of the classification performance with an increasing value of r su is explained by the increased role of the a priori information for the formation of the proper cluster structure. The significance of the levelled part in the region of the plot of classification performance that follows the chosen r su is that further increasing the strength of a priori information does not provide an improved generalization performance. The second criterion that guides the identification of the proper value of r su is that, for the chosen value of the supervised parameter, the number of nodes in the map is considerably smaller compared to map configurations with comparable classification performance (thus yielding a model with small complexity).
In a condensed form, the main steps of our algorithm are: 1. Initialize 2 × 2 map and set r su = 0 (pure supervised learning). Repeat//develop a series of models each of them corresponding to an increasing value of r su . Repeat: 2. Adaptation phase 3. Expansion phase until criteria for map expansion are satisfied. 4. Fine tuning adaptation phase. 5. Save map configuration for the current supervised/unsupervised balance r su . 6. Compute classification performance for current r su . 7. Increase significance of the supervised part (increase r su ) until classification performance ≈ 1.

Map convergence and expansion control criteria
The composite error measure also guides the convergence of the map, with the map assumed to have converged when the relative change of the total error measure for all nodes between successive training runs drops below a threshold value. The setting of the threshold value in the range 0.01-0.02 was indicated by several tests performed on different gene expression datasets. It provides sufficient convergence without excessive computation.
Intuitively, the map expansion should stop when the patterns mapped to the nodes of the map are similar and 'not random' (i.e. unrelated). The problem thus reduces to the definition of similarity between patterns. In our setting, similarity has two aspects, supervised and unsupervised. In the unsupervised case, similarity between patterns is related to the distance between them. In order to treat the problem quantitatively, we set a confidence level α from which we derive a threshold d thr for the distance between patterns, below which two patterns are considered similar. The confidence level α has the meaning that the probability that two patterns allocated to the same node are 'random' (not similar) is lower than α, if the distance between them is smaller than the threshold. Obviously, the definition of a statistical confidence level would be only possible if the distribution of the distances between random patterns was known. Practically, although the distribution is unknown, it is easy to approximate it by randomly shuffling all the patterns' elements. This randomization destroys the correlations between different patterns, while it retains the other characteristics of the whole set (e.g. ranges and histogram distribution of values). In this way we compute an approximation of the distribution of the distance between random patterns. Figure 2 illustrates the distributions of the distances between the randomized patterns and the actual patterns from the component space. In this case, by choosing a statistically common confidence level α = 0.05, we can determine a lower and a upper threshold for the distance between patterns (dL thr and dU thr ) and consider the distances in the interval [dL thr dU thr ] as random. Taking into account that smaller (larger) distance corresponds to a larger (smaller) correlation, we can state that for distances smaller than dL thr a positive correlation between patterns is implied, while a negative correlation is indicated by distances larger than dU thr .
Having determined the distance thresholds, we compute for each node i the ratio of intra-node pairwise distances between patterns which fall beyond the thresholds, i.e.: rd i = #distances beyond thresholds #distances within node i We then compute the sum: where K denotes the number of clusters. Obviously, for the extreme case that every node contains only patterns with inter-pattern distances fall outside these thresholds, RD takes the value RD = 0 (representing the least random state possible), whereas for the opposite extreme case, when the nodes only contain patterns with inter-pattern distances that fall within the thresholds, we have RD = 1 (representing the most random state possible). A purely unsupervised expansion control criterion has been used before in the literature (Herrero et al., 2001;Mavroudi et al., 2002). However, such an approach is reliable only if the distribution of the random patterns was known and not estimated. Also, the definition of randomness in such a case depends only on the similarity of the patterns, i.e. the derived thresholds. Therefore, the similarity thresholds would be the same for all nodes, although there is no reason to believe that the degree of similarity between patterns that belong to the same class is bound to be the same for different classes.
We add to the unsupervised control criterion described above a supervised component, which quantifies the difference between the representation of functional classes in the entire dataset and their representations within different nodes. We would expect that a random distribution would allocate the patterns in such a way that the representation of classes in a given node would be similar to the representation of the classes in the initial dataset. If the different classes represented in the dataset were of equal size (contained the same number of patterns), we would expect a random assignment to result in a uniform representation of all classes and the entropy-like measure would be an appropriate measure of randomness (i.e. a high entropy value The results of the data shuffling illustrate that the distances between randomized data occupy a distinct distribution. The plot concerns the data distributions from dataset 1. dL thr and dU thr are the derived distance thresholds, with similar patterns expected to have inter-pattern distances smaller than dL thr would mean that the distribution is random and a low value that the distribution is not random). However, since most gene expression datasets have a highly unbalanced number of functional classes (with hundreds of genes belonging to a specific functional class, while other classes may be represented by fewer than 10 genes) we prefer to quantify the differences in classes representation in the initial dataset and within each node in terms of a measure based on the Hellinger distance.
The Hellinger distance measures the discrepancy between two probability density functions (pdf) and, unlike other divergences (e.g. Kullback-Leibler divergence), it is symmetric, bounded, and satisfies the triangle inequality (we could say that the Hellinger distance is a measure of affinity between two distributions). In a state space , with ω ∈ being the set of all possible states: Similarly, with the classification vectors defined previously, we define for each node i vectors ϕ i of the ratios ϕ ik = N ik N i , where N ik is the number of patterns with functional class label k (k = 1 . . . N c ) mapped to node i and N i is the total number of labels within node i (i.e. a multi-labelled pattern is counted multiple times, once for each of its labels). Thus, ϕ ik are the observed frequencies of occurrence of patterns with label k . One can interpret the ϕ ik as probabilities, since N C k =1 ϕ ik = 1. Specifically, for each class k , a pattern that is mapped on the node i has a probability of ϕ ik of belonging to SOM-based clustering algorithm for microarray expression profiles 607 the class k and a probability of ψ ik = 1 − ϕ ik of not belonging to the class k , with ϕ ik + ψ ik = 1. Equivalently, the frequency of occurrence of patterns with class label k in the initial dataset is computed as: with N k P being the number of patterns with label k in the initial dataset and N P being the total number of labels in the dataset (again, multilabelled patterns count multiple times). Again, ϕ Data k can be interpreted as probabilities, i.e. the probability that a pattern in the initial dataset belongs to the class k . Then, the distance for each node i is given by: where N C is again the number of distinct classes. The multiplicative factor 1 √ 2 · N C is introduced in order to normalize the distance to the range [0,1]. Finally, we compute the total average distance TD H for all K clusters as: Combining the unsupervised measure RD previously defined and the supervised distance measure TD H we derive a total measure of the form: where β is a parameter controlling the balance between the supervised and unsupervised term. Rand = 0 corresponds to the case where all interpattern distances of the clusters fall beyond the thresholds and the distribution of the class labels of the nodes differ completely from their distribution in the initial dataset. Obviously, Rand = 1 corresponds to the other extreme case, where the inter-pattern distances and the distribution of the class labels are random.
Finally, the expansion criterion is given by: The parameter ε has been determined empirically to 5%: for β = 0 it has the meaning that we consider the map configuration as not random and stops the expansion only if, on average, less than 5% of the patterns which are mapped to the nodes do not fulfil the intra-node threshold requirements. For β = 1 the configuration is considered as not random if, on average, the similarity between the class label distribution of the nodes and the class label distribution of the dataset is less than 5%.
An important observation is that, since the patterns mapped to each cluster fall between certain similarity distance thresholds and have a distinct class label distribution, it follows that patterns that are not tightly co-expressed with other patterns will be mapped to distinct clusters. In other words, nodes that are selected as winners for very few (usually one or two) patterns, termed uncolonized nodes, seem to correspond to genes that lack coexpression with other genes and are probably noisy patterns. Of course, there is also a chance that these patterns are not pure artifacts, but very unique patterns that have potential to provide knowledge. Therefore they are amenable to further study, and this is why we do not delete these nodes from our scheme. However, due to the changes in the weight values of the nodes during the map adaptation, some patterns may be accidentally assigned during the training process to different nodes than those to which they are finally assigned. That is why we mark and isolate only the patterns that are consistently (three times or more during the complete map training schedule -for all r su parameter values -consisting of few hundred training runs) mapped to uncolonized nodes. In contrast, nodes that are not selected as winners for any pattern are removed from the map, in order to keep it compact.

Results
Stanford web site. The two (related) datasets were generated by studying this fully sequenced organism with microarrays containing essentially every open reading frame (ORF). The larger dataset contains 80-element gene expression profiles for 6221 genes and the smaller one consists of 79-element gene expression profiles for 2474 genes. Apart from making use of our method's capability of incorporating supervised information into unsupervised learning and assessing its performance in exploratory data analysis, we studied its (re)classification capability by artificially disturbing the functional classification labels in order to assess the strength of the model as a reclassification device. In another experiment, we investigate the meaningful information arising from ICA decomposition. Additional results proving our method's solid performance in clustering, compared to other classical approaches, are presented in the Supplementary Material (http://www.interscience.wiley.com/ jpages/1531-6912/suppmat/).

Dataset 1
Samples in this set were collected at various time points during the mitotic cell division cycle (60 time points in four different experiments: α-factor, cdc-15, cdc28-based synchronization and elutriation), sporulation experiments (13 time points) and the diauxic shift (seven time points). The sources of these profiles were eight different microarray experiments .
The study of yeast genes during the diauxic shift, for example, were obtained from DeRisi et al. (1997). With a fluorescence ratio method, they measured the relative abundance of mRNA for the entire yeast genome, to examine the changes in expression that take place with the metabolic shift from anaerobic to aerobic metabolism. The levels of expression of genes were measured in seven samples, taken at 2 h intervals, and reflect metabolic reprogramming that occurred during the diauxic shift.
Annotation for the genes in this dataset was derived from the Functional Classification Catalogue of the Munich Information Center for Protein Sequences (MIPS) Comprehensive Yeast Genome Database (CYGD), available at http://mips.gsf.de/ proj/yeast/CYGD/db/index.html. The selected annotation included all the 19 top-level functional Classification not yet clear 115 -Unclassified proteins 2186 - The first column contains the category name, the second column the number of patterns corresponding to each annotation and the third column, the distribution of the class labels in the dataset. * Marked categories are employed in the classification validity experiments.
categories. Table 1 presents the functional categories, the number of gene expression profiles (or, alternatively, the number of component profiles s i ) corresponding to each annotation and the distribution of the class labels in the gene expression dataset.
The gene expression profiles are arranged in a table with rows corresponding to genes and columns to the individual log-transformed gene expression ratios of each gene in a particular experimental condition represented by the column. The weighted K-nearest neighbours imputation method presented in Troyanskaya et al. (2001) is applied in order to systematically fill up the missing values.
The implementation of the ICA model has been accomplished with a fixed-point algorithm, using an implementation that is freely available at http://cis.hut.fi/projects/ica/fastica/fp.shtml. Dimension control by PCA was used and 65 dominant components out of 80 were selected, which accounted for 98.64% of the variance in the Figure 3. Plot of the ordered PCA components obtained from dataset 1. It can be noticed that the first few components contain most of the variance (first 65 components contain 98.64% of the variance in the data) data. In Figure 3 the eigenvalues corresponding to these principal components are plotted along their indices. It can be seen how the 15 removed eigenvalues have actually very low values. We could have excluded even more eigenvalues without a great loss in data variance, but we decided to keep them in order not to lose any valuable information. Removing the less significant principal components helped reducing the computational complexity of further analysis.
Matrix A (see equation 1) was initialized with the first independent component and the deflation approach was chosen. The deflationary estimation of independent components consists in obtaining first one independent component (typically by maximizing a measure of non-Gaussianity) then estimating the second component, discarding the direction of the first one, and so on, repeating the procedure until all the independent components are obtained (Delfosse et al., 1995). Typically this is done by constraining the search for new independent components to the space that is orthogonal to the already found components. We ordered the independent components and selected the first 10 independent components by means of the amount of data variance they retain and by the non-Normality of their distribution, as described by Liebermeister (2002). The robustness of our approach (caused by its SOM-based nature) makes it not sensitive to the exact number of independent components employed. Our choice was driven by the need not to include too many components that actually describe 'noise'. Figure 4 shows the first five characteristic basis functions resulted from the ICA transform (rows of matrix A). It is interesting to observe that each of them seems to capture information that corresponds to a different subgroup of experiments. The three upper basis functions are more active during the cell-cycle experiments. The fourth shows an increased response during the sporulation and diauxic shift experiments, while the fifth is mainly active during the sporulation experiments.
Patterns having as elements the coefficients of the ten 'most important' independent components selected above were then used as input to our map. Therefore the input data matrix contained 6221 patterns of 10 elements (the representation of original gene expression profiles in the component space defined by the selected independent components). The algorithm selects for this dataset the model with a supervision parameter value r su = 0.6. It  Spellman et al. (1998), the sporulation experiment (13 time points) was performed by Chu et al. (1998), while the diauxic shift experiment was performed by DeRisi et al. (1997) corresponds to a model with an acceptable classification performance (ClassifPerf = 0.65), small model complexity (map with reduced number of nodes) and a small rate of increase of the classification performance for further increased values of r su (see Table 2). Our approach does not set out to enforce a perfect classification but rather to perform exploratory analysis while making use of the existing supervised information. The relatively low rate of increase in the correct classification rate on the performance curve in Figure 5 indicates that further increasing the influence of the supervised information above the chosen r su value does not result in a significant increase of the classification performance, which may be due to overfitting. The SOM-based nature of our method enables data visualization during the learning process. It should be noted that increasing the amount of supervised information used during learning (increasing r su ) does not necessary mean that the number of nodes should increase. In some cases supervised information helps in producing a more compact map.

Exploratory data analysis experiment
In order to test how much the incorporation of a priori class information helps exploring unlabelled data, we unlabel 30% of the data in dataset 1 and arbitrarily split the 30% in two equal subsets (subsets T and V, each containing 15% of dataset 1 ). We monitor the classification performance for subset V in two cases: when both T and V are unlabelled and separately, with the labels of T added back. In both cases, the remaining 70% of the data is labelled. The difference in the percentage of labels of V correctly induced shows how much the incorporation of a priori class knowledge (about T) helps in the exploratory analysis of V. The procedure is repeated for 50 random subsets and results of the experiment are presented in Table 3. As expected, with increasing values of the r su parameter, the effect of adding labels to the subset T is stronger. Overall, the Figure 5. Selection of an appropriate r su parameter, corresponding to the model that makes most effective use of the supervised information. The classification performance should present a reduced rate of increase with values of r su higher than the chosen one results prove the utility of our method as an exploratory analysis tool.

Classification validity experiments
Another set of experiments we performed on dataset 1 aimed at investigating the classification validity of our approach. Specifically, we selected six of the 19 top-level functional categories of the entire dataset, as marked in Table 1. A total of 2262 genes belong to these categories and the patterns corresponding to their representation in independent component space was used in subsequent analysis. Our selection objective was to include categories containing different number of genes (Metabolism contains 1059 genes, while just 30% of the data in dataset 1 is unlabelled and split in two equal subsets, T and V, while the other 70% is left as in the original dataset. Subsequently, classification performance on the patterns from subset V is evaluated in two settings: with both subsets T and V unlabelled; or with only subset V unlabelled. The figures represent ratios of patterns belonging to subset V that are correctly classified, averaged over 50 random experiment reruns and their respective standard errors (we compute the standard error as the standard deviation in the sample divided by the square root of the sample's size).
59 genes represent the Signal Transduction Mechanisms category) such as to test our approach's performance when confronted to categories of a wide range of sizes. The decision to work on a subset of the data is motivated by the reduction of the large computational requirements and by obtaining a better control of the evaluation process. We induce randomly a small number of errors in the original labelling and test to which extent the unsupervised drive of our algorithm can recover the original labelling. The incorrect labelled patterns should be assigned to a node of the correct class by the algorithm, if the unsupervised learning is predominant. Specifically, in the case of the categories presented above, we alter the labelling of around 10% of the patterns in each category and we monitor our algorithm's capacity of recovering the original classification by means of unsupervised learning. We repeat the procedure on random partitions of the set in order to obtain statistical confidence (the results in Table 4 are obtained for 50 repetitions). Indeed, the experiment shows that for small values of r su the original labelling is recovered, while increasing the contribution of the supervised information leads to patterns being assigned to the classes indicated by the wrong labelling.
Furthermore, Table 5 presents classification results of our approach (denoted as SOM-IC) compared to those of established methods, as well as our previous method (sNetSOM) from Mavroudi et al. (2002). As expected, the support vector machine (SVM) obtains the highest classification performance; however, our approach yields results competitive to other algorithms. All the algorithms were tested with a 10-fold cross-validation technique (Yeung et al., 2001 a). Again as expected, Class recovery ratios represent the percentage of successful label corrections performed on the mislabelled patterns. The results are averaged over 50 random repetitions and next to them are presented the respective standard errors (we compute the standard error as the standard deviation in the sample divided by the square root of the sample's size). It can be noticed that for small values of r su the algorithm recovers to a higher degree the original labelling, while high values of r su force the use of altered labels.
the experiments testing our approach's handling of pure unsupervised problems, both using the original gene profiles and their representations in the component space.

Dataset 2
The dataset consists of expression profiles of 2476 yeast genes samples at different time points and during eight experimental settings (Brown et al., 1999), in total 79 measurements for each gene. Six functional classes are contained in the dataset and they were obtained again from the MIPS Yeast Genome Database. The first five classes [tricarboxylic acid cycle (TCA), respiration chain complexes, cytoplasmic ribosomes, proteasomes and histones] represent biological categories of genes and are therefore expected to present a similar expression variation. The sixth (helix-turn-helix proteins) is not a functional class but genes included in this group have in common that they code for the helix-turn-helix structural motif; in this analysis they are used as a control set. We evaluated the dispersion of class representation over the nodes of the final map configuration by measuring the entropy of class representation. We expect this measure to be high in the case of helix-turn-helix class, due to the heterogeneity of the respective gene patterns. The relatively high entropy values for the TCA and Respiration classes may be explained by their less homogeneous structure (the fact that these classes The first column contains the functional classes, the second column contains results from analysis performed with patterns from the independent component space, while results corresponding to analysis performed directly on gene expression data are presented in the third column. As expected, the entropy measure for the helix-turn-helix is high, reflecting the heterogeneity of the patterns representing this class. accumulate patterns belonging to other functional classes is caused by the low degree of similarity among their patterns). As in the previous experiment, we have run the algorithm both on the original gene expression data and on the ICAderived patterns. The results obtained are presented in Table 6. The higher entropy values obtained for the map structure corresponding to the analysis performed directly on gene expression profiles suggest a higher dispersion of class representation and therefore less homogeneous clusters. However, the high degree of supervised information employed may be, as in the experiment above, the cause for the relatively low difference in entropies for the algorithm running on the two types of data. The map structure studied is the one corresponding to the model with an optimal value of r su = 0.6.

Conclusions
We have introduced a self-growing adaptive network, which aims to overcome the drawbacks of current clustering algorithms for gene expression data by swiftly integrating unsupervised and supervised learning. The algorithm adaptively determines the number of clusters with a dynamic expansion process based on the principle of local error accumulation. It starts with a small number of nodes and grows to represent the input data. The expansion algorithm prefers to grow new nodes in the neighbourhood of boundary nodes, with a mechanism for whole-column insertion being provided in order to deal with the cases when large maps need to be expanded from a node deep within its interior. The level of expansion is determined automatically, based on a statistical confidence level of non-randomness chosen by the data analyst. The method builds on our previous work (Mavroudi et al., 2002) by efficiently incorporating independent component analysis in a drive to find a more meaningful representation of the data. The method yields an interesting outcome, highlighting particular biological processes as well as representing the data in a biologically sensible way. Also, compared to our previous work, the convergence criteria are improved and more thoroughly defined. Comparative evaluation of the current method as a reclassification device proves it to be superior to our previous approach, with both the incorporation of ICA into our analysis and the enhancements brought to the SOM-based algorithm contributing to this improvement.
Our approach efficiently uses supervised information, whenever available, and is able to deal with incomplete or unreliable functional labelling. Additionally, the method allows enhanced representation of rare classes and successfully handles multi-labelled gene expression profiles.
The resulting clusters are enriched with functionally related genes. Of course, since we use functional information in our supervised learning, this is not surprising. We have to stress though, that the purpose of our algorithm is not to classify the gene patterns according to existing classification, but to include the existing knowledge, which is known to be evolving and incomplete, in order to cluster (reclassify) the genes and to assign a function to unlabelled genes. At the same time, our method proves to be a reliable tool in exploratory analysis, the presented results underlining the positive effect of incorporating a priori class knowledge when exploring unlabelled data.
In future work we intend to further analyse our results from a biological perspective. Especially, we intend to investigate whether, on distinct subsets of clusters that all correspond to a particular top-level functional category, we possibly get a mapping of genes that belong to distinct functional subcategories. Furthermore, it is known that genes are co-expressed with different groups of genes, each group being controlled by a distinct regulatory mechanism, in response to different environmental conditions of the cell (Gash et al., 2002). We intend to uncover the characteristic 'features' of each cluster (i.e. the experiments on which the grouping of genes is primarily based), in order to elucidate the relationships between gene classes and experimental conditions. In this way we expect to obtain biological knowledge about multi-labelled genes, e.g. to uncover the reasons that cause the same gene to be involved in several processes.