Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications

Regression clustering is a mixture of unsupervised and supervised statistical learning and data mining method which is found in a wide range of applications including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes. The method also performs supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to the least squares and robust statistical methods. We also provide a model selection based technique to determine the number of regression clusters underlying the data. We further develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with analyzing a real data set on RGB cell marking in neuroscience to illustrate and interpret the method.


Introduction
Regression and clustering are probably two of the most important statistical data mining methods used in practice including artificial intelligence and neuroscience. However, regression clustering, a data mining method integrating the two, has rarely been studied as a single entity despite its great potential for practical use. It is, thus, the intention of this paper to focus on statistical estimation, selection, and computing of regression clustering. In this section, we briefly review cluster analysis but not the familiar regression analysis and then introduce the regression clustering problem.
(1) Cluster Analysis. Cluster analysis is an important unsupervised statistical learning and data mining technique for clustering homogeneous observations from data. Its main objective is to divide a collection of data points, often of multivariate nature, into subsets or "clusters" such that observations within one cluster are more "similar" (homogeneous) to each other than to observations in different clusters.
Cluster analysis is usually used in situations where clustering information is not observed on the data points and one wants to get this information from the data to explicitly group them.
Many approaches have been developed in cluster analysis, which in general fall into two categories: hierarchical and partitive. A hierarchical approach proceeds by either a sequence of "agglomerative" stages or a sequence of "divisive" ones. At each agglomerative stage, clusters are produced by merging or retaining the clusters produced at the immediate previous stage, where clusters at the initial stage may simply be taken to be those individual data points. Contrarily, at each divisive stage, clusters are produced by splitting or retaining the clusters produced at the immediate previous stage, where one may assume a single cluster containing all the data points at the initial stage. The key feature of a hierarchical approach is that clusters obtained at one stage are derived from those in the immediate previous stage. On the other hand, partitive approaches refer to those nonhierarchical ones which may be further classified according to other features of clustering.   The outcome of a hierarchical clustering is often represented by a graph called dendrogram in which each stage of merging or splitting is determined by optimizing some similarity or dissimilarity criterion. A significant drawback of hierarchical clustering methods is that the divisions or fusions, once made, are irrevocable. That is, when an agglomerative algorithm has joined two objects into one cluster, they cannot subsequently be separated, and when a divisive algorithm has made an unwanted split, the objects involved can no longer be recombined into one cluster. Kaufman and Rousseeuw [1] comment on this as follows: "A hierarchical method suffers from the defect that it can never repair what was done in previous steps." In contrast, a partitive clustering constructs a fixed number of clusters often by an iterative procedure. It imposes two requirements in the procedure: (i) each cluster must contain at least one object and (ii) each object must belong to exactly one cluster. In addition, the number of clusters constructed stays fixed during the iterations and an initial partition is required to start the iteration. At each iteration, a tentative partition is constructed by relocating the data points to optimize a conditional criterion. This procedure continues until certain convergence or stability of partition occurs. Commonly used partitive clustering approaches include thosemeans type of methods: -means, -modes, -medians, and -medoids [2,3]. New developments in this regard can be We present an example here to illustrate the use ofmeans method for clustering. The example uses the wellknown Iris data from Anderson [6] which was analyzed by Fisher [7] and many others. The data give the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of the 3 species of Iris: setosa, versicolor, and virginica. The data which can be retrieved from statistics package R [8] are displayed in Figure 1, where we see the data of sepal length and width and petal length and width distributed in clusters. So we use the -means algorithm of Hartigan and Wong [3] to find a partition of 3 clusters for the data and compare the partition with the species information given. The computing is done in R with a random initial partition determined by set.seed(123). The result is summarized in Table 1, from which we see a perfect match between cluster 1 and species setosa and some mismatch between clusters 2 and 3 and species versicolor and virginica.
(2) Regression Clustering. In this paper, we will focus on regression clustering, a data mining method which iteratively clusters data into clusters according to the available regression pattern and then updates the regression in each cluster simultaneously until equilibrium is attained. It is commonly known that regression is for studying the relationship between a dependent variable and a set of explanatory variables which have observations on a sample of objects. If the samples come from different populations and the variable indexing the populations also has an effect on the dependent variable, the regression should be performed on individual populations separately through the corresponding subsamples observed, or by including the population effect in the model, in order to make valid or more reliable statistical inference. However, the population indexing variable sometimes is not observed or unobservable. In such situations, it is necessary to cluster the sample objects to conform to their respective populations as much as possible and then apply regression to each cluster. We refer to this procedure as regression clustering if our focus is clustering the data points or as cluster regression if it is studying the unobserved regression patterns in the data.
Before getting into details of regression clustering, we review various measures of similarity or dissimilarity used in general cluster analysis. Note that to identify possible clusters of observations in data it is essential to be able to measure how close or how far individual data objects are to/from each other. Current measures include the single linkage (nearest neighbour) and complete linkage (further neighbour) (cf. [1]) and the -means. These are usually considered as descriptive since they do not involve any probability distribution and use only descriptive statistics as the measures of similarity or dissimilarity between observations. An obvious disadvantage of using a descriptive measure is one cannot make statistical inference on results of clustering; thus, one is not able to assess the variability involved in the results. To enable making statistical inference, probability distributions or models are postulated for the clusters of data, and it is deemed that data in the same cluster have the same probability distribution. Hence, the similarity or dissimilarity measures to be used are assigned a probability distribution, and the significance and variability of clustering can be readily derived. Probability model based approaches can be applied in both hierarchical and partitive types of clustering. We choose to use the probability model for partitive regression clustering here.
Note that there is no absolute boundary between descriptive and probability model based clustering methods. Some clustering methods were heuristically motivated, but later on, statisticians studied their performance from a probabilistic perspective. For instance, MacQueen [2] and Pollard [9] studied the asymptotic behaviour of -means using a probability model based approach; Hartigan [10] and Wong [11] investigated the mathematical relationship between highdensity clusters and the single-linkage clustering method.
Consider a finite set of objects O = {1, . . . , } together with data z 1 = ( 1 , x 1 ) , . . . , z = ( , x ) ∈ R +1 being the observations of these objects. The problem of regression clustering is to recover the latent partitioning Π = (C 1 , . . . , C ) of O so that the relationship between ( 1 , . . . , ) and (x 1 , . . . , x ) can be studied by regressions on C 1 , . . . , C separately. A probability model based clustering approach assumes that the observed data z 1 , . . . , z are a sample of respective random vectors Z 1 , . . . , Z that belong to a set of populations indexed by Π. Thus, those Z with ∈ C , = 1, . . . , , have the same probability distribution. The specification of the probability distributions can be either parametric or nonparametric.
(3) Organization of the Paper. In Section 2 we provide a detailed formulation of regression clustering including modeling, parameter estimation, and partition determination. In Section 3 we present two procedures for estimating the number of clusters in cluster linear regression. In Section 4, a pointwise iterative assessing algorithm is developed for implementing the regression clustering procedures. A simulation study and an example are presented in Section 5. A real data example on RGB cell marking clustering is analyzed in Section 6. The paper ends with a Conclusion section.

Regression Clustering Model and Optimization
Regression clustering becomes very useful when one intends to recover or estimate the unobserved class-specific regression hyperplanes based on the sample data of dependent and explanatory variables. Note that the notion of hyperplane used here is a generic one, which means it does not necessarily pass through the origin in the space. It should be more correctly called an affine set. But we do not distinguish them in this paper. For cluster regression or regression clustering problem, the data have the form ( , x ), = 1, . . . , , where x ∈ R is an explanatory column vector and ∈ R is a random dependent variable for the th object. The probability distribution of x does not provide any information on regression hyperplanes; thus, our statistical inference will be made conditional on the observed x . In other words, we can simply treat x as nonrandom. As in the general setting of probability model based cluster analysis, there are two different approaches for regression clustering. One is the random partition or soft partition approach in which each data point is assigned a nonzero probability to fall into any of the clusters or equivalently follows a mixture probability distribution. The discussion can be found in DeSarbo and Cron [12] and Quandt and Ramsey [13], among others. Another one is the fixed partition or hard partition approach in which each data point is assigned a cluster membership or label through certain optimization procedure, so a data point belongs to only one cluster. As discussed in Bock [14,15] and Späth [16,17], the probability distribution or classification likelihood function of a data point in a fixed partition approach of regression clustering, with an unknown partition Computational Intelligence and Neuroscience where in many situations we can assume (x , ) to be a normal density with mean x and variance 2 . This is equivalent to describing the data by a group of linear models: Since the partition Π is unknown and the number of possible such partitions depends on , the model (2) is a nonparametric one. Actually, it can be proved that the total number of nondegenerate partitions of form Π is equal to the Stirling number of the second kind ( , ) = (1/ !) ∑ =0 (−1) − ( ) ; confer Tomescu [18]. Also, the linear regression function in (2) can be extended to a nonlinear one including spline and local polynomial regression and so forth under this regression clustering setting. This extension will not be pursued in this paper. Further, the true distribution of need not be the normal. Namely, we use (0, 2 ) only as a "working" distribution for . Then the corresponding least squares or maximum likelihood approach becomes the quasi-likelihood one that still possesses many optimality properties (cf. chapter 9 of [19]). We resort to using a robust approach in this paper instead to deal with the violation of normality assumption.
Given the regression clustering model introduced above, we need to estimate the parameters ( , 2 ) =1,..., and find the best partition Π together with for application. Optimal parameter estimation and partition can be achieved using the maximum likelihood principle, while finding the optimal can be done based on an information criterion. The latter will be explained in next section. Now, we proceed to do parameter estimation and partition.
Under the fixed partition model (2), the log-likelihood function is given by It is clear that the best estimates of the parameters and the partition should be those maximizing the log-likelihood (3) for given . However, the number of possible partitions ( , ) is astronomic even for moderate and ; for example, (20, 3) = 580, 606, 446 and (50, 3) ≈ 1.2 × 10 23 . Therefore, it is almost impossible to find the global optimal partition by enumeration. Here, we propose an iterative estimation method to find local optimal estimates of ( , 2 ) =1,..., and Π for a given . This method extends the exchange method of Späth [16,17].
Then, loĝis monotonically increased if the steps (4) and (5) are carried out alternately. This procedure leads to a local maximum in finitely many steps. It is expected to be a good approximation of the global maximum if an initial partition is properly chosen. In practice, we often assume that the variance parameters 2 , = 1, . . . , , have a common value 2 and estimate 2 by a pooled estimator. This modification tends to return a more robust partition than otherwise.
Note that the work in this section so far can be extended to multivariate regression clustering without any theoretical difficulty. The essential difference between multivariate regression and multiple regression is that the former has a vector response variable while the latter has a univariate one. Hence, (3) to (5) and the relevant ones in the rest of the paper can be easily modified to incorporate the vector response variable, from which it is ready to perform multivariate regression clustering. We will not get into the technical details involved but will provide a real data example in Section 6 to perform multivariate regression clustering.
It is well-known that the least squares method is very sensitive to outliers and violation of the normality assumption in the data. Robust methods can be developed to overcome this vulnerability. Among them, procedures based on -estimation are considered here. -estimation can be regarded as a generalization of the maximum likelihood estimation. A particular one is the maximum likelihood estimation based on Huber's least favourable distribution, whose density function is the normal at around the origin and the exponential in the tails. Using Huber's -estimation method, we can drop the assumption ∼ (0, 2 ) in (2) and estimate by minimizing ∑ ∈Ĉ ( − x ) for given partitionĈ , = 1, . . . , . Here, (⋅) is Huber's discrepancy function defined as where is determined by the scale parameter in Huber's least favourable distribution. We find that assuming a constant scale parameter across all clusters tends to give better robust results, so we adopt this assumption in this paper. Now for given estimateŝ, = 1, . . . , , each data point is assigned or reassigned to clusterĈ = arg min 1≤ ≤ ( −̂x ). At this point, it can be seen that, instead of loĝ, the function the above two -estimation steps are carried out alternately. This gives a robust counterpart of the likelihood-based local optimal estimation and selection introduced earlier in this section.
To conclude this section, note that the fixed partition approach has a particular advantage over the random partition one in the context of regression clustering or cluster regression. As observed by Hennig [20], the mixture probability model involved in random partitioning presumes implicitly an assignment independence of each object to clusters with respect to the covariate vectors x . That is, the clusters keep the same proportions { , = 1, . . . , } for every fixed covariate vector x . In other words, the probability of a point ( , x ) to be generated by cluster is independent of x and . This is generally not true as shown in Figure 2, which is adapted from Hennig [20]. On the other hand, the fixed partition model (2) supposes that the cluster membership of each object or cluster labels are explicitly parameterized and are determined by the estimation of̂and̂2 through the points ( , x ) , ∈ C . Hence, the fixed partition model does take care of the problem of possible assignment dependence between the th object and the associated covariate x . In principle, the random partition approach can be generalized to account for the assignment dependence, for example, by allowing { , = 1, . . . , } to depend on x . But the resultant probability model will be much more difficult to be analyzed both algebraically and numerically; and no such study can be found in literature so far to our knowledge.

Estimating the Number of Clusters
The number of clusters to be used in regression clustering is normally unknown so it should also be estimated. In this section we provide two procedures for estimating the number of clusters, one based on least squares estimation and the other on robust -estimation.
We use a more detailed notation O ( ) = {1, 2, . . . , } to denote the data objects which have observations ( 1 , x 1 ), . . . , ( , x ) as described in previous sections. Recall that these objects are assumed to be a random sample coming from a structured population, which consists of a fixed (but unknown) number, say 0 , of subpopulations, each of which is characterized by a regression hyperplane with class-specific unknown parameters. Therefore, for the observations from this population, there exists an underlying partition Π ( ) where is an × design matrix in the cluster O , e O is an -vector of random errors, is an × identity matrix, and = |O | for = 1, . . . , 0 . Here, ( 0 , ) ∈ R ×R + , 1 ≤ ≤ 0 , are 0 unknown parameter vectors; and 0 , 1 ≤ ≤ 0 , are assumed to be distinct from one another. It is clear that = 1 + ⋅ ⋅ ⋅ + 0 . In the following, we assume that 0 ≤ , where is a known positive integer. Note that in (7) we have suppressed the in O ( ) for simplicity of presentation. Also note that the normality assumption for the random errors e O , although reasonable in many situations, is just a "working" distribution and not really required for applying the least squares method.
In order to estimate 0 , we fit a regression clustering model to the data for each ≤ using the methods developed in Section 2. A criterion function of can be obtained from the cluster regression fitting. Then 0 is estimated as the minimizer of the criterion function. Shao 6 Computational Intelligence and Neuroscience and Wu [21] have used this idea to develop an informationbased criterion for estimating 0 . Let Π ( ) = {C ( ) 1 , . . . , C ( ) } be an arbitrary -cluster partition of the observations. Shao and Wu's information criterion is defined as where ( ) is a strictly increasing positive function of , is a sequence of positive constants,̂are least squares estimators, and ‖⋅‖ is the Euclidean norm. Typically, ( ) = and ∝ log( ) or ∝ log log( ) are chosen. Then̂, the estimate of 0 , is the integer that minimizes this criterion, that is, It can be seen that in (8) the first term is the sum of residual squares which measures the goodness of fit of the model and the second term is the penalty for overfitting. Moreover, the criterion (9) shows that one determines the optimal number of clusters and the corresponding partitioning simultaneously. We shall call (8) together with (9) Criterion LS-C in the sequel, which stands for clustering by the LS method. Under some mild conditions, it is shown in Shao and Wu [21] that the proposed Criterion LS-C selects the true number of regression hyperplanes with probability one among all class-growing sequences of classifications, when the number of observations from the population increases to infinity.
Concerning the robustness of regression clustering, one can use a robust criterion to estimate the underlying number of clusters 0 , where we assume that each cluster O ≜ { 1 , . . . , } ⊆ O ( ) is characterized by a linear model: with the random error ,O not following any specific distribution contrary to that in the linear model (7). In particular, Rao et al. [22] have developed the following robust information criterion function for estimating 0 : where is Huber's discrepancy function and̂are theestimators described in Section 2 or equivalently satisfying It can be seen that similar to that in (8) the first term in (11) is a generalization of a minimum negative log-likelihood function derived from Huber's least favourable distribution, and the second term is the penalty for overfitting.
Using (11), the estimatêof the underlying number of clusters 0 is the one satisfying We shall call (11) together with (13) Criterion RM-C, which stands for the clustering based on robust -estimation. Similar to Criterion LS-C, Criterion RM-C implies that one determines the optimal number of clusters and the corresponding partitioning simultaneously. In Rao et al. [22], it is shown that the true clustering and the associated regression hyperplanes are attained with probability 1 by RM-C when increases to infinity and under certain mild conditions. In particular, normal distribution assumption is not required for the random errors in each regression cluster.

Pointwise Iterative Algorithms for Regression Clustering Estimation, Partition, and Selection
Computing algorithms can be written to implement the regression clustering methods described in Sections 2 and 3.
Recall that in the methods we first estimate the optimal partition Π = {C 1 , . . . , C } and the regression parameters simultaneously by minimizing certain within-cluster sum of residual squares sums (SRSS) or alike for each fixed . The quantity to be minimized is equivalent to for LS regression clustering or sum of robust residual squares sums (RRSS) for an -estimation based robust regression clustering. Only local minimization results can be guaranteed here. We process this local minimization for each candidate and use Criterion LS-C or RM-C to determine the best . The whole procedure can be accomplished according to the following algorithm: (ii) Set = + 1, and reset = 1 if > . Identify C such that ∈ C . Then move into C ℎ with ℎ = 1, . . . , and ℎ ̸ = , respectively. For each of these − 1 relocations, refit the model by regression clustering (or robust regression clustering) and calculate the sum of the Computational Intelligence and Neuroscience 7 residual squares sums (or RRSS) accordingly. Denote the smallest one by SRSS ℎ or RRSS ℎ . If SRSS ℎ < SRSS 0 (or RRSS ℎ < RRSS 0 in robust procedure), redefine C = C − { } and C ℎ = C ℎ + { }, and set SRSS 0 = SRSS ℎ (or RRSS 0 = RRSS ℎ ). Otherwise, return to the beginning of (ii).
(iii) Repeat (ii) until the objective function (14) or (15) does not decrease any further, which means that no observation relocation is necessary and the optimal clustering is achieved for this .
(iv) Proceed with (i) to (iii) for each candidate and use the Criterion LS-C or RM-C to find̂, the optimal number of clusters.
It is important to use a good initial partition of {1, . . . , } in running steps (i) to (iii) so that the global minimum of (14) or (15), or its good approximation, can be achieved. We propose to generate the initial partition of a dataset using the following algorithm which we find works well in practice: (I1) Consider the linear model Based on the whole dataset, one estimates by a robust method, for example, least median regression or least trimmed squares method [23]. Note that a random seed is implicitly used in such robust methods.
(I2) Put into set 1 those data points whose distances to the regression hyperplane estimated in Step (I1) are less than a predetermined number, say . If | 1 | and | 1 | are both larger than a predetermined integer, say , set ℓ = 1 and go to the next step; otherwise, set ℓ = 0 and go to Step (I5). Here, 1 is the complementary set of 1 .
One can adjust the values of and either in advance or adaptively to get an initial partition of clusters for any given . For example, set to the integer part of 0.5 / and to the best value such that two clusters can be obtained in (I2).
The above initial partition algorithm gives essentially an iterated hierarchical binary clustering method, where each binary clustering is realized through resistant regression such as the least median regression. The resistant regression is robust, having high breakdown threshold; thus, although not being fully efficient, it is highly likely to produce a reasonable initial partition through its iterated executions.  The two algorithms consisting of Steps (I1) to (I5) and (i) to (iv) may be named IPARC to reflect the iterative pointwise assessing nature in regression clustering.

Example and Simulation Study
In this section, we first apply regression clustering to the Iris data and provide a brief guideline on when to use the method properly. We then present a simulation study to assess the finite sample performance of Criteria LS-C and RM-C.

The Iris Data Example.
Recall the Iris data that we analyzed using the -means method in Section 1. Now we want to use the regression relationship between sepal length and sepal width variables to partition the 150 observations of sepal length and width and petal length and width into 3 clusters. Statistics package R is used to implement our IPARC procedure, where we set = 2 and = 0.05 or 0.2 and initial random seed being determined by set.seed(123456), and use only the least squares estimation in this example. The partition result and its comparison with the species information are summarized in Table 2. Comparing Tables  1 and 2, we see that the cluster information revealed by the cluster regression sepal.length ∼ sepal.width is very much the same as that by the -means and conforms with the species information.
When we use cluster regression sepal.length ∼ sepal.width + petal.length + petal.width to partition the data into 3 clusters, we get a result summarized in Table 3 which is very different from Tables 1 and 2. This confirms that the cluster label information obtained from regression clustering has a different interpretation from that obtained from the -means. The former tells us how differently regression performs across the clusters, while the latter tells us how distances among data observations themselves behave differently across the clusters. Fitting this regression clustering to the data gives SRSS = 2.901 and coefficients of determination of 0.972, 0.958, and 0.971, respectively, for the 3 regression hyperplanes. On the other hand, when we fit the same regression model to the 3 clusters determined by the -means, we get SRSS = 12.699   and coefficients of determination of 0.575, 0.525, and 0.578. Similar results are obtained if the same regression model is fit to the 3 clusters determined by the species variable. Therefore, regression clustering method is fundamentally different from the general cluster analysis methods such as the -means.
One should use regression clustering if partitioning data to conform to the regression pattern is of interest.

Simulation Study.
We use simulated data sets to assess the finite sample performance of Criteria LS-C and RM-C for regression clustering. Two factors will be considered for this simulation: number of clusters (2 or 3) and error distributions ( (0, 1), (3), or Cauchy(0, 1)), so there are in total 6 cases of data to be considered, which are summarized in Table 4.
There will be only one covariate involved in the regression clustering and the covariate is generated from (0, 1). The parameters used for each case are given in Table 5. Then, the fixed partition regression clustering model = x 0 + , = 1, . . . , , = 1, . . . , 0 , is applied to generate the response values , where is a random number originating from (0, 1), (3), or Cauchy(0, 1), and the first element of x is the constant 1 corresponding to the intercept term in the model. Figures 3 and 4 give us an intuition of what the data typically look like for Cases 1-6 with normal, (3), or Cauchy errors. These figures show that the groupings of the linear patterns are visible with standard normal random errors and getting worse with (3) random errors. The groupings are hard to see with Cauchy(0, 1) random errors.
In this study, we set ( ) = , where is the number of regression coefficients in the model and is a constant in our study; is the unknown number of clusters that we are seeking. It is noted that in an information model selection criterion, a penalty function, which is in (8) or (11), is usually chosen as log( ) or log log( ) with a constant > 0. In light of the fact that lim →0 [(log ) − 1]/ = log log , we set = [(log ) 3 − 1]/3. The function we employed for -estimation is ( ) = 0.5 2 if | | ≤ 1.345 and ( ) = 1.345| | − 0.5 × 1.345 2 otherwise (Huber ). In the following, when we state the simulation results, Criterion RM-C means -estimation based regression clustering procedure with Huber's exclusively.
For each of the six cases, we conduct 1000 simulations using Criteria LS-C and RM-C separately. To apply the algorithm IPARC, we set = 0.2 and = 2 . The algorithm given in the previous section is then used to estimate the number of clusters in cluster linear regression. In Tables 6 and  7, we summarize the results from the simulation study, where each number represents the relative frequencies of selecting the possible numbers of clusters out of 1000 replications.     It is clear that Criterion LS-C performs almost perfectly in Cases 1 and 4 since the errors are standard normal distributed. However, when there exist outliers in the data set or the normality of the data is violated, Criterion LS-C performs poorly. On the contrary, as shown in Tables 6 and 7, Criterion RM-C does as nearly perfect a job as Criterion LS-C in Cases 1 and 4; at the same time, neither outliers nor abnormality has much effect on its ability to detect the underlying true number of regression hyperplanes in the data.
In addition to the robustness shown above in selecting the number of clusters, the procedure of the -estimation based regression clustering is also robust in partitioning the data. Table 8 presents the estimation of the regression parameters by applying LS-C and RM-C to the data shown in Figures 3  and 4. From the table, it can be seen that when the errors are (3) or Cauchy(0, 1) distributed, the LS regression clustering method is not able to capture the underlying groupings, while the -estimation based regression clustering method detects the true linear patterns in the data, in spite of the abnormality in the data.

Analysis of RGB Cell Tracking Data
Recently, a new technique called RGB marking has been introduced to facilitate the identification of individual cell clones in both in vivo and in vitro experiments [24]. RGB marking introduces three lentiviral vectors in individual cells encoding the basic colors red, green, and blue. Raw image data representing 128 colorectal cancer cells are shown in Figure 5; the same data are to be analyzed in detail in this section. Since the colored cells are easily identifiable within whole organ structures, scientists can track the cells and determine their role during processes such as organ regeneration, malignant outgrowth, or immune responses.
To this end, scientists are required to cluster cell types according to some basic color combinations. Due to the variability of the vector insertion, however, single RGBmarked cells express fluorescent proteins at different and very characteristic levels. The underlying principle of additive color mixing, similar to that in computer or TV screens, generates different color combinations that can be used to discriminate individual cell clones. The main difficulty in this kind of data is that the intrinsic variability of the underlying biological mechanisms makes the actual number of distinguishable colors generated by RGB marking in a tissue difficult to predict. In addition, cell intensities for different colors are known to vary depending on the cell area, which is an indicator of cell morphology.
The data set analyzed in this section consists of measurements on colorectal cancer cell lines expressing various quantities of three different fluorescent proteins: Cerulean (blue), Venus (yellow/green), and mCherry (red). The genes coding for the fluorescent proteins were transferred into the cells via lentivirus-mediated transduction at a less than 100% efficiency so that most cells expressed different quantitative combinations of all three fluorescent proteins as described by Weber et al. [24]. The cells were imaged on a high-content imager (Operetta, Perkin Elmer). The final data consisted of fluorescent intensities of red, blue, and green color channels (electromagnetic wavelength in nanometers, nm), morphology parameters including cell areas, and spatial coordinates for 128 cells. Figure 5 shows the original data and clustering obtained by the LS regression clustering approach defined in (14), using multivariate regression with color intensities as the response vector, with morphological predictor (cell area) being used in (c), and without using any predictors in (d). Clustering methods are relatively robust to the initial random seed (here we used set.seed(111)) in both cases. When the cell area predictor is included, the resulting clustering changes, thus suggesting that the cell morphology information (cell area) plays a role in separating different cells types. In Table 9, we summarize the outcome for this LS regression clustering.
To select the optimal number of clusters, we used the information criterion function (8) for LS and (11) for RM, with ( ) = , where is the unknown number of clusters that we are seeking for. Figure 6 shows the optimal numbers of clusters using = log log (C1) and = log (C2) for both clustering approaches. Robust clustering is carried out using Huber's discrepancy function (6) with the tuning constant = 1.345 being chosen. The resulting optimal number of clusters based on C1 is 5 by both LS and RM regression clustering criteria, which is compatible with biological considerations.
Finally, we assess the performances of the LS and RM regression clustering and compare them with that of the  -means method. The prediction strength (PS) statistic introduced by Tibshirani and Walther [25] is used for the assessment.
For a candidate number of clusters ( = 5 in our case), letĈ te = {Ĉ te,1 , . . . ,Ĉ te, } denote the partition of the test set resulting from regression clustering on all the data. Let 1 , . . . , be the number of observations in these clusters. Let C tr be the partition of the test set resulting from regression clustering on the training set. In particular, in the latter case each data point in the test set is clustered using (4) witĥ , = 1, . . . , , produced by the training set. Following notations of Tibshirani and Walther [25], denote [Ĉ tr ,Ĉ te ] as the × comembership matrix, with th element [Ĉ tr ,Ĉ te ] = 1, if a pair of observations and Therefore, the prediction strength is the proportion of observation pairs in the worst performing test cluster whose clustering results remain unchanged when clustering them by the training set clustering rule. Clearly, a regression clustering result has higher predictive power if the associated PS is higher. For our data, we assess the clustering performance by cross-validation using 4 random partitions of our sample. Cross-validated prediction strength values for -means, LS, and RM regression clustering methods are 0.44, 0.80, and 0.66, respectively. This suggests that the LS regression clustering is superior to the -means. Moreover, due to the absence of strong deviations from the multivariate normal model for these data, the out-of-sample prediction strength of the LS regression clustering is larger than that of the robust RM regression clustering approach.

Conclusion
In this paper, we review the general cluster analysis methods and then focus on regression clustering which uses the model based fixed partition method and clusters the data based on the dependence between the response and explanatory variables. We provide both least squares based and robustestimation based methods for estimating parameters, partitioning the data, and selecting the optimal number of clusters in regression clustering. Algorithms have been developed to implement these methods. The example and simulation study conclude a satisfactory finite sample performance of the algorithms. Applying our developed method to regression cluster the RGB cells tracking data gives results compatible with biological considerations. It is known that the methods can only provide a local optimization solution and are computing intensive especially when the sample size is large. Currently, we are investigating these issues and expect to provide an improved solution to be reported elsewhere in the near future.