K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data

With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A very first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which were mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationship more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL) that we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed traditional linear K-means algorithm, but also presented significantly better performance over our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profile clustering generated biologically meaningful results.


Introduction
In recent years, large amounts of high dimensional data have been generated from high-throughput expression techniques, such as gene expression data using microarray or deep sequencing [1], and metabolomics and proteomics data using liquid chromatography-mass spectrometry (LC-MS) [2]. Mining the hidden patterns inside these data leads to an enhanced understanding of functional genomics, gene regulatory networks, and so forth [3,4]. However, the complexity of biological networks and the huge number of genes pose great challenges to analyze the big mass of data [5,6]. Clustering techniques has usually been applied as a first step in the data mining process to analyze hidden structures and reveal interesting patterns in the data [7].
Clustering algorithms have been studied extensively in the last three decades, with many traditional clustering techniques successfully applied or adapted to gene expression data, which led to the discovery of biologically relevant groups of genes or samples [6]. Traditional clustering algorithms usually process data on the full feature space while emerging attention has been paid to subspace clustering. Traditional clustering algorithms, such as -means and expectation maximization (EM) based algorithms, mostly use linear associations or geometric proximity to measure the similarity/distance between data points [8].
When applying traditional clustering algorithms to the domain of bioinformatics, additional challenges are faced due to prevalent existence of nonlinear correlations in the high dimensional space [9]. However, nonlinear correlations are largely untouched in contrast to the relative mature literature of clustering using linear correlations [5,[10][11][12]. There are several factors making nonlinear clustering difficult. First, a pair of nonlinearly associated data points may not be close to each other in high-dimensional space. Second, it is difficult to effectively define a cluster profile (i.e., the "center" of a cluster) to summarize a cluster given the existence of nonlinear associations. Third, compared to measures that detect linear correlations, nonlinear association measures lose statistical power more quickly with the increase of random additive noise. Fourth, given the high dimensions, computationally expensive methods, for example, principal curves [13,14], are hard to be adopted even though they can model nonlinear relationships.
In this paper, we try to address these problems by developing a clustering method that can group data points with both linear and nonlinear associations. We name this method " -profiles clustering. " Our method is based on the previously described nonlinear measure: the Distance Based on Conditional Ordered List (DCOL) [15,16]. The key concept is to use data point orders in the sample space as the cluster profile. We have previously described a hierarchical clustering scheme named General Dependency Hierarchical Clustering (GDHC). However the computation of GDHC is very intensive. The new -profiles clustering method is much more efficient, representing a ∼20-fold reduction in computing time. Conceptually, it is the nonlinear counterpart of the popular -means clustering method, while the existing GDHC is the nonlinear counterpart of the traditional hierarchical clustering method. Another key advantage of the -profiles clustering method is that, by building statistical inference into the iterations, noise genes that do not belong to any cluster will not interfere with the cluster profile estimation, and they are naturally left out of the final results.

Distance Based on Conditional Ordered List (DCOL).
We first consider the definition of Distance Based on Conditional Ordered List (DCOL) in two-dimensional space. Given two random variables and and the corresponding data points {( , )} =1,..., , after sorting the points on -axis to obtain the DCOL is defined as Intuitively, when is less spread in the order sorted on , col ( | ) is small. We can use col ( | ) to measure the spread of conditional distribution | in a nonparametric manner [16].
The statistical inference on col ( | ) can be conducted using a permutation test. Under the null hypothesis that and are independent of each other, the ordering of the data points based on is simply a random reordering of . Thus we can randomly permute {( )} =1,..., times and calculate the sum of distances between adjacent values in each permutation. Then we can find the mean and standard deviation from the values sampled from the null distribution. The actual col ( | ) can then be compared to the estimated null distribution to obtain the value. Notice this process does not depend on . The permutation can be done once for and the resulting null distribution parameters apply to any , which greatly saves computing time.

Defining a Cluster Profile and Generalizing DCOL to
Higher Dimensions. Let U be a -dimensional random vector ( 1 , 2 , . . . , ), where each is a random variable; then an instance of random vector U can be seen as a point in the -dimensional space. Assuming instances of random vector U are sorted in the -dimensional space, then col ( | U) can be computed according to (2) for any random variable . Therefore, the key problem is to define the order of a series of -dimensional points. When is one-dimensional, we can easily prove that a list of numbers ( 1 , 2 , . . . , ) is sorted if and only if ∑ =2 | − +1 | is minimized. We generalize this to -dimensional space and define instances (u 1 , u 2 , . . . , u ) as sorted if and only if the sum of distances between the adjacent -dimensional points is minimized. Sorting the points is equivalent to finding the shortest Hamiltonian path through the points in dimensions, the solution of which is linked to the Traveling Salesman Problem (TSP) [17]. Many methods exist for solving the TSP [17].
If we consider the random variables as genes, we have effectively defined a profile for the cluster made of these genes. Using this profile, we can compute the col ( | U) for any gene and determine if the gene is close to this cluster, which serves as the foundation of the -profile algorithm.

The -Profiles Algorithm.
In this section, we outline the DCOL-based nonlinear -profiles clustering algorithm. First, we define the gene expression data matrix × , where samples are measured for genes and each cell is the measured expression level of gene on sample . Each row represents the expression pattern of a gene while each column represents the expression profile of a specified sample.
The -profiles clustering process is analogous to the traditional -means algorithm overall. However there are two key differences: (1) Different from the -means clustering algorithm, we use the data point ordering (Hamiltonian path) as the cluster profile rather than the mean vector of all data points belonging to this cluster; (2) during the iterations, the association of each point to its closest cluster is judged for statistical significance. Points that are not significantly associated with any cluster cannot contribute to the estimation of the cluster's profile.
Due to the random initialization of clusters, we use a loose value cutoff at the beginning and decrease it iteration by iteration as the updated cluster profiles become more stable and reflect the authentic clusters more reliably as the clustering process progresses.
(a) To start, we compute the null distribution of DCOL distances for each gene (row) and obtain two parameters, mean and standard deviation , for each gene simultaneously by permuting columns of the matrix 500 times. The gene-specific null distribution parameters are used to compute the values of the DCOL whenever assigning a gene to the closest cluster. (b) Initialize clusters by generating random orders as cluster profiles; set value cutoff to upper bound. (c) For each row vector, compute its DCOL distance to each cluster according to corresponding cluster profile col ( | U ), where is the th gene and U is the th cluster. Assign it to the closest cluster if the DCOL is statistically significant in terms of value. In this step, we are implicitly computing values for each gene and taking the minimum. Thus we need to adjust the value cutoff to address the multiple testing issue. We assume each cluster profile is independent of the others. Then it follows that, for each gene, the values are independent. Under the null hypothesis that the gene is not associated with any of the clusters, all the values are i.i.d. samples from the standard uniform distribution. Thus the nominal value cutoff is transformed to = 1 − (1 − ) 1/ .
(d) When all gene vectors have been assigned, recalculate the profile of each cluster using a TSP solver.
(e) Repeat steps (c) and (d) until the cluster profiles no longer change or the maximum iteration is reached.
We start with a loose value cutoff. In each iteration we reduce the value cutoff by a small amount, until the target value cutoff is reached.
The above procedure is conditioned on a given , the number of clusters. We used gap statistics for determination of . Other options such as prediction strength or finding   the elbow of the variance-cluster number plot are also available. Here we replace the sum of variances by the sum of negative log values.

Simulation Study.
We generated simulation datasets with 100 samples (columns) and gene clusters, each containing 100 genes (rows). Another 100 pure noise genes were added to the data. was set to 10 or 20 in separate simulation scenarios. Within each cluster, we set the genes (rows) to be either linearly or nonlinearly correlated using different link functions, including (1) linear, (2) sine curve, (3) box wave, and (4) absolute value (Figure 1).
In the hidden factor approach, for each cluster, we first generated the expression levels of a single controlling factor by sampling the standard normal distribution. Then for each gene, a function was randomly drawn from the four functions mentioned above (Figure 1). The gene was generated as the function of the hidden controlling factor plus certain level of noise from the normal distribution: (new) = ( ) + .
In the 1-dependent approach, the expressions of genes in a cluster were generated sequentially. The first gene was generated by sampling the standard normal distribution. From the second gene on, we first randomly chose one gene that was already generated and randomly chose one function from the four available functions (Figure 1). We then generated the new gene as the function of the previously generated gene: (new) = ( (selected) ). After the expression of all genes in a cluster was generated, certain level of noise was generated from the normal distribution and added to the gene expression profiles. The 2-dependent approach is similar to the 1-dependent approach. The difference is that, for each new gene, two previously generated genes were randomly selected, and two functions were randomly chosen. The new gene was generated as the summation: (new) = 1 ( (selected 1) ) + 2 ( (selected 2) ). The 's were sampled from the uniform distribution between −1 and 1. Again certain level of noise was generated from the normal distribution and added to the gene expression profiles.

Simulation Results.
In the simulation experiments, we compared the -profiles algorithm with General Dependency Hierarchical Clustering (GDHC) and the traditional -means clustering algorithm. The GDHC was paired with the dynamic tree cutting method to cut the trees into clusters [18]. We used the efficient TSP R library to compute the cluster profiles [19]. We adopted the external evaluation metric Adjusted Rand Index (ARI) [20] to compare the clustering results with the true cluster memberships to judge the performance of the methods. In Figure 2, the average ARI values were plotted against the noise level. Higher ARI values indicate better clustering performance. The figure contains three columns and two rows with each column representing a data generation mechanism and each row representing a different number of clusters. In the left column, data was generated by the hidden factor mechanism, where all features in a true cluster were linearly/nonlinearly linked to a latent factor. In columns 2 and 3, features in each cluster were generated using 1-dependent and 2-dependent mechanism, respectively. In such a generation mechanism, genes generated later depend on previously generated genes in the same cluster [15]. In the meantime, the first row shows results from data with 10 clusters, while the second row shows results from data with 20 clusters.
For GDHC, we used the dynamic tree cutting method [18] to cut each tree. Various values of minimum cluster size were tested. For -profiles clustering, we started with a value cutoff of 0.2 and gradually reduced the cutoff to 0.05 with the iterations. We ran each setting (cluster size, data generation scheme) 20 times and plotted the average results in Figure 2. We can see obviously that both -profiles and GDHC outperformed linear relation-based -means clustering algorithm significantly in all cluster parameter settings. -profiles also did a better job than GDHC in recovering the true clusters. We allowed four minimum cluster size levels in the dynamic tree cutting, 50%, 75%, 95%, and 100%, of the true cluster size. Generally the 50% setting performed the best. Figure 3 shows the confusion matrices of an example clustering result as images. We can see the composition of the reported clusters by the three different clustering algorithms. Cleaner images indicate better agreement between true clusters and the detected clusters. When looking into all three confusion matrices, we can see that in each reported cluster our proposed method discovered a dominant group with only a little impurity. However, in traditional -means clustering, the reported clusters were mostly composed of several small groups, which rendered it of little use when the data contains much nonlinear relations. GDHC performed much better than -means with 4 reported clusters (rows) composed mostly of elements from the same true clusters. Clearly, the new -profiles clustering method achieved the best performance in the simulations.
The -profiles and GDHC clustering methods were both based on DCOL, which detects both nonlinear and linear relationships, although it has lower power to detect linear relationship compared to correlation coefficient. Next we studied how the methods behave when the true relationships are all linear. We used the same hidden factor data generation scheme but allowed only linear relations in the data generation, which means all genes in the same cluster were linearly related to the same hidden factor. We simulated data with 10 clusters, each containing 100 genes, plus an additional 100 pure noise genes. -profiles achieved similar performance to -means when the noise was at low to moderate levels ( Figure 4). This is likely due to the fact that -means does not involve statistical testing to exclude noise genes from the clusters.
Besides being a more effective nonlinear clustering method, the -profiles method is also more efficient compared to GDHC. On a data matrix with 2000 rows and 100 columns, the average computing time of -profiles was ∼30 seconds on a laptop with i7-3537U CPU and 6Gb memory, while the GDHC used ∼600 seconds.

Real Data Analysis.
We conducted data analysis on the Spellman yeast cell cycle data, which consists of four time series synchronized by different chemical reagents, each covering roughly two cell cycles [21]. One of the time series, the cdc15 data, contains a strong oscillating signal [22]. We removed the cdc15 dataset and used the data of the three remaining time series. The data matrix consists of 49 samples (columns) and 6178 genes (rows).
We applied the -profiles clustering method using a series of values. With each value, we retained the final value of every gene. We then took the negative sum of log values ∑ − log( ) at every and plotted the value against . An elbow was observed at around 30 ( Figure 5). Thus we chose = 30 for subsequent analyses.
Among the 6178 genes under study, 4874 were clustered into 30 clusters. The minimum cluster size was 59, and the maximum cluster size was 328. We then judged the performance of the methods using functional annotations. For this purpose, we resorted to Gene Ontology [23]. We used a set of GO terms that categorize genes into broad functional categories, the GO slim terms from the Saccharomyces Genome Database (SGD) [24]. Some of the GO slim terms are too broad; we limited our analysis to terms with 2000 annotated genes or less. We found that almost all the clusters are associated with certain GO slim terms using the hypergeometric test [25] for overrepresentation ( Figure 6).
From Figure 6, we see clearly that several clusters, including clusters 2, 5, 7, and 12, are highly associated with cell cycle related processes, which are clustered in the lower 1/3 region of the plot ( Figure 6). We then plotted the heatmaps of the expressions of the genes in these clusters, which indeed showed strong periodical behavior. An example, cluster 2, is presented in Figure 7. We notice the genes in this cluster were mostly periodic genes, yet they exhibit different phase shifts. Such genes may not be clustered together using traditional methods based on linear associations.