2-Way k-Means as a Model for Microbiome Samples

Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized k-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.


Introduction
Microbiome analysis [1] by sequencing of ubiquitous genes, most commonly 16S rRNA, is a standard, cost-effective way to characterize the composition of a microbial sample. Standard analysis tools facilitate quantifying the fraction of sequence reads from each bacterial species in a sample [2]. Interpretation of composition vectors across a collection of samples typically relies on dimensionality reduction followed by clustering in the lower-dimensionality space [3]. This allows identification of functionally meaningful subsets of samples with characteristic microbiota. The Human Microbiome Project [4] and its derivatives such as the Human Vaginal Microbiome Project [5] have collected and thus analyzed large numbers of samples towards elucidating the structure and composition of microbiota across physiological and pathological states.
Similar to variation in microbial genomes across different human individuals, variants along the nuclear genomes have been summarized by a small number of dimensions [6]. However, in contrast to analyses of microbiome samples, analyses of inherited genetic variation standardly assume and observe samples to be spread across a continuum in the reduced space, rather than be clustered [7]. Samples in between clusters are interpreted as originating from intermediate locales along a geographic cline [8] or as representing different levels of a mixture between cluster-specific populations.
In this paper, we formally tackle the problem of clustering while allowing elements to belong to two clusters. Specifically, we will describe in detail a model for clustering in ℝ d . We construct a model that generalizes k-means clustering by allowing data points to be assigned to a point in the space along the line between two assigned clusters [9]. Each cluster is still modeled as a Gaussian with uniform, spherical covariances; the key difference is the presence of a parameter u ∈ 0, 1 for each 2-way-assigned data point x i , which determines the proportional assignment of x i between its two cluster representatives. We first describe the 2-way model's inputs, parameters, and outputs. We then give the objective function, an algorithmic description, and a series of performance metrics. Next, we evaluate the performance on simulated data, describing benchmarks for optimal performance. Finally, we apply the model to real data of 16S rDNA sequencing from 1500 midvaginal bacterial samples by the Vaginal Human Microbiome Project. uniform, spherical Gaussian distributions or from pairwise weighted averages of these Gaussians.
Formally, we describe a generative model for a set X of data points x i n 1 ∈ ℝ d . The model involves k ∈ ℤ + clusters. The jth cluster is parametrized by its mean μ j ∈ ℝ d . To simulate x i , the model first chooses a pair of cluster indices (j, j ′ ) along with a weighting u i ∈ 0, 1 . x i is drawn from a Gaussian distribution whose parameters are u i -weighted averages of two representative clusters. Specifically, The inference problem involves the inputs of data X and number of clusters k, seeking output of the generative model parameters, that is, the vectors of assignments C = c 1 , …, c n and weights U = u 1 , …, u n .

Generalized k-Means.
Given input x 1 , …, x n ∈ ℝ d and cardinality k ∈ ℕ, k-means traditionally provides us with the following objective: where c 1 , …, c j are the cluster representatives. The k-means objective can be generalized as the following: where Φ = ϕ 1 |ϕ 2 |⋯|ϕ n ∈ 0, 1 k×n are the cluster assignments and C = c 1 |c 2 |⋯|c k ∈ 0, 1 d×k are the cluster representatives.
A common generalization of k-means is to permit each ϕ i to have s nonzero entries (in our case, we set s = 2). An algorithm for this generalized objective is simply to hold C fixed while performing sparse regression on Φ and then hold Φ fixed and use ordinary least squares (OLS) to find C.
In our case, because we only allow points x i to lie uniformly between two cluster representatives, the two nonzero entries in a given ϕ i are restricted to some u i ∈ 0, 1 and 1 − u i ∈ 0, 1 . Our problem is instead the following: subject to u i ∈ 0, 1 For a given c j and c j′ , minimizing with respect to u ijj′ reveals a global minimum at After minimizing with respect to u ijj′ , we project u ijj′ to the region 0, 1 . We set u ijj′ = 0 if the minimizer is less than 0 and set u ijj ′ = 1 if the minimizer is greater than 1. This allows us to achieve the minimum value of u i over the domain 0, 1 for x i .
After minimizing the assignment Φ, we then use OLS to pick optimal C as specified before. Formally, OLS produces a vector c T i that minimizes the squared residual error between an input matrix Φ T and vector Taking the gradient and setting equal to zero yields the following formula: Thus, we perform OLS for all vectors c T i at once with matrix multiplication: Thus, this gives us representatives c 1 , …, c k that minimize the residual error between the cluster representatives and data points subject to Φ. We then alternate this process for r rounds until convergence.

Performance Metrics.
We use the 2-way k-means objective as a performance metric in measuring the accuracy of a model in unsupervised examples.
where Φ has at most two nonzero entries with values u i ∈ 0, 1 and 1 − u i ∈ 0, 1 . Additionally, we also use four different error rates to measure the accuracy of 2-way k-means on test cases. Let c * i , μ * j , and u * ijj′ be the ground truth instance parameters, that is, respectively, true 2-way cluster assignment of x i , center of cluster j, and 2-way weighting for x i between clusters j, j′ .
err f x defines the 0-1 error rate for 2-way cluster assignment: err μ defines the squared deviation from optimal μ * : err u defines the squared deviation from optimal u * ijj ′ . WLOG, we assume u ijj ′ = max u, 1 − u , where u is the variable drawn from 0, 1 :

Example Run for 2-Way k-Means.
We find it illuminating to demonstrate the performance of 2-way k-means versus vanilla k-means on a cartoon example.
In Figure 1, we simulated n = 500 data points in ℝ 2 from three clusters, with respective means μ 1 = 0, 0 , μ 2 = 1, 1 , and μ 3 = 2, 0 and covariance matrices Σ = 0 001I. Data points are drawn into pairwise clusters by choosing two cluster representatives without replacement from the following prior probabilities: k-means predicts the 2-way cluster assignments of ≈11% points incorrectly due to skewing the cluster means toward the middle of the graph. 2-way k-means, however, significantly improves on all error rates. After 10 rounds of 2-way k-means, we achieve the results in Figure 3. For every statistic, the results are clearly an improvement on standard k-means. The ≈3% error rate on cluster assignment still exists because 2-way k-means points closest to cluster representatives may be assigned to an incorrect secondary cluster.

Benchmarks
3.2.1. Sparsity (Avg. of 10 Trials, 10 Rounds Each). Our sparsity test was conducted by keeping cluster prior probabilities and cluster centers μ constant while varying the number of data points (ratio of α means n = 500α). From Figure 4, we see that the algorithm performs consistently well under a variety of conditions, but too few data points can hurt performance to an extent. (Avg.of10Trials, 10RoundsEach).We test the error rate as a function of the Euclidean distances of μ (ratio of α means μ 1 = α 0, 0 , μ 2 = α 1, 1 , and μ 3 = α 2, 0 ). From the results in Figure 5, we can see that a certain threshold isrequiredforproperperformanceofthealgorithm.Thismakes sense, as when α = 0 01, the clusters are almost on top of each other and difficult to distinguish. Additionally, as the cluster centers aremovedfartherapart,theℓ 2 normbetweenthecluster representative determined by thealgorithmand the actualcluster representative increases (but this is to be expected).

3.2.2.ClusterSeparation
3.2.3. Variance (Avg. of 10 Trials, 10 Rounds Each). We increase the variance of the clusters while fixing cluster prior probabilities, data points, and cluster centers (ratio of α means Σ = α 0, 0 0001 , 0, 0 001 ). From the results in Figure 6, we can see that large variance hurts proper performance of the algorithm. Analogous to cluster separation, as when α = 100, the clusters are too close to distinguish.

Real Data.
Publicly available sequence data for the Human Microbiome Project (HMP) study SRP002462, described as metagenomic sequencing of 16S rDNA from vaginal and related samples from clinical and twin subjects, was downloaded from the NCBI SRA database [10]. The downloaded sets of data correspond to two separate submissions: SRA169809 (1608/1608 samples were downloaded) and SRA273234 (34/133 samples were downloaded), for a total of 1642 samples.
The SFF files were processed and cleaned using the microbial community analysis software mothur [11], based on a standard protocol developed for 454 sequence data processing and quality control [12]. The dissimilarities between the samples were calculated using the Clayton-Yue dissimilarity measure. The data was subsampled to 5000 sequences per sample (this step results in dropping out 136 samples that had less than 5000 reads in total) 500 times to produce the distance file, which was used to calculate principal coordinates. Figure 7 shows the graph of~1500 data points after PCoA. After implementing the 2-way k-means algorithm [13], we initialized with k-means k = 5 and ran 2way k-means for 5 rounds on the data.
Unfortunately, the nonlinear arches between the clusters pushed the cluster representatives slightly outside the clusters. Nonetheless, the algorithm was still an improvement over k-means. We note that after k-means, the 2-way objective had a value of 108.0 while our 2-way k-means algorithm converged on an objective of ≈51 0 after 5 rounds. Additionally, the algorithm gives us a characterization of the samples lying between two clusters. The results can be seen in Figure 8.

Discussion.
We first get the most abundant operational taxonomic unit (OTU) in each sample (down to the genus level) and the closest cluster assignment for each sample. We use this to observe which OTUs are the most common to each cluster. We can find the closest sample to each data point by simply taking the argmax u for each data point x i . From Table 1, we see that four of the five clusters have a unique and most abundant OTU, while cluster c 3 has a variety of abundant types. Aside from the top four OTUs, separating the data into discrete clusters obscures how the rest of the OTUs can be characterized.
By using each data point's cluster-pair assignment, we further separate the data into k 2 − k clusters. Let c jj ′ designate the data points that are between clusters j and j ′ but are nearer to cluster j than cluster j′. We take the most abundant OTUs in each sample and the cluster pair for each sample. We can then find the most abundant OTUs for each cluster pair. Table 2 shows the structure of the most abundant OTU types for each 2-way cluster c jj′ defined before. Once again, we find that clusters c 1j , c 2j , c 4j , and c 5j are all dominated by the same single OTU from before. Yet, observing clusters c 3j′ provides us with a more in-depth understanding of the diverse cluster c 3 .
Interestingly, we see that the makeups of c 31 , c 32 , c 34 , and c 35 are remarkably different. We immediately see that the top four OTUs are all predominantly contained in cluster pairs that include their single main cluster. In addition, we notice that the samples with abundant Sneathia 1 , Prevotella 2 , and unclassified types are predominantly contained in c 31 . c 32 contains samples with a variety of abundant OTUs. Lactobacillus 3 , Lactobacillus 4 , Prevotella 1 , Streptococcus 1 , Streptococcus 2 , and Bifodobacterium are abundant in samples that are   In this way, 2-way k-means also opens up a wealth of information on the relationships between samples. In particular, it now makes more sense to characterize the samples as being in 6 different clusters: c 1 , c 2 , c 31 , c 34 , and c 5 . We also see that certain clusters have mixed relationships, while others have almost no interaction. Without 2-way k-means, this would not be immediately obvious.

Conclusion
The complexity of microbial populations is unfolding as microbiome data becomes increasingly available. Yet, standard methodologies oversimplify microbial compositions by pigeonholing them into discrete clusters. This paper further refines the models for microbial abundance across groups of samples. We allow samples to be presented as a weighted average of two clusters, rather than belonging to only one. This may be motivated biologically, as the sample often reflects a mixture of two sources of microbiota, each well represented by a cluster. An alternative explanation is that the averaged sample represents an intermediate, potentially temporary state of the microbial composition, between the more stable ones represented by the clusters themselves.
Technically, we formalize this model as a generalization of k-means. We derive a simple algorithm to infer such a structure and validate its benchmarks on simulated data.
Applying our algorithm to real data from the Human Vaginal Microbiome Project provides empirical support to the 2-way model. We showed that while most of the samples lie in six clusters: four well-defined clusters and two subclusters. Furthermore, while previously, a sizable fraction of samples in between clusters was ignored, the 2-way model characterized the entire distribution. Using 2-way k-means, we can tell that a large portion of the previously unclustered samples, which lie in between two clusters, contains shared properties. In addition, we see that certain clusters have mixed relationships, while others have almost no interaction.

Further Research
In addition, this paper leaves several open questions and opportunities for further research: (i) How can we efficiently characterize a 2-way distribution with nonspherical covariance matrices?
(ii) How can we efficiently characterize a k-way distribution?
(iii) How can we efficiently characterize a 2-way distribution with nonlinear paths between cluster representatives?
Addressing these questions will further help us understand the composition of microbial populations.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.