We present a framework for the identification of cell subpopulations in flow cytometry (FCM) data based on merging mixture components using the flowClust methodology. We show that the cluster merging algorithm under our framework improves model fit and provides a better estimate of the number of distinct cell subpopulations than either Gaussian mixture models or flowClust, especially for complicated flow cytometry data distributions. Our framework allows automated selection of the number of distinct cell subpopulations, and we are able to identify cases where the algorithm fails, making it suitable for application in a high-throughput FCM analysis pipeline. Furthermore, we demonstrate a method for summarizing complex merged cell subpopulations in a simple manner that integrates with the existing flowClust framework and enables downstream data analysis. We demonstrate the performance of our framework on simulated and real FCM data. The software is available in the flowMerge package through the Bioconductor project.
Flow cytometry (FCM) can be applied in a high-throughput fashion to process thousands of samples per day. However, data analysis can be a significant challenge because each data set is a multiparametric description of millions of individual cells. Consequently, despite widespread use, FCM has not reached its full potential due to the lack of an automated analysis platform to assist high-throughput data generation.
A critical bottleneck in data analysis is gating, the identification of groups of similar cells for further study. The process involves identification of regions in multivariate space containing homogeneous cell populations of interest. Generally, gating has been performed manually by expert users, but manual gating is subject to user variability, which can potentially impact results.
A number of methods have been developed to automate the gating process.
A more recent approach compensates for these effects by applying a data transformation during the model fitting process.
These model-based gating methods effectively amount to clustering of the data and generally employ likelihood-based measures such as the Bayesian information criterion (BIC) or Akaike information criterion (AIC) to select an appropriate model (number of clusters) from a range of possibilities.
An alternative measure recently proposed for model selection is the Integrated Complete Likelihood (ICL).
In flow cytometry, where the shapes of cell populations can be asymmetric and nonconvex, neither of the above model fitting criteria is well suited to the clustering problem. An ideal model would allow multiple mixture components to represent an individual cluster or cell population, thus providing both a good fit to the data and a good estimate of the number of distinct clusters. Such an algorithm has recently been proposed for Gaussian mixture models (GMMs).
Here we extend the work of Baudry et al. to subpopulation identification in flow cytometry data.
Distributional assumptions, data transformation, and model selection criteria for the five clustering models compared in this study.
Distribution | Transformation | Model selection criteria | Model name |
---|---|---|---|
Multivariate- | Box-Cox | BIC | flowClust |
Multivariate- | Box-Cox | ICL | flowClust |
Multivariate- | Box-Cox | Fixed K | flowClust |
Multivariate- | Box-Cox | BIC, entropy | flowMerge |
Multivariate- | Box-Cox | BIC, entropy, fixed K | flowMerge |
Gaussian | None | BIC | GMM |
Gaussian | None | ICL | GMM |
Gaussian | None | Fixed K | GMM |
Employing the cluster merging algorithm under the flowClust framework provides a better fit and a better estimate of the number of distinct cell populations for complicated flow cytometry data distributions than either the flowClust
We embed the cluster merging algorithm within the flowClust framework available in Bioconductor.
We have implemented the cluster merging algorithm described in [
Baudry et al. suggest two data-driven approaches for choosing the optimal
It is important to be able to have a parametric representation of merged clusters in order to summarize characteristics of the population. To this end, we model a merged cluster as a multivariate
The expressions in (
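As a minimal sketch of how a merged cluster can be summarized parametrically, the Python code below collapses the components of a merged cluster into a single weight, mean, and covariance by moment matching. This is a standard identity for mixtures and is offered as an assumption; it is not necessarily the expression used in the flowMerge package. The function name `merge_moments` is hypothetical, and the inputs are treated as component covariance matrices.

```python
def merge_moments(weights, means, covs):
    """Summarize several mixture components by one weight, mean, and covariance.

    weights: mixing proportions of the components being merged
    means:   list of mean vectors (lists of floats)
    covs:    list of covariance matrices (lists of lists of floats)
    Assumption: plain moment matching of a mixture, not the flowMerge code.
    """
    total = sum(weights)
    props = [w / total for w in weights]  # proportions within the merged cluster
    d = len(means[0])
    # Merged mean is the proportion-weighted average of the component means.
    mean = [sum(p * m[i] for p, m in zip(props, means)) for i in range(d)]
    # Merged covariance is within-component plus between-component scatter.
    cov = [[0.0] * d for _ in range(d)]
    for p, m, c in zip(props, means, covs):
        for i in range(d):
            for j in range(d):
                cov[i][j] += p * (c[i][j] + (m[i] - mean[i]) * (m[j] - mean[j]))
    return total, mean, cov
```

The between-component term is what lets a single summary reflect the spread of a merged cluster whose components sit at different locations.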
Our stopping criterion for merging is based on the relationship between the number of clusters in a solution and the clustering entropy of that solution. Intuitively, when mixture components overlap significantly, the entropy of clustering is large. As components are combined in successive iterations of the merging algorithm, the entropy decreases. When only well-separated components are left in the clustering solution, further merging has little impact on the total entropy of clustering. This is reflected in a change of slope in the plot of the clustering entropy versus the number of components at the point where the remaining clusters are well separated. We refer to the solution at this point as the optimal flowMerge solution.
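The behavior of the entropy under merging can be illustrated with a short Python sketch. It assumes the entropy of clustering is computed from the matrix of posterior membership probabilities, and that two components are merged by adding their posterior probabilities, with the pair chosen greedily to give the lowest resulting entropy. All function names are hypothetical, and this is an illustration of the idea rather than the flowMerge implementation.

```python
import math
from itertools import combinations

def clustering_entropy(post):
    """Entropy -sum_i sum_k z_ik * log(z_ik) of posterior probabilities z."""
    return -sum(z * math.log(z) for row in post for z in row if z > 0.0)

def merge_columns(post, a, b):
    """Merge components a and b by adding their posterior probabilities."""
    merged = []
    for row in post:
        kept = [z for k, z in enumerate(row) if k not in (a, b)]
        merged.append(kept + [row[a] + row[b]])  # merged component goes last
    return merged

def best_merge(post):
    """Return (entropy, a, b, posteriors) for the pair whose merge gives the lowest entropy."""
    best = None
    for a, b in combinations(range(len(post[0])), 2):
        candidate = merge_columns(post, a, b)
        ent = clustering_entropy(candidate)
        if best is None or ent < best[0]:
            best = (ent, a, b, candidate)
    return best

def entropy_path(post):
    """List of (number of clusters, entropy) pairs under greedy merging down to one cluster."""
    path = [(len(post[0]), clustering_entropy(post))]
    while len(post[0]) > 1:
        ent, _, _, post = best_merge(post)
        path.append((len(post[0]), ent))
    return path
```

For example, with two heavily overlapping components and one well-separated component, the first merge removes almost all of the entropy and later merges change it very little, which is the change of slope described above.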
We formalize this idea by fitting a piecewise linear regression to the entropy versus the number of clusters in the series of flowMerge models, allowing the regression to have either one or two segments (i.e., no changepoint or one changepoint). Furthermore, we force the location of the changepoint to be an integer, reflecting the discrete nature of the clustering. Formally, if we have
When
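The changepoint selection can be sketched in Python as follows. Several details here are assumptions, since the formal definition is not reproduced above: the two segments are fitted as independent least-squares lines, the changepoint is taken as the largest number of clusters in the left segment, and the models are compared with a Gaussian BIC that counts three parameters for a single line and six for two lines plus a changepoint. The function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns the residual sum of squares."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))

def bic(rss, n, n_params):
    """Gaussian BIC up to a constant (smaller is better); RSS is floored to avoid log(0)."""
    return n * math.log(max(rss, 1e-12) / n) + n_params * math.log(n)

def select_changepoint(n_clusters, entropy):
    """Return the integer changepoint preferred by BIC, or None if one line is preferred."""
    pairs = sorted(zip(n_clusters, entropy))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    n = len(xs)
    # Single-segment model: intercept, slope, and residual variance.
    best_score = bic(fit_line(xs, ys), n, 3)
    best_cp = None
    # Each segment needs at least two points; xs[i - 1] closes the left segment.
    for i in range(2, n - 1):
        rss = fit_line(xs[:i], ys[:i]) + fit_line(xs[i:], ys[i:])
        score = bic(rss, n, 6)
        if score < best_score:
            best_score, best_cp = score, xs[i - 1]
    return best_cp
```

Restricting candidate changepoints to observed cluster counts keeps the location an integer, in line with the discrete nature of the clustering. A return value of `None` corresponds to the no-changepoint model.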
We flag potential cases where the merging algorithm fails to identify a good solution through several different criteria.
If the number of clusters in the flowMerge solution is equal to the number of clusters in the flowClust
If the number of clusters in the flowMerge solution is less than the number of clusters in the flowClust
If no changepoint is detected (BIC selects the no-changepoint model).
If the entropy of the flowMerge solution is unusually high (an outlier) compared to the entropy of the flowMerge solution for comparable samples using the same markers.
In the above cases, samples are flagged for manual inspection of the automated gating. To facilitate the comparison in (
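Two of the flagging criteria above, the absence of a changepoint and an unusually high entropy relative to comparable samples, can be sketched in Python. The outlier rule is not specified in the text, so this sketch assumes Tukey's rule (above the third quartile plus 1.5 times the interquartile range) purely for illustration. It also assumes the entropies passed in have already been normalized. The function name `flag_samples` and the input layout are hypothetical.

```python
import statistics

def flag_samples(norm_entropy, changepoint):
    """Flag samples for manual inspection.

    norm_entropy: dict mapping sample name to normalized entropy of its merged solution
    changepoint:  dict mapping sample name to the selected changepoint, or None
    Assumption: "unusually high" entropy means above Q3 + 1.5 * IQR.
    """
    q1, _, q3 = statistics.quantiles(list(norm_entropy.values()), n=4)
    upper = q3 + 1.5 * (q3 - q1)
    flags = {}
    for sample, ent in norm_entropy.items():
        reasons = []
        if changepoint.get(sample) is None:
            reasons.append("no changepoint")
        if ent > upper:
            reasons.append("high entropy")
        if reasons:
            flags[sample] = reasons
    return flags
```

In practice the comparison would be made within groups of samples stained with the same markers, so the function would be called once per group.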
We applied the cluster merging algorithm to a real-world data set consisting of 137 samples from 18 individuals with chronic lymphocytic leukemia (CLL) provided by the BC Cancer Agency. The data set is composed of between six and seven samples per individual. Each sample is labeled with three fluorescent markers. The entire panel of markers is designed for immunophenotyping of lymphomas in a clinical setting (Table
Summary of the antibody markers used in the CLL data.
Antibody combination | Ab1 | Ab2 | Ab3 | No. tubes |
---|---|---|---|---|
1 | CD10 | CD11 | CD20 | 18 |
2 | CD45 | CD14 | CD19 | 18 |
3 | CD5 | CD19 | CD3 | 18 |
4 | CD5 | CD19 | CD38 | 5 |
5 | CD5 | ZAP70 | CD19 | 1 |
6 | CD5 | ZAP70 | CD3 | 1 |
7 | CD57 | CD2 | CD8 | 4 |
8 | CD57 | CD56 | CD3 | 4 |
9 | CD7 | CD4 | CD8 | 13 |
10 | FMC7 | CD23 | CD19 | 18 |
11 | IgG | IgG | IgG | 1 |
12 | IgG1 | IgG1/IgG2a | IgG2 | 13 |
13 | Kappa | Lambda | CD19 | 18 |
We performed automated gating using flowClust on the forward scatter and side scatter channels, followed by cluster merging of the optimal flowClust
We simulated data from the empirical distribution of a real FCM data set. Based on the CD8 versus CD4 projection of a CLL sample, we estimated the empirical distribution using a two-dimensional kernel density estimator on a 100 by 100 point grid, and sampled 100 data sets of size
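A simulation of this kind can be sketched in Python as follows. The sketch assumes a Gaussian product kernel with a user-supplied bandwidth, draws grid cells with probability proportional to the estimated density, and jitters each draw uniformly within its cell. The kernel, the bandwidth, and the jitter step are all assumptions, and the function names are hypothetical.

```python
import math
import random

def kde_on_grid(points, bandwidth, grid_size=100):
    """Evaluate an unnormalized Gaussian product-kernel density on a square grid."""
    hx, hy = bandwidth

    def axis(values):
        lo, hi = min(values), max(values)
        step = (hi - lo) / (grid_size - 1)
        return [lo + i * step for i in range(grid_size)]

    gx = axis([p[0] for p in points])
    gy = axis([p[1] for p in points])
    density = [[0.0] * grid_size for _ in range(grid_size)]
    for x, y in points:
        # The product kernel factorizes, so each axis is evaluated once per point.
        kx = [math.exp(-0.5 * ((g - x) / hx) ** 2) for g in gx]
        ky = [math.exp(-0.5 * ((g - y) / hy) ** 2) for g in gy]
        for i, wx in enumerate(kx):
            row = density[i]
            for j, wy in enumerate(ky):
                row[j] += wx * wy
    return gx, gy, density

def sample_from_grid(gx, gy, density, n_events, rng):
    """Draw events by picking grid cells in proportion to density, then jittering."""
    cells = [(i, j) for i in range(len(gx)) for j in range(len(gy))]
    weights = [density[i][j] for i, j in cells]
    dx = gx[1] - gx[0]
    dy = gy[1] - gy[0]
    draws = rng.choices(cells, weights=weights, k=n_events)
    return [(gx[i] + rng.uniform(-0.5, 0.5) * dx,
             gy[j] + rng.uniform(-0.5, 0.5) * dy) for i, j in draws]
```

Because the simulated events come from a nonparametric estimate rather than from a mixture model, they do not favor any of the parametric models being compared. Repeating `sample_from_grid` with the same density gives the replicate data sets. A pure-Python loop like this is slow for full-size FCM files and is meant only to show the steps.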
We compared the number of clusters identified by the flowClust
flowClust
Examples of the flowClust
In contrast, the flowClust
The flowMerge solution derived from the flowClust
We performed automated gating in the fluorescence channels on the lymphocyte subpopulation derived from the previous autogating step. In 60/137 cases (43%), the GMM
The number of clusters chosen by the flowClust
The flowClust
Example of flowClust
In contrast, for the same sample, the flowClust
The choice of the number of clusters for the flowMerge solution is automated by fitting a piecewise linear model to the entropy versus number of clusters (Figure
We identify cases where cluster merging fails by examining the distribution of the entropy of the flowMerge solution across multiple comparable samples (Figures
Detecting failed cluster merging. (a) Distribution of the entropy (normalized for the number of events and clusters) of the flowMerge solution for forward versus side scatter (left) and fluorescence channels (right) across 137 samples. (b) The relationship between the normalized entropy and the number of clusters in the flowMerge solution for forward scatter versus side scatter (left) and fluorescence channels (right). (c) Example of flowMerge solutions with unusually high normalized entropy from the right tail of the distribution for forward versus side scatter (left) and fluorescence (right). (d) A plot of the normalized entropy versus samples grouped by antibody labels identifies antibody combinations that are problematic for automated gating with the automated merging algorithm.
Simulation results for CD4 versus CD8 dimensions of a CLL sample. (a) The 2D kernel density estimate of the real CD4 versus CD8 data. Gates for the CD4+/CD8
We simulated 100 data sets of CD8 versus CD4 fluorescence based on the empirical distribution of real CD8 versus CD4 CLL data. This simulation approach ensured that the simulated data were not biased towards any of the models under investigation. These data had three cell subpopulations defined based on the contours in the CD4 versus CD8 dimensions. These included CD4+/CD8
We compared the number of clusters selected under the optimal flowClust
Mean, standard deviation, 95% coverage, and bias of the estimated number of clusters for each model, as well as the mean, standard deviation and 95% coverage for the misclassification rate of each model. CI: coverage interval.
Statistic | Model | Mean | SD | 95% CI | Bias |
---|---|---|---|---|---|
Number of clusters | flowClust | 9.03 | 1.59 | 6–12 | 6.03 |
Number of clusters | flowClust | 2.00 | - | 2–2 | |
Number of clusters | GMM | 10.41 | 1.31 | 8–12 | 7.14 |
Number of clusters | flowMerge | 5.45 | 0.97 | 4–7 | 2.45 |
Misclassification rate | flowClust | 0.103 | 0.00826 | 0.0937–0.112 | - |
Misclassification rate | GMM | 0.124 | 0.00537 | 0.114–0.134 | - |
Misclassification rate | flowMerge | 0.0445 | 0.0104 | 0.0312–0.0669 | - |
Misclassification rate (best model) | flowClust | 0.398 | 0.101 | 0.230–0.613 | - |
Misclassification rate (best model) | GMM | 0.499 | 0.0756 | 0.339–0.625 | - |
Misclassification rate (best model) | flowMerge | 0.0685 | 0.0223 | 0.0383–0.121 | - |
We also compared the misclassification rates for the different models, relative to class assignments from manual gating. This was done in two ways. First, we fixed the number of clusters to the true number
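A misclassification rate relative to manual gating can be computed along the following lines. The text does not state how clusters are matched to manual classes, so this Python sketch assumes each cluster is assigned to the manual class most common among its events. The function name is hypothetical.

```python
from collections import Counter, defaultdict

def misclassification_rate(cluster_labels, manual_labels):
    """Fraction of events whose cluster's majority manual class differs from their own class.

    Assumption: each cluster is mapped to its most frequent manual class.
    """
    by_cluster = defaultdict(Counter)
    for cluster, manual in zip(cluster_labels, manual_labels):
        by_cluster[cluster][manual] += 1
    # Events in the majority class of their cluster count as correctly classified.
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 1.0 - correct / len(manual_labels)
```

Under a majority-vote mapping, a model with many small clusters can score a low rate simply by over-splitting, which is one reason to also compare models with the number of clusters fixed.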
The misclassification rates for the optimal flowClust
Model-based automated gating of flow cytometry data is difficult when cell subpopulations are nonconvex or have complicated multidimensional shapes that are not readily modeled by single components of simpler multivariate distributions. This issue is resolved, in part, by allowing multiple mixture components to represent the same cell subpopulation. However, for further analysis, cell subpopulations are generally summarized by a variety of statistics, which requires summarizing an arbitrary number of mixture components for a single cell subpopulation. Consequently, the cluster merging algorithm is not suitable for application to flow cytometry data without further modifications. By taking advantage of the fact that a merged cluster is itself a mixture (see (
Comparison of the cluster merging algorithm with other automated gating models (Table
We use a changepoint model to estimate the optimal number of clusters in the merged solution. This allows the cluster merging algorithm to be implemented in a high-throughput pipeline for flow cytometry data analysis. In general, this approach provides satisfactory results for both the forward versus side scatter dimensions and the fluorescence dimensions (Figures
Our results on real flow data demonstrate that the cluster merging algorithm improves our ability to identify the lymphocyte cell subpopulation from the forward versus side scatter dimensions. This high-density subpopulation is often represented by multiple mixture components in the flowClust
Our cluster merging framework provides a robust modeling approach for automated gating of flow cytometry data. It provides a good compromise between the flowClust
The authors thank Andrew Weng and Randy Gascoyne for providing the flow cytometry data. This work was supported by NIH Grant EB008400, by the Michael Smith Foundation for Health Research (for the second and third authors), and by the Natural Sciences and Engineering Research Council of Canada (for the second author).