We demonstrate the application and comparative interpretations of three tree-based algorithms for the analysis of data arising from flow cytometry: classification and regression trees (CARTs), random forests (RFs), and logic regression (LR). Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1 infected persons starting antiretroviral therapy with CD4 count between 200 and 350 cell/

Advances in flow cytometry, particularly technological developments that facilitate acquisition of multiparameter-defined phenotypes, present new opportunities for predicting patient outcomes based on individual-specific cell subset changes. This is particularly relevant to the study of human immunodeficiency virus (HIV), where there is great potential to draw on the rich array of data on host cell-mediated response to infection and drug exposure to inform and discover patient-level determinants of disease progression and/or response to antiretroviral therapy (ART). We describe three existing analytic approaches, designed specifically for uncovering complex structures, and their application to high-density multiparameter cell subset data arising from flow cytometry. We demonstrate the usefulness of each approach for novel discovery in this context, as well as the contrasting clinical associations that each approach is tailored to address.

The data motivating our research were collected during the pre-randomization stage of the South Africa Structured Treatment Interruption (SASTI) trial, an ongoing noninferiority trial that aims to determine whether patients whose ART is interrupted after achieving immune control on therapy will continue to retain the immune reconstitution benefits of therapy. Data on multiple immunological parameters were collected, by way of flow cytometry, on all study participants at the start of ART and periodically over the course of the trial. The aim of our present investigation is to illustrate how tree-based machine learning algorithms can be applied to characterize the predictive capacity of a large number of immunological variables, collected at therapy initiation, with regard to a single, clinically relevant measure of immune reconstitution at a fixed time point on continuous therapy and prior to randomization.

We begin by presenting briefly a commonly applied, univariate analysis approach for testing the association between each immunological parameter, individually, and the outcome of interest. We then present three tree-based methods that are designed for discovery of complex structures of association in high-dimensional data settings: (

Notably, the usefulness of CART for immunophenotyping is discussed in Beckman et al. [

The SASTI trial began in 2006 and led to the successful recruitment of

Cellular immunophenotypes were studied using flow cytometry. Stainings were performed on fresh whole blood at the Department of Hematology and Molecular Medicine, National Health Laboratory Service and University of the Witwatersrand, Johannesburg, South Africa. Briefly, whole blood samples were stained for surface marker detection using fluorochrome-labeled monoclonal antibodies (mAbs) lyophilized on 96-well plates (Lyoplates, BD Biosciences, San Jose, CA). Fluorochrome binding was detected using a 4-color FACSCalibur flow cytometer (BD Biosciences). Cellular subsets were analysed using proprietary software (CellQuest, BD Biosciences). Percent of positive cells was calculated based on isotype-matched control mAb binding. Whole blood samples were stained with monoclonal antibody (mAb) combinations (given in Table

Background staining was assessed using isotype-matched mAb (staining 1); this method is generally considered acceptable for surface flow cytometry of lymphocytes.

Postrun electronic event gating was performed using CellQuest software (BD Biosciences), based on the use of multiple 2-color quadrants. A first gating assessed expression of CD3 and CD8 (stainings 2, 3, 4, 6), CD3 and HLA-DR (staining 5), CD3 and CD45 (staining 7), and Lin-1 and HLA-DR (staining 8). Events falling in the quadrants of interest were further gated using quadrants to explore the expression of the remaining markers. The number of events falling in each quadrant was collected. Results are expressed as percent of gated/total events unless otherwise specified.
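The conversion from quadrant event counts to percent of gated events can be illustrated with a minimal sketch. The function name `quadrant_percents` and the event counts below are hypothetical and chosen only for illustration; they are not taken from the study data or from the CellQuest software.

```python
def quadrant_percents(counts):
    """Convert event counts per quadrant into percent of gated events."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no gated events")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Hypothetical counts for the four quadrants of one CD3/CD8 gate.
counts = {"CD3-CD8-": 1200, "CD3+CD8-": 4300, "CD3-CD8+": 500, "CD3+CD8+": 4000}
percents = quadrant_percents(counts)

for name, pct in percents.items():
    print(f"{name}: {pct:.1f}%")
```

Because the four quadrants partition the gated events, the four percentages computed this way necessarily sum to 100, so the resulting variables are correlated by construction.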

For T cell subset assessment, the CD4+ T lymphocyte subset was directly stained using CD4 mAb only in staining 7 (single platform CD4 count [

4-color stainings employed for flow cytometric analysis.

Staining no. | FITC | PE | PerCP-Cy5.5 | APC |
---|---|---|---|---|
1 | Ig | Ig | Ig | Ig |
2 | CD45RA | CD62L | CD3 | CD8 |
3 | CD38 | CD28 | CD3 | CD8 |
4 | HLA-DR | CD95 | CD3 | CD8 |
5 | CD56 | CD16 | CD3 | HLA-DR |
6 | CD7 | CD154 | CD3 | CD8 |
7 | CD8 | CD4 | CD45 | CD3 |
8 | Lin-1 | CD123 | HLA-DR | CD11c |

In this paper, we focus on assessing the relationships among multiple baseline flow cytometry variables collected at initiation of ART and the variability in achieving a robust CD4+ T-cell count rise on ART, in the context of restricting the range of starting CD4 count between 200 and 350 cells/

Univariate associations with CD4+ count at 36 weeks on ART.

Predictor | Odds ratio | P value |
---|---|---|
CD3-DR-CD56+CD16+ | 0.183 | .008 |
Lin-DR- | 0.228 | .018 |
CD45+CD3+ | 0.274 | .035 |
CD3+CD8+CD38+CD28+ | 0.281 | .047 |
CD3-CD8+ | 0.323 | .084 |
CD3-DR+CD56+CD16+ | 0.339 | .087 |
CD3+CD8-CD7+CD154+ | 0.339 | .087 |
CD3-DR-CD56-CD16- | 0.364 | .113 |
CD3+CD8+CD7+CD154+ | 0.388 | .463 |
CD3+CD8-CD7+CD154- | 0.389 | .146 |
CD3+CD8-DR+CD95- | 0.429 | .189 |
CD3+DR- | 0.460 | .236 |
CD3+CD8-CD45RA+CD62L+ | 0.477 | .283 |
CD3+CD8-DR+CD95+ | 0.477 | .283 |
CD45-CD3+ | 0.494 | .632 |
CD3+CD8- | 0.494 | .632 |
Lin-DR+CD123+CD11c+ | 0.564 | .424 |
CD3+CD8+DR+CD95+ | 0.628 | .571 |
CD3-DR+ | 0.646 | .586 |
CD3+CD8-DR-CD95- | 0.646 | .586 |
CD45+CD3+CD8-CD4- | 0.703 | .690 |
CD3+CD8+CD38-CD28+ | 0.709 | .699 |
CD3+CD8-CD7-CD154- | 0.740 | .774 |
CD3+CD8+ | 0.752 | .786 |
CD3+CD8-CD38+CD28+ | 0.759 | .797 |
CD3+CD8+CD38+CD28- | 0.760 | .812 |
CD3-DR+CD56-CD16- | 0.805 | .887 |
CD3-DR+CD56+CD16- | 0.805 | .887 |
CD3+CD8+CD38-CD28- | 0.813 | .898 |
CD45-CD3- | 0.862 | .989 |
Lin-DR+CD123+CD11c- | 0.913 | .917 |
CD3+CD8+CD45RA+CD62L- | 0.923 | .908 |
CD3+CD8+CD45RA-CD62L- | 0.931 | .898 |
CD45+CD3+CD8-CD4+ | 0.931 | .898 |
CD3+CD8+DR+CD95- | 0.938 | .887 |
CD3+CD8-CD7-CD154+ | 0.962 | .696 |
CD3+CD8-CD38+CD28- | 0.996 | .797 |
CD3+CD8+DR-CD95- | 1.004 | .797 |
Lin+DR+ | 1.074 | .898 |
CD3-DR-CD56-CD16+ | 1.074 | .898 |
CD3-CD8- | 1.074 | .898 |
Lin+DR- | 1.149 | 1.000 |
CD45+CD3- | 1.160 | .989 |
CD3+CD8-CD38-CD28- | 1.160 | .989 |
CD3+CD8-CD45RA+CD62L- | 1.230 | .898 |
CD45+CD3+CD8+CD4- | 1.317 | .797 |
CD3+DR+ | 1.329 | .786 |
Lin-DR+CD123-CD11c- | 1.329 | .786 |
CD3+CD8-DR-CD95+ | 1.329 | .786 |
CD3+CD8+DR-CD95+ | 1.410 | .699 |
CD45+CD3+CD8+CD4+ | 1.410 | .699 |
CD3+CD8+CD45RA-CD62L+ | 1.422 | .690 |
CD3-DR- | 1.446 | .677 |
CD3-DR+CD56-CD16+ | 1.486 | .661 |
Lin-DR+ | 1.511 | .605 |
CD4+ | 1.522 | .598 |
Lin-DR+CD123-CD11c+ | 1.522 | .598 |
CD3+CD8-CD45RA-CD62L+ | 1.630 | .512 |
CD3-DR-CD56+CD16- | 1.630 | .512 |
CD3+CD8+CD7+CD154- | 1.657 | .502 |
CD3+CD8+CD7-CD154- | 1.707 | .487 |
CD3+CD8-CD38-CD28+ | 1.898 | .354 |
CD3+CD8-CD45RA-CD62L- | 2.011 | .294 |
CD3+CD8+CD45RA+CD62L+ | 2.152 | .238 |

We present a univariate analysis and three tree-based algorithms. The tree-based approaches involve recursive splitting of the data, based on the value of predictor variables, in a manner that broadly captures the variability in a single outcome. All three approaches are nonparametric and can be applied in the context of a large number of predictors and a single binary or quantitative trait. CART and RFs can handle both quantitative and binary predictor variables, while logic regression requires dichotomous inputs. For clarity of presentation, we dichotomize all of the potential predictors a priori. Further discussion of this, including model sensitivity to choice of inputs, is given in Section
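As an illustration of dichotomizing a quantitative predictor a priori, the following minimal sketch splits values at a cut point. The default of cutting at the sample median is an assumption made here for illustration; the cut point actually used in the analysis is not stated in this passage. The function name `dichotomize` and the input values are hypothetical.

```python
from statistics import median


def dichotomize(values, threshold=None):
    """Return 1 for values above the cut point and 0 otherwise.

    If no threshold is supplied, the sample median is used
    (an assumption for illustration, not the paper's stated rule).
    """
    cut = median(values) if threshold is None else threshold
    return [1 if v > cut else 0 for v in values]


# Hypothetical percent-positive values for one cell subset.
subset_percent = [12.5, 30.1, 8.0, 22.4, 15.0]
binary = dichotomize(subset_percent)
print(binary)
```

The choice of cut point can affect which predictors appear important, which is one reason sensitivity to the choice of inputs merits discussion.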

Suppose we have

Measuring and testing the association between a single categorical predictor and a binary outcome is typically achieved through a contingency table analysis. The odds ratio, defined as the odds of disease given exposure divided by the odds of disease given no exposure, is a well-described measure of association in this context and is given formally by
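In terms of the four cell counts of a 2×2 contingency table, the odds ratio as defined in words above reduces to the cross-product ratio. The sketch below computes it directly; the cell labels `a`, `b`, `c`, `d` and the example counts are assumptions introduced for illustration and do not come from the study data.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table.

    a: exposed, outcome present      b: exposed, outcome absent
    c: unexposed, outcome present    d: unexposed, outcome absent
    """
    if b == 0 or c == 0:
        raise ZeroDivisionError("odds ratio undefined when b or c is zero")
    # (a / b) is the odds given exposure; (c / d) is the odds given no exposure.
    return (a * d) / (b * c)


# Hypothetical counts: 20 exposed and 20 unexposed subjects.
estimate = odds_ratio(a=4, b=16, c=12, d=8)
print(round(estimate, 3))
```

An odds ratio below 1 indicates that the outcome is less likely among the exposed; testing whether it differs from 1 requires a separate procedure, such as an exact or chi-squared test on the same table.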

Classification and Regression Trees (CARTs) are an alternative, nonparametric approach that allows us to model simultaneously the relationship between an outcome and

Formally, let the node

Once a tree is constructed, as shown in Figure

Classification tree (unpruned).
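The recursive splitting that builds a classification tree can be illustrated by a single split step. The sketch below assumes binary predictors and uses Gini impurity, which is one common choice of node impurity; the impurity measure used in the analysis is not shown in this passage, so this is an assumption. The names `gini` and `best_split` and the toy data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node holding binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(X, y):
    """Return the binary predictor giving the largest impurity reduction.

    X is a list of dicts mapping predictor name to 0/1; y is a list of 0/1.
    """
    n = len(y)
    parent = gini(y)
    best_name, best_gain = None, 0.0
    for name in X[0]:
        left = [yi for xi, yi in zip(X, y) if xi[name] == 0]
        right = [yi for xi, yi in zip(X, y) if xi[name] == 1]
        if not left or not right:
            continue  # this predictor does not divide the node
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Toy data: predictor "A" separates the outcome perfectly, "B" only weakly.
X = [{"A": a, "B": b} for a, b in zip([0, 0, 1, 1, 0, 1], [0, 1, 0, 1, 1, 0])]
y = [0, 0, 1, 1, 0, 1]
name, gain = best_split(X, y)
print(name, gain)
```

A full tree applies the same step recursively to each resulting child node until a stopping rule is met.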

Random Forest (RF), originally proposed by [

The RF algorithm is summarized by the following step-by-step procedure: (

For each predictor,
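Two ingredients of the RF procedure, drawing a bootstrap sample with its out-of-bag (OOB) complement and scoring a predictor by permutation, can be sketched as follows. This is a simplified illustration under stated assumptions: a real forest grows one tree per bootstrap sample and averages the importance over trees using each tree's OOB data, whereas here a stand-in `predict` function plays the role of a fitted tree. All function names and data are hypothetical.

```python
import random


def bootstrap_indices(n, rng):
    """Sample n indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


def permutation_importance(predict, X, y, name, rng):
    """Drop in accuracy after randomly permuting the values of one predictor."""
    def accuracy(rows):
        return sum(predict(r) == yi for r, yi in zip(rows, y)) / len(y)

    shuffled = [r[name] for r in X]
    rng.shuffle(shuffled)
    permuted = [{**r, name: v} for r, v in zip(X, shuffled)]
    return accuracy(X) - accuracy(permuted)


rng = random.Random(0)
in_bag, oob = bootstrap_indices(10, rng)

# Stand-in for a fitted tree: predicts the outcome from predictor "A" alone.
X = [{"A": a, "B": b} for a, b in zip([0, 0, 1, 1, 0, 1], [0, 1, 0, 1, 1, 0])]
y = [0, 0, 1, 1, 0, 1]
importance_b = permutation_importance(lambda r: r["A"], X, y, "B", rng)
importance_a = permutation_importance(lambda r: r["A"], X, y, "A", rng)
print(importance_a, importance_b)
```

Permuting a predictor that the classifier never uses leaves accuracy unchanged, so its importance is zero, while permuting an informative predictor can only lower or preserve accuracy in this example.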

Logic regression (LR) is another tree-based approach that is increasingly popular for the analysis of high-dimensional data. LR searches specifically for models that are comprised of combinations of Boolean expressions of the predictors [
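To make the notion of a Boolean expression of the predictors concrete, the following sketch evaluates a small logic tree on binary inputs. It covers only the evaluation of a given expression; the search over candidate trees and the fitting of a regression model on the resulting Boolean terms are not shown. The tuple encoding, the function `evaluate`, and the predictor names `X1`, `X2`, `X3` are assumptions made for illustration.

```python
def evaluate(expr, row):
    """Evaluate a Boolean expression tree on one observation.

    expr is either a predictor name (str) or a tuple:
    ("not", e), ("and", e1, e2), or ("or", e1, e2).
    """
    if isinstance(expr, str):
        return bool(row[expr])
    op = expr[0]
    if op == "not":
        return not evaluate(expr[1], row)
    if op == "and":
        return evaluate(expr[1], row) and evaluate(expr[2], row)
    if op == "or":
        return evaluate(expr[1], row) or evaluate(expr[2], row)
    raise ValueError(f"unknown operator: {op}")


# (X1 AND NOT X2) OR X3
tree = ("or", ("and", "X1", ("not", "X2")), "X3")

rows = [
    {"X1": 1, "X2": 0, "X3": 0},
    {"X1": 1, "X2": 1, "X3": 0},
    {"X1": 0, "X2": 1, "X3": 1},
]
results = [evaluate(tree, r) for r in rows]
print(results)
```

Each such tree yields a single binary covariate per subject, which is why the method requires dichotomous inputs.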

We report the results of applying a univariate analysis and each of the tree-based methods described above to data arising from the SASTI trial detailed in Section

The univariate analysis results are provided in Table

An unpruned classification tree, based on a stopping rule of

The results of applying the RF algorithm to these data are given in Figure

Variable importance scores from application of an RF.

Finally, we applied LR to the data and the resulting trees are presented in Figure

Logic regression trees.

First element of LR (

Second element of LR (

The goal of this study is to compare a number of tree-based methods in their ability to select immunological predictors of CD4 reconstitution in HIV-infected subjects initiating antiretroviral treatment. Earlier studies from our group have demonstrated that pre-ART CD95 expression on CD8+ T cells is negatively associated with the frequency of plasmacytoid dendritic cells (PDCs) after 52 weeks of treatment [

We describe the application of a univariate approach and three tree-based methods for the analysis of the association between a single trait and multiple variables arising from flow cytometric analysis. Interestingly, for this data example, the univariate contingency table analysis and RFs resulted in similar findings in terms of the ranking of important variables. This may not always be the case, since as we describe in Section

Notably, a high degree of correlation is intrinsic to the variables included in our analysis of flow cytometry data. Specifically, events passing a certain logical gate are assessed for co-expression of two fluorochromes, and separated in quadrants based on the intensity (above or below a certain level) of each fluorochrome. Thus, any increase in the percent of events falling in one quadrant must correspond to a decrease in the percent of events that fall in one or more of the other quadrants. For example, the variables CD3-CD8-, CD3+CD8-, CD3-CD8+, and CD3+CD8+ arise from four quadrants on the same plate for each individual and thus always sum to

This paper represents an attempt to utilize data from experimental and clinical laboratory settings that are available in resource-constrained settings. While it is generally good scientific practice to avoid unnecessary assessment, limiting stainings and maximizing the usefulness of current resource capacity is paramount in the settings in which these experiments were conducted. Because the use of multicolor flow cytometers is restricted to resource-rich clinical and research settings, we have elected to use the output of more commonly available 4-color analytical instruments, in the hope that any information gained from this approach is applicable in resource-constrained settings such as those in which the study was conducted. We also acknowledge that the clinical interpretability of the findings in this data setting is limited. Specifically, the full panel of mAbs used for this paper would not be applicable to general practice, particularly in resource-constrained settings, due to issues of cost and laboratory capacity. This panel was in fact used in an experimental setting, to investigate in detail the effects of ART on individual immune subsets. However, the purpose of this paper is to identify which of the baseline, pre-ART stainings performed could be useful for predicting the desired outcome (in this case, immune reconstitution as assessed by CD4 counts). We demonstrated how tree-based approaches can be applied to identify a small number of phenotypes that contribute to the selected CD4 recovery outcome. Importantly, many of the cellular subsets (e.g., mature NK cells, myeloid dendritic cells, CD95-expressing activated T lymphocytes) selected using the three tree-based methods presented here as being predictive of immune reconstitution have been previously shown to be individually correlated with disease progression and/or immune reconstitution [

Importantly, differences in the insights offered by each of the approaches presented are a reflection of the specific algorithms employed and not the result of one approach being more or less correct than another. The univariate analysis, while methodologically sound, only considers associations that exist between single variables and the outcome. Univariate analyses are not designed to discover variables that are only important conditional on the level of another variable. The CART and RF algorithms, on the other hand, search specifically for conditional associations, that is, associations of variables with the outcome within levels of other variables. Finally, logic regression trees allow for discovery of combinations of variables that are predictive, even in the setting in which no single element of the combination is important on its own. That is, both CART and RF split initially on the single most important variable; if a combination of two or more variables is important but none of them is predictive individually, then CART and RF may both fail to find this association [
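The situation in which a combination is predictive while no single element is can be illustrated with a constructed example. In the toy data below, the outcome is present exactly when one, but not both, of two binary predictors is positive. The data are entirely synthetic and the function name `marginal_odds_ratio` is hypothetical; this is an illustration of the general point, not a result from the study.

```python
from itertools import product


def marginal_odds_ratio(x, y):
    """Odds ratio between one binary predictor x and a binary outcome y."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return (a * d) / (b * c)


# Ten synthetic subjects for each of the four (X1, X2) combinations.
rows = [pair for pair in product([0, 1], repeat=2) for _ in range(10)]
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
y = [a ^ b for a, b in rows]  # outcome present when exactly one marker is positive

or_x1 = marginal_odds_ratio(x1, y)
or_x2 = marginal_odds_ratio(x2, y)

# The Boolean combination (X1 and not X2) or (not X1 and X2) predicts y exactly.
combo = [int((a and not b) or (not a and b)) for a, b in rows]
agreement = sum(ci == yi for ci, yi in zip(combo, y)) / len(y)
print(or_x1, or_x2, agreement)
```

Here each marginal odds ratio equals 1, so a univariate screen would flag neither predictor and a first split on either one yields no impurity reduction, yet the Boolean combination reproduces the outcome exactly.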

In summary, each of the tree-based approaches described herein complements univariate analyses of multiparameter-defined flow cytometry subsets. These methods are designed specifically to uncover complex structures and, as demonstrated in the example above, allow for discovery of combinations of variables that are together predictive of an outcome. While extensions of these methods, including, for example, the recently proposed approach of [

Support for this research was provided by NIH/NIAID R01AI056983 to M. Eliot and A. S. Foulkes and U01AI51986 to L. J. Montaner.