Investigation of Super Learner Methodology on HIV-1 Small Sample: Application on Jaguar Trial Data

Background. Many statistical models have been tested to predict phenotypic or virological response from genotypic data. A statistical framework called Super Learner has been introduced either to compare different methods/learners (discrete Super Learner) or to combine them in a Super Learner prediction method. Methods. The Jaguar trial is used to apply the Super Learner framework. The Jaguar study is an “add-on” trial comparing the efficacy of adding didanosine to an on-going failing regimen. Our aim was also to investigate the impact on the use of different cross-validation strategies and different loss functions. Four different repartitions between training set and validations set were tested through two loss functions. Six statistical methods were compared. We assess performance by evaluating R 2 values and accuracy by calculating the rates of patients being correctly classified. Results. Our results indicated that the more recent Super Learner methodology of building a new predictor based on a weighted combination of different methods/learners provided good performance. A simple linear model provided similar results to those of this new predictor. Slight discrepancy arises between the two loss functions investigated, and slight difference arises also between results based on cross-validated risks and results from full dataset. The Super Learner methodology and linear model provided around 80% of patients correctly classified. The difference between the lower and higher rates is around 10 percent. The number of mutations retained in different learners also varys from one to 41. Conclusions. The more recent Super Learner methodology combining the prediction of many learners provided good performance on our small dataset.


Introduction
The effectiveness of antiretroviral therapy has been limited by the development of human immunodeficiency virus type 1 (HIV-1) drug resistance. HIV-1 frequently develops resistance to the antiretroviral drugs used to treat it which may decrease both the magnitude and the duration of the response to treatment resulting in loss of viral suppression and therapeutic failure [1]. Moreover, there is a high level of cross-resistance within drug classes; a virus that has developed resistance to one drug in a class may also be resistant to other drugs in the same class [2]. Current International AIDS Society USA and French report HIV-1 guidelines recommend resistance testing both before starting antiretroviral therapy (ART) and at treatment failure. Resistance testing has become an important part of choosing and optimizing combination therapy for treating HIV-infected individuals [3]. Selecting a "salvage" regimen for an HIV-infected patient who has developed resistance to his or her current regimen is not straightforward [4].
Genotypic or phenotypic assays are used for resistance testing each, assay having advantages and limitations. From those assays we used either the genotypic-phenotypic correlation, showing phenotypic effect of mutations, or the genotypic-virologic correlation, investigating the impact of mutations on the virological response to a subsequent treatment. The latter correlation is mainly used by the Agence Nationale de Recherches sur le SIDA to build rule-based algorithms (ANRS http://www.hivfrenchresistance.org/). The 2 AIDS Research and Treatment increasing number of antiretroviral drug-resistance-associated mutations has increased the difficulty of the interpretation of those assays [5].
In both cases many HIV-1 drug resistance analysis approaches have been explored, from simple linear models [6] to more sophisticated ones, such as database pattern search method [7], neural networks/machine learning [8][9][10][11], or genotype-phenotype mapping [12]. Such methods, or learners, differ by the mechanism used to search over the space of parameters. It appears that different interpretation systems lead to distinct results [13][14][15]. Current widely used genotypic interpretation systems may have no satisfactory performance on newly derived datasets. Such poor performances emphasize the need for an external validation dataset or a sufficient large database to create a validation set. It has been shown that the variability observed in different rulebased algorithms was mainly due to the patients' baseline characteristics than to the statistical methods used [16,17].
A framework for the unified loss-based estimation suggested a solution to this problem in the form of a new estimator, called the "Super Learner" [18,19]. Initially this methodology, called Discrete Super Learner, compared different learners (methods) on the basis of the loss-based estimation theory and choose the optimal learner for a given prediction problem based on cross-validated risk (repartition between training sample and validation sample) [20]. The Super Learner methodology has been improved building now an estimator based on a linear combination of the different learners investigated [19,21,22].
Originally, the Super Learner used both mean square of residuals (differences between observed and predicted outcomes) and R 2 for evaluation and assessment. However, statistical investigations showed the importance of exploring different loss functions [23], such as first-order coefficient R.
Our aim is to study the performance of the discrete and the most recent Super Learner methodology on a small sample of HIV-1 data from a randomized clinical trial. Especially, based on this methodology, we investigate four different cross-validation setting, and the use of two loss functions for six statistical learning methods. This methodology is applied on the Jaguar trial data [24].

Datasets.
For a patient i, the data consist of a vector X i of binary variables indicating presence or absence of a mutation and Y i denotes the virologic outcome. In the regression setting, the objective is to predict Y using X. Then, the parameter of interest is denoted as E(Y |X). We analyzed the data obtained from the Jaguar trial which are described elsewhere [24]. Briefly the Jaguar trial was a randomized multicenter, double-blind placebo-controlled trial evaluating the efficacy of adding didanosine (ddI) to an on-going antiretroviral (ARV) regimen. Patients were randomly assigned at a ratio 2 : 1 to receive ddI or a matching placebo added to their current regimen. The primary efficacy end point was the magnitude of change in plasma HIV-1 RNA levels in log 10 copies per mL from baseline to week 4. The naïve method was used to compute viral load reduction; that is, all HIV-1 RNA levels <50 copies/mL at week four were fixed at 50 copies/mL. Although censored methods are preferred to compute HIV-1 RNA changes, the low percentage (11%) of patients censored provides in this case an unbiased estimate [25][26][27]. The median changes in HIV-1 RNA at week 4 were −0.56log 10 copies/mL (IQR, −0.14 to −1.2) and +0.07log 10 copies/mL (IQR, 0.12 to 0.21) in patients receiving ddI and placebo, respectively (P < .0001). HIV-1 sequences were available for all patients, but only patients in the ddI group were used in the present work. HIV-1 sequences and HIV-1 RNA reduction at week 4 were available for 102 patients. Mutations were defined as amino acid differences from subtype B consensus wild-type sequence (wild-type virus HXB2). We investigate the virologic impact at week 4 of ten resistance mutations: M41L (prevalence 48%), D67N (34.3%), T69D (8.8%), K70R (26.5%), L74V (8.8%), V118I (18.6%), M184VI (92.2%), L210W (27.5%), T215Y/F (53.9%), and K219Q/E (24.5%). This set has been the starting point for building ANRS ddI rules and was potentially linked to the ddI resistance at the time of the study. Moreover, the choice of using a subset of mutations is driven by Soo Yon Rhee et al. study [28], in which they show that expert mutation selection is preferable than using the entire sequences.

Super Learner. The methodology has been proposed by
Mark van der Laan et al. [18,19] as a setting to choose the optimal learner (method) among a set of candidate learners, this version of the methodology was called the Discrete Super Learner. Recently, the methodology has been refined and proposed a new estimator based on a weighted linear combination of candidate learners to build a Super Learner estimator [19,21,22]. We briefly introduced the general principle and few key features of this methodology. The general strategy for loss-based estimation is driven by the choice of a loss function and relies on cross-validation for estimator selection and performance assessment. Crossvalidation divides the available dataset into k mutually exclusive and exhaustive sets of as nearly equal size as possible. Each set and its complement play the role of the validation and training samples. Observations in the training set are used to construct (or train) the estimators, and observations in the validation set are used to assess the performance (or validate) of the estimators. For each estimator/learner the k risks over the k validation sets are averaged resulting in the so-called cross-validated risk. For example, with a 10fold cross-validation the learning set is partitioned into 10 parts, each part in turn served as a validation set, while the other 9/10ths of the data served as the training set. Based on cross-validated risks, estimators/learners can be ranked from those identified as top learners to those providing poor performance. In the discrete version of the methodology, the optimal learner is applied to the entire dataset. In the most recent version, a new estimator (the Super Learner) is proposed based on a family of weighted combinations of the estimators/learners. The new Super Learner appears as a generalization of the discrete Super Learner.
We applied all individual learners and the new estimator on full dataset (which will be called full model in the following). Learners are ranked from those identified as top learners to those providing poor performance. We investigate four splits: 10-fold, 4-fold, 3-fold, and 2-fold that correspond to 90%, 75%, 66%, and 50% of data use as training samples and 10%, 25%, 33%, and 50% as validation sample respectively. Learners were evaluated using two distinct functions usually used as loss functions: squared error (SqE) and first-order coefficient (R). The SqE is (Y − E(YX)) 2 , that is, the squared difference between observed and predicted outcome. R is the first-order correlation coefficient between Y and E(Y |X), which has been recently used in this context [29]. It is important to note that SqE is unbounded while −1 ≤ R ≤ 1. For all full models, R 2 estimates and accuracy were also computed in addition to SqE and R. We defined two threshold values to define patients having a virologic response: −0.6log 10 copies/mL and −0.5log 10 copies/mL. For example, a patient with an HIV-1 RNA reduction larger than 0.6log 10 copies/mL was classified as responder, otherwise as nonresponder. Patients may also be classified responders or not according to the predicted reduction by a given method.

Candidate Learners
We investigate the following learners: Logic Regression, Deletion/Substitution/Addition, Least squares regression, Random Forest, Classification and Regression Trees. All algorithms are available as free packages of R software.
Logic Regression (package named LogicReg) is an adaptive regression methodology that attempts to construct predictors as Boolean combinations of covariables [30]. Deletion/Substitution/Addition (package named DSA) is polynomial regression dataadaptive that generates candidate predictors as polynomial combinations of binary covariables [31]. Classification and Regression Trees (CARTs) build a regression tree in continuous outcome setting (package rpart) [32]. Random Forest (package RandomForest) is a "bagging predictor" (Bootstrap Aggregating), this method build a model from a combination of high number of regression trees resulting in the so-called Forest [33]. Least squares regression was set up on two datasets: one consisted of all main terms and the second consisted of all main terms plus all twoway interactions (resp. denoted as LM(1) and LM (2)).
From those learners, we set up two Super Learners: Super Learner using five learners, built with D/S/A, LM(1), LM(2), random forest and CART (noted Super Learner-5 in the following), and Super Learner with six learners, the same as Super Learner-5 plus Logic Regression (denoted as Super Learner-6 in the following).
Internal fine-tuning procedure by internal cross-validation was used to obtain the best performance for Logic Regression and D/S/A. The tuning parameters of D/S/A were maxsize = 20 (two times the number of co-variables), maxorderint = 2 and maxsumofpow = 2. All three steps were allowed (Deletion, Substitution, and Addition). CART has complexity parameter (cp) equal to 0.01. For Random Forest the number of trees was 1,000 and the number of variables to randomly consider at each node of each tree was fixed at three (m try = 3). That corresponds to the number of co-variables divided by 3 which is usually used in regression setting. Simple linear regression was used as reference (without variable selection procedure). Methods were ranked; if two or more methods produced the same risk value, the mean rank was assigned (e.g., if Super Learner-5 and LM (1) gave the same SqE, in spite of assigning rank 1 and 2, resp., we noted 1.5 for both).

Results
Results of the Discrete Super Learner and Super Learner-5 are given in Table 1. For example, based on the SqE as loss function and a 10-fold cross-validation, LM(1) was identified as the top learner followed by Random Forest and CART. LM(1) slightly decreases its performance from the 1st rank on 10-fold to 3th rank on 2-fold while Random Forest becomes the second learners for the remaining k-folds. Surprisingly, linear model with interaction terms, LM(2), provided poor performance for all k-fold. The Super Learner-5 provided at least as good performance as the top learner whatever the k-fold cross-validation. R loss function drew similar findings. Although the ranks of the different learners are relatively stable, the combination of the Super Learner-5 provided the best performance. Inclusion of Logic Reg as additional learner in the previous set of candidate learners led to different findings ( Table 2). Globally Logic Reg performed poorly, and only LM(2) produced worse performance than Logic Reg. Based on the SqE as loss function, including Logic Reg in the Super Learner-6 decreased its performance compared to Super Learner-5. Based on R as loss function, the performance of the Super Learner-6 was very good. We applied all learners including Super Learner-5 and Super Learner-6 on the entire dataset (Table 3). Based on SqE, R, and R 2 measure estimates, Super Learner-5 and −6 provided very good performances. The use of LM(2) on the full dataset provided a high level of prediction (R 2 = 0.540) while, based on k-fold cross-validated risk, this learner was the poorest candidate. Comparing cross-validation and full model results indicate the LM(2) model was over fit. Figure 1 displays the mutations retained by each learner. All mutations were retained for LM(1), LM(2), and Random Forest (not surprisingly all mutations are at least selected one time in a tree). CART selected M41L, D67N, T69D, K70R, L74V, and K219Q/E mutations. Of note the D/S/A method selected only the M41L mutation which should be balanced with its poor performance.

AIDS Research and Treatment
The final goal of interpreting genotypic resistance testing is to classify patients as "sensitive" or "resistant" to a specific drug. Figure 2 displays the rates of patients being well classified for the two threshold values investigated. For both threshold values LM(2), Super Learner-5 and −6 have the highest accuracy with around 80% of patients correctly classified. CART and Random Forest provided the lowest accuracy, slightly below 70% of patients correctly classified, corresponding to a 10% difference. As expected the accuracy of Random Forest model depends on the m try values.

Discussion
The choice of subsequent treatment in failing patients is of major importance in the management of HIV-infected patients. Genotypic and phenotypic resistance tests are important tools for choosing promising combination therapy for those patients. We investigated on a small sample a framework both for choosing optimal learner and building an estimator among a set of candidate through two different loss functions and k-fold cross-validation. Based on cross-validation risk, the Super Learner estimator was the "best" learner though the linear model with only main terms LM(1) providing similar performance to that of Super Learner-5 and -6. The use of the SqE as loss function indicated that the inclusion of Logic Reg as an additional learner decreased the performance of the Super Learner estimator. However, prediction results based on the full dataset as well as accuracy questioned the use of SqE as loss function, although it is known that full dataset provided different results than those based on cross-validation strategy [34,35]. Based on cross-validation risk, the good performance of LM(1) should be compared with the poor performance of the linear model with interaction terms LM(2). Inversely, LM(2) outperforms LM(1) in the full dataset. In our small dataset, this finding is clearly due to overfit of the data by the LM(2) model. A researcher ignoring the Super Learner methodology using a linear model with interaction terms would obtain a good performance on the full dataset while such a learner would have not been selected from the discrete Super Learner methodology.
The choice of m try parameter for Random Forest is a real problem. However, the common m try used in regression setting (number of covariables divided by three) appears as a good compromise. Whatever the m try value is, all mutations were selected at least on time using Random Forest on full  dataset. This was expected due to the relative small number of mutations compared with 1,000 trees generated by the Random Forest model The HIV-1 resistance study used either a continuous outcome (as HIV-1 RNA reduction from baseline to the time of interest) or a categorical outcome (classifying patients as achieving a virologic response at the time of interest). For example, virologic response can be defined an HIV-1 reduction of 1.5log 10 copies/mL or more or having a viral load >50 copies/mL at the time of interest. Even if a continuous outcome is preferable as being more informative, the final goal of determining the drug resistance mutations associated with a poorer virologic response is to classify patients as "sensible" or "resistant" to a specified drug. The former patients would receive the corresponding drug as a part of their regimen while the latter patients would not. We used two threshold values of −0.5 and −0.6log 10 copies/mL to define virologic response. For both threshold values LM(2), Super Learner-5 and -6 provided the highest accuracy with approximately 80% of patients correctly classified.
All the methods used in this work are usually applied to large or very large datasets. Simple linear regression model was fitted on more than 5,000 genotype-phenotype paired datasets from the same database [6]. Investigation of logistic regression and nonlinear machine learning for predicting response to antiretroviral treatment was done on more than 3,000 treatment change episodes from the EuResist database [34]. All these analyses were made retrospectively mainly for comparing different methods rather than for building rulebased algorithm.
A major reason to apply the Super Learner methodology on the Jaguar trial is that often the first version of an algorithm for a specific drug is based on a limited amount of data [35][36][37]. Such algorithms are updated later with publication of new data. Nonparametric methods are then often used on such a relative small amount of data [38,39]. Parametric methods have the advantage of not only integrating two-way interactions terms but also adjusting for some other variables that improve the prediction. Randomized clinical trials, in treatment experienced patients, provide frequently the first opportunity to investigate the impact of baseline mutations in the subsequent virologic response in those patients. It was then of interest to know whether the Super Learner methodology applied only on around one hundred of patients was able to produce the "best" learner on the basis of accuracy and prediction. The Jaguar trial which is an "add-on" study ensuring a good quality of relation between reverse transcriptase mutations and effect on the drug investigated, was a good opportunity for such investigation.

AIDS Research and Treatment
It has been shown that, in the context of genotype-phenotype correlation with a large database, the linear model without interactions provided also accurate predictions [6]. However, based on the full dataset results, we highlight the importance of the two-way interactions terms for Least Squares. Interactions between mutations are of scientific interest, both to help in drug selection and to understand mechanisms of resistance.

Conclusion
In this study, we showed that the Super Learner methodology applied on a relative small amount of data, provided good performance. Of note in our dataset, simple linear regression with two-way interaction terms performs as well as the Super Learner.