
The effectiveness of antiretroviral therapy has been limited by the development of human immunodeficiency virus type 1 (HIV-1) drug resistance. HIV-1 frequently develops resistance to the antiretroviral drugs used to treat it, which may decrease both the magnitude and the duration of the response to treatment, resulting in loss of viral suppression and therapeutic failure [

Genotypic or phenotypic assays are used for resistance testing, each assay having advantages and limitations. From those assays we used either the genotypic-phenotypic correlation, which shows the phenotypic effect of mutations, or the genotypic-virologic correlation, which investigates the impact of mutations on the virological response to a subsequent treatment. The latter correlation is mainly used by the

In both cases many HIV-1 drug resistance analysis approaches have been explored, from simple linear models [

A framework for the unified loss-based estimation suggested a solution to this problem in the form of a new estimator, called the “Super Learner” [

Originally, the Super Learner used both mean square of residuals (differences between observed and predicted outcomes) and

Our aim is to study the performance of the discrete and the most recent Super Learner methodology on a small sample of HIV-1 data from a randomized clinical trial. In particular, based on this methodology, we investigate four different cross-validation settings and the use of two loss functions for six statistical learning methods. This methodology is applied to the Jaguar trial data [

For a patient

The methodology has been proposed by Mark van der Laan et al. [

We applied all individual learners and the new estimator to the full dataset (called the full model in the following). Learners are ranked from those identified as top learners to those providing poor performance. We investigate four splits: 10-fold, 4-fold, 3-fold, and 2-fold, which correspond to 90%, 75%, 66%, and 50% of the data used as training samples and 10%, 25%, 33%, and 50% as validation samples, respectively. Learners were evaluated using two distinct functions commonly used as loss functions: squared error (SqE) and first-order coefficient (
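The relation between the number of folds and the share of data used for training can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names and the sample size of 37 are hypothetical, and only the squared error loss is shown because it is the one fully specified in the text.

```python
import random

def v_fold_indices(n, v, seed=0):
    """Split indices 0..n-1 into v folds of near-equal size (hypothetical helper)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::v] for k in range(v)]

def training_fraction(v):
    """Share of the data used for training in v-fold cross-validation."""
    return (v - 1) / v

def squared_error(observed, predicted):
    """Mean squared difference between observed and predicted outcomes."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# The four splits investigated: 10-fold, 4-fold, 3-fold, and 2-fold.
for v in (10, 4, 3, 2):
    print(f"{v}-fold: {100 * training_fraction(v):.1f}% training, "
          f"{100 / v:.1f}% validation")

# Example with an arbitrary (assumed) sample size of 37 patients.
folds = v_fold_indices(37, 4)
```

Each fold serves once as the validation sample while the remaining folds form the training sample, so a smaller number of folds leaves less data for fitting each learner.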

We defined two threshold values to define patients having a virologic response:

We investigate the following learners: Logic Regression, Deletion/Substitution/Addition, Least squares regression, Random Forest, Classification and Regression Trees. All algorithms are available as free packages of

Logic Regression (package named

From those learners, we set up two Super Learners: a Super Learner using five learners, built with D/S/A, LM(1), LM(2), Random Forest, and CART (denoted Super Learner-5 in the following), and a Super Learner with six learners, the same as Super Learner-5 plus Logic Regression (denoted Super Learner-6 in the following).
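The selection step of a discrete Super Learner, which keeps the candidate with the smallest cross-validated risk, might be sketched as below. This is only an illustration under stated assumptions: the two candidate learners (an overall mean and a one-mutation group mean) are simple stand-ins, not the learners of Super Learner-5 or Super Learner-6, and the toy data are invented.

```python
import random
from statistics import mean

def fit_overall_mean(X, y):
    """Stand-in learner: always predicts the training mean."""
    m = mean(y)
    return lambda x: m

def fit_one_mutation(X, y, j=0):
    """Stand-in learner: mean outcome with / without mutation j."""
    overall = mean(y)
    with_mut = [yi for xi, yi in zip(X, y) if xi[j] == 1]
    without = [yi for xi, yi in zip(X, y) if xi[j] == 0]
    m1 = mean(with_mut) if with_mut else overall
    m0 = mean(without) if without else overall
    return lambda x: m1 if x[j] == 1 else m0

def cv_risk(fit, X, y, v, seed=0):
    """V-fold cross-validated squared error of one learner."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::v] for k in range(v)]
    fold_risks = []
    for valid in folds:
        held_out = set(valid)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        fold_risks.append(mean([(y[i] - predict(X[i])) ** 2 for i in valid]))
    return mean(fold_risks)

def discrete_super_learner(learners, X, y, v):
    """Return the name of the learner with the lowest cross-validated risk."""
    risks = {name: cv_risk(fit, X, y, v) for name, fit in learners.items()}
    return min(risks, key=risks.get), risks

# Invented toy data: two binary mutation indicators, outcome driven by the first.
X = [[i % 2, (i // 2) % 2] for i in range(40)]
y = [-1.5 + 1.0 * x[0] + 0.05 * ((i % 5) - 2) for i, x in enumerate(X)]

learners = {"overall_mean": fit_overall_mean, "one_mutation": fit_one_mutation}
best, risks = discrete_super_learner(learners, X, y, v=4)
print(best, risks)
```

The same folds are reused for every candidate so that the risks are comparable. The full Super Learner goes one step further and combines the candidates' cross-validated predictions with weights rather than selecting a single learner; that step is not shown here.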

An internal fine-tuning procedure based on cross-validation was used to obtain the best performance for Logic Regression and D/S/A. The tuning parameters of D/S/A were

Results of the Discrete Super Learner and Super Learner-5 are given in Table

Squared error,

SqE without Logic Reg.

Method | 10-fold Rank | 10-fold Mean | 4-fold Rank | 4-fold Mean | 3-fold Rank | 3-fold Mean | 2-fold Rank | 2-fold Mean | Mean rank
LM(1) | 1.5 | 0.216 | 3 | 0.246 | 3 | 0.238 | 3 | 0.293 | 2.625
LM(2) | 6 | 1.218 | 6 | 1.267 | 6 | 1.650 | 6 | 1.117 | 6
Random Forest | 3 | 0.258 | 2 | 0.241 | 2 | 0.235 | 2 | 0.275 | 2.25
D/S/A | 5 | 0.283 | 4 | 0.264 | 4 | 0.255 | 4 | 0.295 | 4.25
CART | 4 | 0.264 | 5 | 0.267 | 5 | 0.258 | 5 | 0.298 | 4.75
Super Learner-5 | 1.5 | 0.216 | 1 | 0.238 | 1 | 0.228 | 1 | 0.273 | 1.125

Method | 10-fold Rank | 10-fold Mean | 4-fold Rank | 4-fold Mean | 3-fold Rank | 3-fold Mean | 2-fold Rank | 2-fold Mean | Mean rank
LM(1) | 1.5 | 0.464 | 3 | 0.554 | 3 | 0.534 | 1.5 | 0.651 | 2.25
LM(2) | 6 | 0.808 | 6 | 0.724 | 6 | 0.686 | 6 | 0.754 | 6
Random Forest | 3 | 0.609 | 2 | 0.552 | 2 | 0.532 | 3 | 0.656 | 2.5
D/S/A | 5 | 0.712 | 4 | 0.623 | 4 | 0.607 | 5 | 0.746 | 4.5
CART | 4 | 0.632 | 5 | 0.644 | 5 | 0.611 | 4 | 0.743 | 4.5
Super Learner-5 | 1.5 | 0.464 | 1 | 0.539 | 1 | 0.508 | 1.5 | 0.651 | 1.25
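The rank and mean rank columns of the SqE table without Logic Regression can be recomputed from the reported means, as a worked check. The sketch below assumes that ranks are assigned in ascending order of mean risk and that tied values share the average of their ranks; it works on the rounded values shown in the table, whereas the original ranking may have used unrounded risks.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the block
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

methods = ["LM(1)", "LM(2)", "Random Forest", "D/S/A", "CART", "Super Learner-5"]

# Mean cross-validated SqE per split, copied from the table above.
sqe = {
    "10-fold": [0.216, 1.218, 0.258, 0.283, 0.264, 0.216],
    "4-fold": [0.246, 1.267, 0.241, 0.264, 0.267, 0.238],
    "3-fold": [0.238, 1.650, 0.235, 0.255, 0.258, 0.228],
    "2-fold": [0.293, 1.117, 0.275, 0.295, 0.298, 0.273],
}

ranks_by_split = {split: average_ranks(vals) for split, vals in sqe.items()}
mean_rank = {
    m: sum(ranks_by_split[split][i] for split in sqe) / len(sqe)
    for i, m in enumerate(methods)
}

for m in methods:
    print(f"{m}: {mean_rank[m]}")
```

For example, LM(1) and Super Learner-5 tie at 0.216 in the 10-fold split and therefore both receive rank 1.5, and the mean rank of LM(1) is (1.5 + 3 + 3 + 3) / 4 = 2.625.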

Squared error,

SqE with Logic Reg.

Method | 10-fold Rank | 10-fold Mean | 4-fold Rank | 4-fold Mean | 3-fold Rank | 3-fold Mean | 2-fold Rank | 2-fold Mean | Mean rank
LM(1) | 1 | 0.216 | 2 | 0.246 | 2 | 0.238 | 2 | 0.293 | 1.75
LM(2) | 7 | 1.218 | 7 | 1.267 | 7 | 1.650 | 7 | 1.117 | 7
Random Forest | 2 | 0.258 | 1 | 0.241 | 1 | 0.235 | 1 | 0.275 | 1.25
D/S/A | 4 | 0.283 | 3 | 0.264 | 3 | 0.255 | 3 | 0.295 | 3.25
CART | 3 | 0.264 | 4 | 0.267 | 4 | 0.258 | 4 | 0.298 | 3.75
LogicReg | 6 | 0.653 | 6 | 0.65 | 6 | 0.652 | 6 | 0.653 | 6
Super Learner-6 | 5 | 0.378 | 5 | 0.455 | 5 | 0.499 | 5 | 0.527 | 5

Method | 10-fold Rank | 10-fold Mean | 4-fold Rank | 4-fold Mean | 3-fold Rank | 3-fold Mean | 2-fold Rank | 2-fold Mean | Mean rank
LM(1) | 1.5 | 0.464 | 3 | 0.554 | 3 | 0.534 | 2 | 0.651 | 2.375
LM(2) | 7 | 0.808 | 7 | 0.724 | 7 | 0.686 | 7 | 0.754 | 7
Random Forest | 3 | 0.609 | 2 | 0.552 | 2 | 0.532 | 3 | 0.656 | 2.5
D/S/A | 6 | 0.712 | 4 | 0.623 | 4 | 0.607 | 6 | 0.746 | 5
CART | 4 | 0.632 | 5 | 0.644 | 5 | 0.611 | 5 | 0.743 | 4.75
LogicReg | 5 | 0.702 | 6 | 0.685 | 6 | 0.684 | 4 | 0.657 | 5.25
Super Learner-6 | 1.5 | 0.456 | 1 | 0.523 | 1 | 0.485 | 1 | 0.593 | 1.125

We applied all learners including Super Learner-5 and Super Learner-6 on the entire dataset (Table

Squared Error,

Full Model

Method | Rank (SqE) | Value (SqE) | Rank | Value | Rank | Value
LM (1) | 5 | 0.204 | 5 | 0.435 | 4 | 0.319
LM (2) | 1.5 | 0.138 | 1.5 | 0.265 | 1.5 | 0.540
Random Forest | 4 | 0.178 | 4 | 0.348 | 6 | 0.271
D/S/A | 7 | 0.242 | 7 | 0.561 | 7 | 0.193
CART | 6 | 0.211 | 6 | 0.454 | 5 | 0.299
Super Learner-5 | 1.5 | 0.138 | 1.5 | 0.265 | 1.5 | 0.540
Super Learner-6 | 3 | 0.139 | 3 | 0.266 | 3 | 0.539

Selected mutations for each model on the complete Jaguar trial data.

The final goal of interpreting genotypic resistance testing is to classify patients as “sensitive” or “resistant” to a specific drug. Figure

Rates of patients being well classified for threshold −0.5 and
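The rate of well-classified patients for a given threshold can be computed as in the sketch below. Several points are assumptions rather than statements from the study: a patient is taken to be a responder when the outcome (assumed here to be the change in HIV-1 RNA on a log10 scale) is at or below the threshold, a patient counts as well classified when the observed and predicted outcomes fall on the same side of the threshold, and the values are invented for illustration.

```python
def is_responder(change, threshold):
    """Assumed rule: virologic response when the change is at or below the threshold."""
    return change <= threshold

def well_classified_rate(observed, predicted, threshold):
    """Share of patients whose predicted class matches their observed class."""
    agree = sum(
        is_responder(o, threshold) == is_responder(p, threshold)
        for o, p in zip(observed, predicted)
    )
    return agree / len(observed)

# Hypothetical observed and predicted changes for five patients.
observed = [-1.2, -0.3, -0.8, 0.1, -0.6]
predicted = [-0.9, -0.6, -0.7, -0.1, -0.4]

rate = well_classified_rate(observed, predicted, threshold=-0.5)
print(f"well classified: {rate:.0%}")
```

With these invented values, three of the five patients are classified on the same side of the −0.5 threshold by the observed and predicted outcomes.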

The choice of subsequent treatment in failing patients is of major importance in the management of HIV-infected patients. Genotypic and phenotypic resistance tests are important tools for choosing a promising combination therapy for those patients. On a small sample, we investigated a framework both for choosing the optimal learner and for building an estimator among a set of candidates through two different loss functions and

Based on cross-validation risk, the Super Learner estimator was the “best” learner, although the linear model with only main terms, LM(1), provided performance similar to that of Super Learner-5 and -6. The use of SqE as the loss function indicated that the inclusion of Logic Regression as an additional learner decreased the performance of the Super Learner estimator. However, prediction results based on the full dataset, as well as accuracy, called into question the use of SqE as the loss function, although it is known that the full dataset provides different results from those based on a cross-validation strategy [

The choice of

The HIV-1 resistance study used either a continuous outcome (such as HIV-1 RNA reduction from baseline to the time of interest) or a categorical outcome (classifying patients as achieving a virologic response at the time of interest). For example, virologic response can be defined as an HIV-1 RNA reduction of

All the methods used in this work are usually applied to large or very large datasets. A simple linear regression model was fitted on more than 5,000 genotype-phenotype paired datasets from the same database [

A major reason to apply the Super Learner methodology to the Jaguar trial is that the first version of an algorithm for a specific drug is often based on a limited amount of data [

It has been shown that, in the context of genotype-phenotype correlation with a large database, the linear model without interactions also provided accurate predictions [

In this study, we showed that the Super Learner methodology, applied to a relatively small amount of data, provided good performance. Of note, in our dataset, simple linear regression with two-way interaction terms performed as well as the Super Learner.

A. Houssaïni and P. Flandre designed research; A. Houssaïni and P. Flandre performed analysis; A. Houssaïni, L. Assoumou, A. G. Marcelin, J. M. Molina, V. Calvez and P. Flandre discussed the results and improved the paper.

P. Flandre has received travel grants or consulting fees from Abbott, Bristol-Myers Squibb, Gilead, Janssen-Tibotec, and ViiV Healthcare. Dr. Molina has received travel grants or consulting fees from Bristol-Myers Squibb. A. Houssaïni, L. Assoumou, A. G. Marcelin, and V. Calvez have none to declare.

The authors thank Bristol-Myers Squibb (Dr. Yacia Bennai) for making the Jaguar study data available. A. Houssaïni is grateful for the financial support of SIDACTION in the form of a PhD fellowship. This work was supported by SIDACTION Grant FJC-1/AO20-2/09, the Agence Nationale de Recherches sur le SIDA (ANRS), the European Community Seventh Framework Programme (FP7/2007–2013) under the project “Collaborative HIV and Anti-HIV Drug Resistance Network (CHAIN)”, grant agreement no. 223131, and ARVD (Association de Recherche en Virologie en Dermatologie).