Estimation of Anti-HIV Activity of HEPT Analogues Using MLR, ANN, and SVM Techniques

The present study deals with the estimation of the anti-HIV activity (log 1/C) of a set of 107 HEPT analogues using molecular descriptors that are responsible for the anti-HIV activity. The study was undertaken using three techniques: MLR, ANN, and SVM. The MLR model fits the training set with R² = 0.856, while the ANN and SVM models give R² = 0.850 and 0.874, respectively. The SVM model thus gives the best estimate of the anti-HIV activity for the training data, while for the test set the ANN model has a higher R² value than the MLR and SVM models. The r_m² metrics and ridge regression analysis indicate that the proposed four-variable model, with MATS5e, RDF080u, T(O⋯O), and MATS5m as correlating descriptors, is the best for estimating the anti-HIV activity (log 1/C) of the present set of compounds.


Introduction
Human immunodeficiency virus (HIV) infection is undoubtedly considered a deadly disease by the international community, including the World Health Organization (WHO) and UNAIDS. The WHO has reported that AIDS has killed more than 25 million people since 1981, making it one of the most destructive pandemics in history.
It is also well known that a lentivirus (a member of the retrovirus family) causes acquired immunodeficiency syndrome (AIDS) [1,2], damaging the immune system and leading to life-threatening infections. A report published in 2007 reveals that approximately 36 million people suffered from HIV infection. An estimated 2.1 million people, including 330,000 children, died that year. Another study reveals that 2.5 million people developed new infections [3][4][5][6]. Unfortunately, the number of deaths due to this disease is still rising.
To overcome this problem, a number of RT inhibitors, including various nonnucleoside RT inhibitors (NNRTIs), have been discovered as new anti-HIV agents. They have better blocking potential and have proved to be effective [7][8][9]. Among them, the 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) compounds are known to target an allosteric site of the enzyme and have been found to be less toxic and more stable than nucleoside RT inhibitors.
Many efforts have been made in the past to model the anti-HIV activity of HEPT derivatives using 2D, 3D, and holographic (HQSAR) methods [10][11][12][13]. Quantitative structure-activity relationship studies were carried out to build models for the estimation of the binding affinities (ΔG) of HEPT and nevirapine analogues with reverse transcriptase [14]. Similarly, Agrawal et al. [15,16] have successfully reported the use of physicochemical as well as topological indices for modeling the anti-HIV activities of HEPT analogues.
In continuation of these studies, we now report modeling of the anti-HIV activity of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) derivatives using graph-theoretical descriptors in which distances and connectivity are considered. The general structure of the HEPT compounds used in the present study is shown in Figure 1. The structural details are presented in Table 1, which also shows the experimental anti-HIV activity of the compounds.
A close look at Figure 1 and the activity data presented in Table 1 indicates that the anti-HIV activity depends mainly upon the type and number of substituents R1 on the benzene moiety.

Experimental Data.
The structural details as well as the anti-HIV activity (log 1/C) of the 107 HEPT analogues are reported in Table 1. The RT inhibition data, expressed as log 1/C, have been taken from the literature [12]. All the chemical structures were drawn with the help of ACD/Labs software, which helps in the calculation of topological descriptors. The descriptors were calculated with Dragon software using mol files generated by ChemSketch.

Selection of Molecular Descriptors and Training/Test Set for External Validation.
For estimating the anti-HIV activity of the 107 HEPT analogues, we used a pool of descriptors classified into 20 different groups. The descriptor selection was carried out by stepwise regression analysis (forward selection method) using NCSS ver. 8 [17]. The selected descriptors are recorded in Table S1 (see Supplementary Material available online at http://dx.doi.org/10.1155/2013/795621). The data set was divided into training and test sets using a random sampling technique in which about 80% of the data (84 compounds) was taken as the training set and the remaining 20% (23 compounds) as the test set for the MLR, ANN, and SVM analyses.
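The random division described above can be sketched as follows. This is only an illustration of an 84/23 random split: the function name, the use of compound indices 1 to 107, and the seed are assumptions, and the split actually used in the study is the one marked in Table 1.

```python
import random

def split_train_test(n_compounds, n_train, seed=0):
    """Randomly partition compound indices 1..n_compounds into training and test sets."""
    rng = random.Random(seed)  # the seed is an arbitrary choice, used only for reproducibility
    indices = list(range(1, n_compounds + 1))
    rng.shuffle(indices)
    train = sorted(indices[:n_train])
    test = sorted(indices[n_train:])
    return train, test

# 107 HEPT analogues, 84 for training and the remaining 23 for testing
train_ids, test_ids = split_train_test(n_compounds=107, n_train=84)
print(len(train_ids), len(test_ids))
```

A different seed gives a different partition, so the performance of any model should be expected to vary somewhat with the split.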

Results and Discussion
The data (Table 2) were subjected to regression analysis, which gave a correlation matrix showing the intercorrelation among the selected descriptors and their correlation with the anti-HIV activity (log 1/C). This matrix is presented in Table S2. Variable selection for the multiple regression analysis indicated ten candidate models for the anti-HIV activity (log 1/C). These models, reported in Table S3, were generated by the successive addition of one to ten descriptors. However, the correlation of the number of descriptors in the model with R² (Figure 2) indicated that at most four to five descriptors can be used for obtaining statistically significant models.
Here and hereafter, N is the number of compounds, Se is the standard error, R² is the squared correlation coefficient, R²A is the adjusted R², F-ratio is Fisher's ratio, and Q is the Pogliani quality factor [18][19][20].
The negative coefficient of MATS5e indicates that a decrease in its magnitude will enhance the activity (log 1/C).
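The regression statistics quoted with each model can be related to one another through their textbook definitions. The sketch below assumes the standard formulas for the adjusted R² and the F-ratio of a least-squares model with k descriptors fitted to N compounds, and Pogliani's quality factor Q = R/Se; the function names are illustrative, and the values need not coincide exactly with those printed by the statistical package used in the study.

```python
import math

def adjusted_r2(r2, n, k):
    """Adjusted R² for a model with k descriptors fitted to n compounds."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def f_ratio(r2, n, k):
    """Fisher's F-ratio: explained variance per descriptor over residual variance."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def pogliani_q(r2, se):
    """Pogliani quality factor Q = R / Se (R is the correlation coefficient)."""
    return math.sqrt(r2) / se

# Four-descriptor model on the 84 training compounds with R² = 0.856 (values from the text)
print(round(adjusted_r2(0.856, n=84, k=4), 3))
print(round(f_ratio(0.856, n=84, k=4), 1))
# The Se value below is hypothetical, chosen only to show the call
print(pogliani_q(0.856, se=0.5))
```

Because Q rewards a high R and a low Se at the same time, it is convenient for comparing models that contain different numbers of descriptors.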

Two-Variable Model.
International Journal of Medicinal Chemistry
Table 1: Structural details of the compounds with their anti-HIV activity (log 1/C) values used in the present study.
The above model indicates that a decrease in MATS5e and an increase in RDF080u will improve the log 1/C values.

Three-Variable Model. When T(O⋯O), a parameter that accounts for the distance between O atoms, is added to the previously stated two-parametric model, a three-parametric model is obtained. Here the change in R² and in the accompanying quality statistics suggests that the model is better than the earlier one.
Four-Variable Model. Here the coefficient of MATS5m is negative. This indicates that a lower value of MATS5m will favour the log 1/C value for the compounds used in the present study.
A close look at (4) reveals that MATS5e (Moran autocorrelation of lag 5, weighted by atomic Sanderson electronegativities) and MATS5m (Moran autocorrelation of lag 5, weighted by atomic masses) play a dominant role in the activity. Both belong to the 2D autocorrelation category. A brief description of the descriptors is given in Table 2. The log 1/C values of the training set compounds predicted using the above four-parametric model are recorded in Table 1, and their correlation with the experimental values is shown in Figure S1. Model (4) has further been used to predict the log 1/C values of the remaining 23 compounds, which form the test set. These predicted values are also recorded in Table 1. The predictive R² for the model was obtained by plotting the observed against the estimated log 1/C values for these compounds (Figure S1). The R²pred comes out to be 0.814, confirming that the proposed model is meaningful.
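A predictive R² read from a plot of observed against estimated values corresponds to the squared correlation coefficient between the two series. The sketch below shows that calculation together with a commonly used alternative definition, 1 - PRESS/SS, which also requires the mean activity of the training set. Which of the two the authors used is not stated beyond the plot, so both are shown as assumptions, and the numbers in the example are made up rather than taken from Table 1.

```python
def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted activities."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    s_oo = sum((o - mean_o) ** 2 for o in observed)
    s_pp = sum((p - mean_p) ** 2 for p in predicted)
    return s_op ** 2 / (s_oo * s_pp)

def r2_pred_press(observed, predicted, train_mean):
    """Alternative definition: 1 - PRESS / SS about the training-set mean."""
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss = sum((o - train_mean) ** 2 for o in observed)
    return 1.0 - press / ss

# Made-up test-set values for illustration only
test_observed = [5.0, 6.2, 7.1, 8.3]
test_predicted = [5.3, 6.0, 7.4, 8.1]
print(squared_correlation(test_observed, test_predicted))
print(r2_pred_press(test_observed, test_predicted, train_mean=6.5))
```

The two definitions agree for perfect predictions but can differ noticeably when the predictions are biased, which is why it helps to state which one is reported.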
The above findings confirm that, for the estimation of the anti-HIV activity (log 1/C) of the present set of compounds, a four-variable model containing MATS5e, RDF080u, T(O⋯O), and MATS5m as correlating descriptors is the most appropriate. The ridge analysis (Table 3) indicates that all the ridge parameters are well within the allowed values, showing that the proposed model is suitable and statistically significant. The ridge trace and the variance inflation factors for the four-variable model are shown in Figures 3 and 4, respectively.
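The variance inflation factor (VIF) used in such a collinearity check can be computed from the descriptor columns alone. A minimal sketch, assuming the standard identity that the VIF of descriptor j is the j-th diagonal element of the inverse of the descriptor correlation matrix; the helper names and the toy columns are illustrative and are not the values of Table 3.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of descriptor columns."""
    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        return sxy / math.sqrt(sxx * syy)
    return [[pearson(a, b) for b in columns] for a in columns]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (adequate for a few descriptors)."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each descriptor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Toy descriptor columns: the first pair is uncorrelated, the second pair is correlated
print(vif([[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]))
print(vif([[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 6.0]]))
```

A VIF of 1 means a descriptor is uncorrelated with the others, and a frequently quoted rule of thumb treats values above about 10 as a sign of serious collinearity.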
These four descriptors were further used in the artificial neural network (ANN) and support vector machine (SVM) techniques. The methodology, validation techniques, and model performance evaluation for these two methods have been discussed previously by Agrawal et al. [21][22][23]. The observed and predicted values of log 1/C for the training as well as the test data using the ANN and SVM techniques are reported in Table S4.

ANN and SVM Results. Artificial neural network (ANN) and support vector machine (SVM) analyses were carried out using STATISTICA Data Miner software ver. 10 [24]. The initial architecture of the ANN, selected by the automated network search function, had four neurons in the input layer, three neurons in the hidden layer, and one output neuron. The input neurons correspond to the four selected descriptors of the best MLR model. The optimization was done with 10-fold cross-validation. When the network is trained on the entire training data, it gives R² = 0.850, RMSE = 2.193, and MAE = 0.24. Using the trained network for prediction on the test set gives R² = 0.878, RMSE = 0.823, and MAE = 0.171. A plot of the observed and predicted values of log 1/C for the training as well as the test data using the ANN model is shown in Figure S2.
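The two error measures reported with each model follow directly from the residuals between observed and predicted activities. A minimal sketch using the usual definitions of RMSE and MAE; the numbers are made up for illustration and are not taken from Table S4.

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n

# Made-up log 1/C values for illustration only
observed = [5.0, 6.2, 7.1, 8.3]
predicted = [5.3, 6.0, 7.4, 8.1]
print(rmse(observed, predicted), mae(observed, predicted))
```

For the same residuals the RMSE is never smaller than the MAE, and the gap between them widens when a few compounds are predicted much worse than the rest.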
The SVM regression type 1 was selected for training the data to obtain the capacity (C), epsilon (ε), and gamma (γ) values. To find the optimum values of the two parameters, tenfold cross-validation based on the training set was performed, and the values giving the lowest RMSE were selected. Using the selected parameters ([…] = 0.14, […]) […] 1.87, R² = 0.867, and MAE = 0.393. A plot of the observed and predicted values of log 1/C for the training as well as the test set using the SVM model is shown in Figure S3.
The average r_m² and Δr_m² were calculated for judging the quality of the proposed model using the r_m² metric method. It is well established that for an acceptable QSAR model the value of "average r_m²" should be >0.5 and "Δr_m²" should be <0.2 [25,26]. In our study these two variants of the parameter, average r_m² and Δr_m², were calculated for both the training (internal validation) and test (external validation) sets in addition to the total data set (overall validation). The average r_m² and Δr_m² values for the training, test, and overall data sets (MLR, ANN, and SVM) are reported in Table 4.
A close observation of this table clearly indicates that all the values obtained from the r_m² metrics are in favour of the four-parametric model proposed by us.
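The average r_m² and Δr_m² values can be sketched from observed and predicted activities as follows. The sketch assumes the commonly quoted form r_m² = r²(1 - √|r² - r0²|), where r0² is the squared correlation for a regression through the origin, and evaluates it with the axes in both orders. The function names are illustrative, the example values are made up, and the exact variant used for Table 4 is not stated in the text.

```python
import math

def _r2(x, y):
    """Squared Pearson correlation between two series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def _r0_squared(y, x):
    """Squared correlation for the regression of y on x through the origin (y = k*x)."""
    k = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
    my = sum(y) / len(y)
    ss_res = sum((b - k * a) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot

def rm2_metrics(observed, predicted):
    """Return (average r_m², delta r_m²) from the two axis orders."""
    r2 = _r2(observed, predicted)
    rm2 = r2 * (1.0 - math.sqrt(abs(r2 - _r0_squared(observed, predicted))))
    rm2_rev = r2 * (1.0 - math.sqrt(abs(r2 - _r0_squared(predicted, observed))))
    return (rm2 + rm2_rev) / 2.0, abs(rm2 - rm2_rev)

def is_acceptable(avg_rm2, delta_rm2):
    """Thresholds quoted in the text: average r_m² > 0.5 and delta r_m² < 0.2."""
    return avg_rm2 > 0.5 and delta_rm2 < 0.2

# Made-up values for illustration only
avg, delta = rm2_metrics([5.0, 6.0, 7.0, 8.0], [5.2, 5.9, 7.1, 7.8])
print(avg, delta, is_acceptable(avg, delta))
```

Unlike R² alone, these metrics penalise predictions that correlate well with the observations but deviate from them in absolute value.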
A randomization test was performed to investigate the probability of chance correlation for the best models. In a randomization test the dependent variable (log 1/C) is randomly shuffled and new QSAR models are built using the original descriptors. The results indicate that the coefficients of determination obtained by chance are low while the RMSE values are high. This clearly indicates that the models obtained in this study are better than those obtained by chance. The randomization test results are shown in Figure S4.
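The shuffling procedure can be illustrated with a deliberately simplified sketch. It assumes a single descriptor, so that the R² of the least-squares fit equals the squared Pearson correlation, and it uses synthetic data with an arbitrary seed; the study itself applies the test to the four-descriptor models.

```python
import random

def r_squared(x, y):
    """R² of a one-descriptor least-squares fit (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def y_randomization(x, y, n_trials=100, seed=0):
    """Refit the model n_trials times with the activities randomly shuffled."""
    rng = random.Random(seed)  # arbitrary seed, used only for reproducibility
    scores = []
    for _ in range(n_trials):
        shuffled = y[:]
        rng.shuffle(shuffled)
        scores.append(r_squared(x, shuffled))
    return scores

# Synthetic descriptor and activity values (not data from the study)
descriptor = [float(i) for i in range(20)]
activity = [0.5 * v + 0.3 * ((7 * i) % 5) for i, v in enumerate(descriptor)]

true_r2 = r_squared(descriptor, activity)
random_r2 = y_randomization(descriptor, activity)
mean_random_r2 = sum(random_r2) / len(random_r2)
print(true_r2, mean_random_r2)
```

If the R² values of the shuffled models were close to that of the real model, the original correlation would be suspect.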

Comparison with Other QSAR Studies.
Luco and coworkers [12] proposed QSAR models based on multiple regression analysis and PLS methods for the anti-HIV activity of 107 HEPT analogues. They developed their models on the entire data set and found that the best model involves 11 correlating descriptors, with statistical quality given by R² = 0.9044. It is interesting to compare our results with theirs. Our model has only four correlating parameters, with R² = 0.856 for the training set and R²pred = 0.814 for the test set. We therefore consider our MLR model, which needs far fewer descriptors, preferable to the one previously reported by Luco et al. In addition, we have applied the ANN and SVM techniques, for which the statistical parameters are better, especially with the ANN method.

Conclusions
A comparison of the model performance results demonstrates that the SVM model estimates the anti-HIV activity (log 1/C) of the compounds more accurately than the ANN and MLR models for the training data set, while for test set prediction the ANN model was better. The proposed models identify the descriptors that are responsible for the anti-HIV activity and could be used for designing new HEPT derivatives.