QSAR Modeling and Molecular Docking Analysis of Some Active Compounds against Mycobacterium tuberculosis Receptor (Mtb CYP121)

A quantitative structure-activity relationship (QSAR) study was performed to develop a model that relates the structures of 50 compounds to their activities against M. tuberculosis. The compounds were optimized by employing density functional theory (DFT) with B3LYP/6-31G⁎. The Genetic Function Algorithm (GFA) was used to select the descriptors and to generate the correlation model that relates the structural features of the compounds to their biological activities. The optimum model has squared correlation coefficient (R2) of 0.9202, adjusted squared correlation coefficient (Radj) of 0.91012, and leave-one-out (LOO) cross-validation coefficient (Qcv2) value of 0.8954. The external validation test used for confirming the predictive power of the built model has R2pred value of 0.8842. These parameters confirm the stability and robustness of the model. Docking analysis showed the best compound with high docking affinity of −14.6 kcal/mol which formed hydrophobic interaction and hydrogen bond with amino acid residues of M. tuberculosis cytochromes (Mtb CYP121). QSAR and molecular docking studies provide valuable approach for pharmaceutical and medicinal chemists to design and synthesize new anti-Mycobacterium tuberculosis compounds.


Introduction
Mycobacterium tuberculosis is a bacterial species responsible for causing tuberculosis (TB). It mainly affects the lungs and other parts of the body such as spine, kidney, and brain unless urgent treatment is provided. Tuberculosis remains one of the most prevalent infectious bacterial diseases, resulting in the death of 1.4 million people worldwide [1]. There are drugs like isoniazid, rifampicin, ciprofloxacin, and ethambutol available as a cure for tuberculosis.
However, due to the emergence of multidrug-resistant (MDR) and extensively drug-resistant (XDR) tuberculosis, this poses a big challenge towards the successful treatment of tuberculosis [2]. This led to development of new therapeutics against diverse strains of M. tuberculosis [3]. New synthesized 1,2,4-Triazole derivative compounds demonstrate tuberculosis inhibition activity [4]. Synthesis of novel molecules is typically developed using a trial-and-error approach, which is time-consuming and costly.
Quantitative structure-activity relationship (QSAR) plays a crucial part in novel drug design via a ligand-based approach [5]. The key success of the QSAR method is the possibility to predict the properties of new chemical compounds without the need to synthesize and test them. This technique is broadly utilized for the prediction of physicochemical properties in the chemical, industrial, pharmaceutical, biological, and environmental spheres [6]. Moreover, the QSAR strategies save resources and accelerate the process of developing new molecules for use as drugs, materials, and additives or for whatever purposes [7]. Meanwhile molecular docking is a computational method used to determine the binding strength between the active site residues and specific molecule(s) [8]. Molecular docking is expedient tool used in the drug discovery field to investigate the binding compatibility of molecules (ligands) to target (receptor) [9].
The aim of this research was to develop QSAR model to predict the activity of 1,2,4-Triazole derivatives as potent anti-Mycobacterium tuberculosis compounds and to elucidate the

Data Collection.
Fifty molecules of 1,2,4-Triazole derivatives as potent antitubercular agents that were used in this study were obtained from the literature [4]. (BA). The biological activities of 1,2,4-Triazole derivatives against Mycobacterium tuberculosis in aerobic active stage were initially expressed in percentage (%) and then converted to logarithm unit using (1) in order to increase the linearity and approach normal distribution of the activity values. The observed structures and the biological activities of these compounds were presented in Table 1.

Geometry Optimization.
The chemical structures of the molecules were drawn with ChemDraw Ultra Version 12.0. Each molecule was first preoptimized with the molecular mechanics (MMFF) and further reoptimized with density functional theory (DFT) utilizing the B3LYP and 6-31G * basis set [10,11]

Normalization and Data Pretreatment. The descriptors'
values were normalized using (2) in order to give each variable the same opportunity at the onset to influence the model [12].
where is the value of each descriptor for a given molecule and max and min are the maximum and minimum value for each column of descriptors . The normalized data were subjected to pretreatment using data pretreatment software in order to remove noise and redundant data.

Training and Test
where ME is the mean effect of a descriptor j in a model, is the coefficient of the descriptor in that model, is the value of each descriptor in the data matrix for each molecule in the training set, is the number of descriptors that appear in the model, and is the number of molecules in the training set [13].

Degree of Contribution of Selected Descriptors.
Contribution of each descriptor in the model was measured by calculating their standardized regression coefficients ( ) using (6).
where is the regression coefficient of descriptor . and are the standard deviations for each descriptor and activity, respectively. Statistical property allows one to assign a greater importance to those molecular descriptors that exhibit larger absolute standardized coefficients 2.9. Internal Validation of Model. Internal validation of the model was carried out using Materials Studio Version 8 software by employing the Genetic Function Approximation (GFA) method. The models were estimated using the LOF. The LOF is measured using a slight variation of the original Friedman formula, so that the best fitness score can be received. LOF is expressed as follows [14]: where SSE is the sum of squares of errors, is the number of terms in the model, is a user-defined smoothing parameter, is the total number of descriptors contained in the model, and is the amount of data in the training set. SEE is defined as [15] The square of the correlation coefficient ( 2 ) defines the fraction of the total variation attributed to the model. The closer the value of 2 to 1.0, the better the model generated. 2 is expressed as Journal of Pathogens 3                where exp , pred , and training are the experimental activity, the predicted activity, and the mean experimental activity of the samples in the training set, respectively. Value of 2 varies directly with the increase in number of descriptors. Thus, 2 is not reliable to measure the goodness of fit of the model. Therefore, 2 is adjusted for the number of explanatory variables in the model. The adjusted 2 is defined as where is the number of independent variables in the model and is the number of descriptors. The strength of the QSAR equation to predict bioactivity of a compound was determined using the leave-one-out cross-validation method. The revised formula for crossvalidation regression coefficient where pred , exp , and training are the predicted, experimental, and mean values of experimental activity of the training set.

External Validation of Model
. External validation of model was assessed by 2 test value. The closer the value of 2 test to 1.0, the better the model generated.
where pred test and exp test are the predicted and experimental activity test set, while training represents mean values of experimental activity of the training set.

-Randomization
Test. -Randomization test is another useful external validation parameter to confirm that the built QSAR model is strong and is not inferred by chance. The -Randomization test was performed on the training set data [16]. For the built model to pass the -Randomization test, 2 should be more than 0.5.
where 2 is coefficient of determination for -Randomization, is coefficient of correlation for -Randomization, and is average " " of random models.

Evaluation of the Applicability Domain of the Model.
Assessment of the applicability domain of a QSAR model is an important approach to confirm that the model built is able to make good predictions within the chemical space for which it was developed [16]. The leverage approach was employed to describe the applicability domain of the QSAR model [17].
Leverage of a given chemical compound is defined as follows: where ℎ is the leverage of each compound, is the descriptor row-vector of the query compound , and is the × descriptor matrix of the training set compounds used to build the model. As a prediction tool, the warning leverage (ℎ * ) is the limit of normal values for outliers and is defined as where is the number of training compounds and is the number of descriptors in the model. The Williams plot, a plot of standardized residual versus leverage value, was employed to elucidate the relevance area of the model in terms of chemical space. Data is said to be an outlier if the standardized cross-validated residual produced by the model is greater than ±3.

Quality
Assurance of the Model. The fitting ability, stability, robustness, reliability, and predictive ability of the developed models were evaluated by internal and external validation parameters. The validation parameters were compared with the minimum recommended value for a generally acceptable QSAR model [17] shown in Table 2.
2.14. Docking Studies. Molecular docking study was carried out in order to elucidate which of the 1,2,4-Triazole derivatives has the best binding affinity against Mtb CYP121. The structure of Mtb CYP121 used in the study was obtained from protein data bank with PDB code 51BG. The prepared ligand and receptor were shown in Figure 1. The optimized structures of 1,2,4-Triazole derivatives initially saved as SDF files were converted to PDB files using Spartan 14 V 1.1.4. The prepared ligands were docked with prepared structures of Mtb CYP121 using AutoDock Vina incorporated in PyRx software. The docked results were compiled, visualized, and analyzed using Discovery Studio Visualizer.

Results and Discussion
QSAR was performed to investigate the structure-activity relationship of 50 compounds as potent antitubercular agents. The nature of model in a QSAR study is expressed by its fitting ability, stability, robustness, reliability, and forecast capacity. Experimental and predicted activities for 1,2,4-Triazole derivatives were presented in Table 3. The low residual value between experimental and predicted activity indicates that the model is of high predictability.
The genetic algorithm-multiple linear regression (GA-MLR) investigation led to the selection of six descriptors that were used to assemble a linear model for calculating predictive activity on Mycobacterium tuberculosis. Five QSAR models were built using Genetic Function Algorithm (GFA), but due to the statistical significance, model 1 was selected and reported as given below:  All the validation parameters for this mode were reported in Table 4 and were all in agreement with parameters presented in Table 2, which actually confirmed the robustness of the model.
The QSAR model generated in this research was compared with the models obtained in the literature [18,19]     Friedman LOF 0.40847300 2 -squared 0.92023900 3 Adjusted -squared 0.91017400 4 Cross-validated -squared ( 2 cv ) 0.89538600 5 Significant regression Yes 6 Significance-of-regression -value 58.41835200 7 Critical SOR -value (95%) 2.45854700 8 Replicate points 0 9 Computed experimental error 0.00000000 10 Lack-of-fit points 28 11 Min expt. error for nonsignificant LOF (95%) 0.24688800 From the above models, it could be seen that maxHCsats and 3D-radial distribution function (RDF) descriptors were also observed in the model generated in this research. This indicates that these descriptors have great influence on the activities of the inhibitory compounds against Mycobacterium tuberculosis. The validation parameters reported in this work and those reported in the literature were all in agreement with parameters presented in Table 2, which actually confirmed the robustness of the model. Descriptive statistics of the activity values of the training and test set data reported in Table 5 show that test set value range (8.2854 to 4.9074) was within the training set value range (8.0899 to 4.7441). Also, the mean and standard deviation of the test set activity value (6.4989 and 0.93) were approximately similar to those of the training set value (6.6222 and 0.96). This indicates that the test set is interpolative within the training. Therefore, Kennard and Stone's algorithm employed in this study was able to generate a test set that is a good reflection of the training set.
The name and symbol of the descriptors used in the QSAR optimization model were reported in Table 6. The presence of the three 2D and three 3D descriptors in the model suggests that these types of descriptors are able to characterize better anti-Mycobacterium tuberculosis activities of the compounds. Pearson's correlation matrix and statistics of the six descriptors employed in the QSAR model were reported in Table 7, which shows clearly that the correlation coefficients between each pair of descriptors are very low; thus, it can be inferred that there exists no significant intercorrelation among the descriptors used in building the model. The absolute -statistics value for each descriptor is greater than 2 at 95% significant level, which indicates that the selected descriptors were good. The estimated Variance Inflation Factor (VIF) values for all the descriptors were less than 4, which imply that the model generated was statistically significant and the descriptors were orthogonal. TDB9e 3D topological distance based autocorrelation-lag 9/weighted by Sanderson electronegativities 3D 5 RDF90i Radial distribution function-090/weighted by relative first ionization potential 3D 6 RDF110s Radial distribution function-110/weighted by relative I-state 3D  The mean effect (ME) values and standard regression coefficient ( ) reported in Table 8 provide important information on the effect of the molecular descriptors and the degree of contribution in the developed model. The signs and the magnitude of these descriptors combined with their mean effects indicate their individual strength and direction in influencing the activity of a compound. The null hypothesis says that there is no significant relationship between the descriptors and the activities of the inhibitor compounds. The values of the descriptors at 95% confidence limit shown in Table 8 are all less than 0.05. This implies that the alternative hypothesis is accepted. Hence, there is a relationship between the descriptors used in generating the model and the activities of the inhibitor compounds which take preference over the null hypothesis.
-Randomization parameter test was reported in Table 9. The low 2 and 2 values for several trials confirm that the developed QSAR model is robust, while the 2 value greater than 0.5 affirms that the created model is powerful and is not inferred by chance.

Interpretation of Selected Descriptors.
AATS7s is average Moreau-Broto Autocorrelation-lag 7/weighted by I-state autocorrelation descriptor. It is based on spatial dependent autocorrelation function that measures the strength of the relationship between observations (atomic or molecular properties) and space separating them (lag). This descriptor is obtained by taking the molecule atoms as the set of discrete points in space and an atomic property as the function evaluated at those points. When this descriptor is calculated on molecular graph, the lag coincides with the topological distance between any pair of the vertices. AATS7s is defined on the molecular graphs using atomic masses ( ), Sanderson electronegativity ( ), and inductive effect of pairs of atoms 7 bonds apart as the weighting scheme. These observations suggested that atomic masses and electronic distribution of the atoms that made up the molecule had significant effect on the antitubercular activity of the dataset. In addition, the signs of the regression coefficients for each descriptor indicated the direction of influence of the descriptors in the models such that positive regression coefficient associated with a descriptor will augment the activity profile of a compound, while the negative coefficient will diminish the activity of the compound. Electrotopological state atom-type descriptor nHBint3 represents count of E-state descriptors of strength for potential hydrogen bonds of path length 3. It is a spatial dependent 2D autocorrelation descriptor with the incorporation of Moran coefficient (index) in the measurement of the strength of the relationship between observations and space separating them. This Moran autocorrelation descriptor contained in the model reported in this study was defined on the molecular graphs using atomic masses ( ), Sanderson electronegativity ( ), and inductive effect of pairs of atom 3 bonds apart as the weighting scheme. These observations supported the claim that atomic masses and electronic distribution had significant effect on the antitubercular activities of the molecules The positive mean effect of this descriptor indicates that the inhibitory activity of 1,2,4-Triazole derivatives will increase with hydrogen bonds of path length 3. minHCsatu (minimum atom-type H E-state: H on C sp3 bonded to unsaturated C) is a 2D electrotopological state (E-state indices) atom-type descriptor. In general, Estate indices encode the intrinsic electronic state of each atom as perturbed by the electronic influences of all other atoms in the molecule within the context of the topological character of the molecule. maxHCsatu favors the addition of -CH3 to unsaturated C atom, for example, in benzene ring. Positive contribution of minHCsatu indicates that the inhibitory activity of 1,2,4-Triazole derivatives will increase with increase in the molecular descriptor.
TDB9e (3D topological distance based autocorrelationlag 9/weighted by Sanderson electronegativities) is positively correlated to the anticonvulsant activity, meaning that increase in its value augments the activity of the studied compounds. The descriptor measures the strength of the connection between atomic charges 9 bonds apart. The number of rings in the molecular system tends to increase the values of this descriptor as observed for molecules. This may be due to increase in the amount of -electrons in the molecular system, bringing about increase in the charge difference between atoms 9 bonds apart. The positive mean effect indicates a positive impact on the activity of the inhibitory compounds, which means that increasing the value of this descriptor produces higher activity of these compounds.
RDF90i and RDF110s are 3D radial distribution functions at 2.5 and 7.0 interatomic distance weighted by atomic masses. The radial distribution function is probability distribution to find an atom in a spherical volume of radius. RDF descriptors are independent of the size and rotation of the entire molecule. They describe the steric hindrance or the structure-activity properties of a molecule. The RDF descriptor provides valuable information about the bond distances, ring types, planar and nonplanar systems, and atom types. The presence of these descriptors in the model suggested the occurrence of a linear relationship between antitubercular activity and the 3D molecular distribution of atomic masses in the molecules calculated at radius of 2.0Å and 7.0Å from the geometrical centers of each molecule. RDF90i with positive mean effect (MF) indicates positive impact on the activity, while RDF110s with negative Predicted activity against experimental activity of training and test set was shown in Figures 2 and 3. The 2 value of 0.9202 for training set and 2 value of 0.8842 for test set recorded in this study were in agreement with GFA-derived 2 value reported in Table 2. This confirms the stability, reliability, and robustness of the model. Plot of standardized residual versus experimental activity shown in Figure 4 indicates a symmetric distribution or random scattering of data points above and below the standardized residual line equal to zero. Also, all the data points were within the boundary defined by standardized residual of ±2. Thus, it implies that there was no systemic error in model developed as the spread of residuals was pragmatic on both sides of zero [20].
The standardized residuals in the dataset were plotted against their leverages for every compound, leading to discovery of outliers and influential molecules in the models. The Williams plot of the standardized residuals versus the leverage value is shown in Figure 5. From our result, it is evident that no outlier is found, since all the compounds for both the training and test set were within the applicability domain of the square area except for three compounds that are structurally influential molecules (i.e., compounds 50, 39, and 36). These three compounds are said to be structurally influential molecules, since their leverage values are greater than the warning leverage (ℎ * = 0.60). Their high leverage values are responsible for swaying the performance of the model. This was attributed to strong differences in their chemical structures compared to other compounds in the dataset.

Molecular
Docking. Molecular docking study was carried out between the target (Mtb CYP121) and 1,2,4-Triazole derivatives. All the compounds were found to inhibit the receptor by occupying the active sites of the target protein (Mtb CYP121).
For target protein, binding affinity values for all the compounds range from −5.1 to −14.6 kcal/mol as reported in Table 10. However, four ligands (compounds 7, 8, 13, and 14) have higher binding score, which ranges from −10.0 to 14.6 kcal/mol, which were greater than their co-ligands.
These four ligands were visualized and analyzed in Discovery Studio Visualizer as shown in Figure 6. Binding     Journal of Pathogens while the molecular docking studies provide a valuable approach for structure base design. These two approaches will be of great help for pharmaceutical and medicinal chemists to design and synthesize new anti-Mycobacterium tuberculosis compounds.

Conflicts of Interest
The authors declare that they have no conflicts of interest.