The selection of the most relevant molecular descriptors to describe a target variable in the context of QSAR (Quantitative Structure-Activity Relationship) modelling is a challenging combinatorial optimization problem. In this paper, a novel software tool for addressing this task in the context of regression and classification modelling is presented. The methodology implemented by the tool is organized into two phases. The first phase uses a multiobjective evolutionary technique to select subsets of descriptors. The second phase performs an external validation of the chosen descriptor subsets in order to improve reliability. The tool's functionalities are illustrated through a case study on the estimation of the ready biodegradation property, as an example of classification QSAR modelling. The results show the usefulness and potential of this novel software tool, which aims to reduce the time and cost of development in the drug discovery process.
Molecular Informatics is an emerging interdisciplinary field that addresses mathematical and computational problems related to molecule-based information encoding and processing, oriented to the discovery of new knowledge in several fields such as pharmacology, material engineering, or environmental sciences [
In recent years, the sizes of chemical compound databases have expanded considerably. However, this abundance of available data has not prevented the growth of the failure rate in the preclinical phases or of the “attrition rate”, which measures the proportion of candidate compounds for new drugs that are discarded during the different phases of a drug design project [
QSAR studies require the codification of the chemical structure of compounds by a diversity of molecular descriptors [
Several machine learning approaches have been proposed for addressing the selection of molecular descriptors in an automatic [
In Soto et al
Later, a software tool named DELPHOS was implemented based on this two-phase methodology [
In this paper, a novel software tool, called
QSAR models establish relationships between some structural characteristics of a chemical compound and a specific physicochemical or biological property of interest [
Representative scheme of descriptor selection process.
The methodology presented in [
DELPHOS two-phase feature selection methodology.
Linear regression is a mathematical method that models the relationship between an output variable (y), independent variables (xi), and a random error term (
Regression trees are decision trees applied to regression problems. Each internal node of the tree represents a condition (for example, whether or not a feature value exceeds a certain threshold) and each leaf denotes the regression function to be used. The terms of this regression function correspond to the features that guided the path to that leaf. Furthermore, these methods provide a pruning mechanism that keeps the height of the tree to a minimum and thus avoids overfitting.
The Neural Network (multilayer perceptron) method classifies instances using a network trained through backpropagation. This network can be monitored and modified during training time. The nodes in the network are all sigmoid (except when the class is numeric, in which case the output nodes become unthresholded linear units).
The k-nearest neighbours method assigns to the instance being classified the majority label among its k nearest neighbours. The most commonly used measure of closeness is the Euclidean distance.
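The majority-vote rule described above can be sketched in a few lines. This is only an illustrative sketch, not the implementation used by the tool: the function names, the class labels ("RB" and "NRB"), and the toy descriptor values are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to `query`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy descriptor vectors and labels (hypothetical, not from the case study).
train_X = [(0.1, 0.2), (0.0, 0.4), (0.9, 0.8), (1.0, 1.1), (0.2, 0.1)]
train_y = ["RB", "RB", "NRB", "NRB", "RB"]

prediction = knn_predict(train_X, train_y, (0.15, 0.25), k=3)
```

In practice, descriptors are usually scaled before computing distances, since the Euclidean distance is otherwise dominated by the descriptors with the largest numeric range.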
Random Forest generates a forest of random trees [
Random Committee builds an ensemble of randomized base classifiers. Each base classifier is built using a different random seed. The final predicted value is a straight average of the predictions generated by the individual base classifiers.
In decision trees, the data are recursively divided into smaller sets by binary partitions. In each iteration of the method, different partitions are evaluated (over the whole dataset) and the best one is chosen. The division of the data produces a tree structure as output, where each internal node represents one of the input variables and each leaf node represents a value of the target variable. That is, the predicted value of the target variable is obtained by following the path from the root to a leaf of the tree.
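The averaging scheme can be illustrated with a small sketch. The base learner here, a one-split stump on a randomly chosen feature and threshold, is a hypothetical stand-in chosen for brevity; it is not the base classifier used in the paper, and all names and data below are assumptions made for illustration.

```python
import random


def train_random_stump(X, y, seed):
    """Fit a randomized base learner: a single split on a random feature and
    a random threshold. Returns a function mapping a sample to an estimated
    probability of class 1."""
    rng = random.Random(seed)  # each committee member gets its own seed
    feature = rng.randrange(len(X[0]))
    values = [row[feature] for row in X]
    threshold = rng.uniform(min(values), max(values))
    left = [label for row, label in zip(X, y) if row[feature] <= threshold]
    right = [label for row, label in zip(X, y) if row[feature] > threshold]
    prior = sum(y) / len(y)  # fallback when one side of the split is empty
    p_left = sum(left) / len(left) if left else prior
    p_right = sum(right) / len(right) if right else prior
    return lambda row: p_left if row[feature] <= threshold else p_right


def train_committee(X, y, n_members=10, base_seed=1):
    """Build the ensemble; members differ only in their random seed."""
    return [train_random_stump(X, y, base_seed + i) for i in range(n_members)]


def committee_predict(committee, row):
    """Straight (unweighted) average of the members' predictions."""
    return sum(member(row) for member in committee) / len(committee)


# Toy data (hypothetical): class 1 has a high first and a low second feature.
X = [(0.1, 1.0), (0.2, 0.9), (0.8, 0.2), (0.9, 0.1)]
y = [0, 0, 1, 1]

committee = train_committee(X, y)
p_class1 = committee_predict(committee, (0.9, 0.1))
p_class0 = committee_predict(committee, (0.1, 1.0))
```

Because every member is seeded explicitly, rebuilding the committee with the same `base_seed` reproduces the same predictions.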
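A minimal sketch of this recursive partitioning is given below. The text does not state which criterion is used to rank partitions, so the weighted Gini impurity is an assumption made here for illustration; the function names and the toy data are likewise hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(X, y):
    """Evaluate every (feature, threshold) binary partition of the data and
    return (score, feature, threshold) for the lowest weighted Gini impurity."""
    best = None
    n = len(y)
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
            if not left or not right:
                continue  # not a real partition
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


def build_tree(X, y, max_depth=3):
    """Recursively split the data; leaves hold the majority class."""
    split = best_split(X, y) if max_depth > 0 and len(set(y)) > 1 else None
    if split is None:
        return max(sorted(set(y)), key=y.count)
    _, feature, threshold = split
    left = [i for i, row in enumerate(X) if row[feature] <= threshold]
    right = [i for i, row in enumerate(X) if row[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1),
        "right": build_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1),
    }


def predict(tree, row):
    """Follow the path from the root to a leaf and return the leaf's value."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Toy data (hypothetical): two descriptors, two classes.
X = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (4.0, 0.5)]
y = [0, 0, 1, 1]
tree = build_tree(X, y)
```

On this toy dataset a single split on the first descriptor separates the two classes, so the tree has one internal node and two leaves.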
The dataset used for the classification case study was extracted from [
In order to evaluate the risk of a random correlation in a subset of selected molecular descriptors, an fs-randomization (feature selection randomization) technique was used. This method consists of randomly selecting a set of descriptors (with the same cardinality as the subset selected by a specific technique) from the original set of features. With these descriptors and the original property values, a new model is generated under the same experimental criteria that were used to obtain the final QSAR model. Finally, the percentage of correctly classified cases (%CC) and the Matthews Correlation Coefficient (MCC) are reported. This procedure is executed a considerable number of times in order to obtain a distribution of values with statistical significance.
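The fs-randomization loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure as implemented in the study: a leave-one-out 1-nearest-neighbour classifier stands in for "the same experimental criteria" of the final model, and the function names and the synthetic data are hypothetical.

```python
import math
import random


def mcc(y_true, y_pred, positive=1):
    """Matthews Correlation Coefficient for binary labels (0.0 if undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def loo_1nn_predictions(X, y):
    """Leave-one-out predictions of a 1-nearest-neighbour classifier
    (a stand-in learner, assumed here for illustration only)."""
    preds = []
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])))
        preds.append(y[nearest])
    return preds


def evaluate_subset(X, y, columns):
    """Return (%CC, MCC) of a model built only on the given descriptor columns."""
    Xs = [[row[c] for c in columns] for row in X]
    preds = loo_1nn_predictions(Xs, y)
    cc = 100.0 * sum(p == t for p, t in zip(preds, y)) / len(y)
    return cc, mcc(y, preds)


def fs_randomization(X, y, cardinality, n_rounds=1000, seed=0):
    """Score `n_rounds` random descriptor subsets of the given cardinality."""
    rng = random.Random(seed)
    n_features = len(X[0])
    return [evaluate_subset(X, y, rng.sample(range(n_features), cardinality))
            for _ in range(n_rounds)]


# Synthetic data (hypothetical): descriptor 0 is informative, the rest are noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[label + data_rng.uniform(-0.2, 0.2)] + [data_rng.random() for _ in range(9)]
     for label in y]

selected = evaluate_subset(X, y, [0])  # the "selected" subset
null_distribution = fs_randomization(X, y, cardinality=1, n_rounds=50)
```

The scores of the selected subset are then compared against the distribution obtained from the random subsets; a selected subset that does not clearly outperform that distribution offers little evidence that the selection was meaningful.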
A similar procedure was performed to evaluate the random correlation of the final QSAR model inferred from a set of descriptors using y-randomization [
In this section, details of the modifications made to the two-phase method developed by Soto et al. will be provided [
As mentioned above, MoDeSuS relies on the methodology presented in [
MoDeSuS two-phase feature selection methodology.
In the
MoDeSuS provides a graphical interface that allows the user to operate the software without needing to know specific details of the code or of the different methods applied. It also offers a variety of features, which are explained below and summarized in Figure
MoDeSuS functionalities.
In this section, a case study in the context of classification problems is presented to illustrate in detail the use of MoDeSuS in pharmacology. The property under study corresponds to the ready biodegradation of chemical compounds. When executing the tool, several options will be available (Figure
MoDeSuS initial view.
When choosing the “First and Second Phase” option, a data loading window will be displayed (Figure
MoDeSuS data loading.
After data loading, another window will be displayed (Figure
MoDeSuS data loading verification.
Once the data size has been verified as correct, execution continues by displaying the first phase parameters configuration window (Figure
MoDeSuS first phase.
In Figure
In the Wrapper Configuration section, it is possible to configure all the parameters that the wrapping method needs to perform the search. In this sense, the
In the GA Settings section, it is possible to configure all the parameters associated with the evolutionary method. It is possible to determine the
MoDeSuS second phase.
In Figure
Once the execution of the second phase is finished, the results window will be displayed (Figure
MoDeSuS results view.
Pressing the button for a given statistical metric displays the corresponding graph. Figure
Performance of the three subsets with the highest predictive accuracy obtained by using MoDeSuS. The percentage of correctly classified cases (%CC), the average Receiver Operating Characteristic (ROC), the Matthews Correlation Coefficient (MCC), and the cardinality are reported.
Metrics | Subset_1C | Subset_2C | Subset_3C |
---|---|---|---|
%CC | 84 | 81 | 81 |
ROC | 0.89 | 0.88 | 0.87 |
MCC | 0.7 | 0.66 | 0.64 |
Cardinality | 15 | 15 | 15 |
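For reference, two of the metrics in the table can be written in terms of the confusion matrix, assuming the standard binary-classification definitions. The symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives) are notation introduced here and do not appear in the original text.

```latex
\[
\%\mathrm{CC} = 100 \cdot \frac{TP + TN}{TP + TN + FP + FN},
\qquad
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}
                    {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\]
```

MCC ranges from -1 to 1, where 1 indicates perfect agreement between predicted and observed classes and 0 indicates performance no better than chance.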
MoDeSuS graphics: (a) Percentage of Cases Correctly Classified (%CC) and (b) Average Receiver Operating Characteristic (ROC).
With the three subsets reported in Table
Predictive accuracy of the external validation process over subsets 1C, 2C, and 3C by using Weka. The percentage of correctly classified cases (%CC), the Matthews Correlation Coefficient (MCC), precision (PR), and recall (RC) values are reported.
Run | %CC (1C) | MCC (1C) | PR (1C) | RC (1C) | %CC (2C) | MCC (2C) | PR (2C) | RC (2C) | %CC (3C) | MCC (3C) | PR (3C) | RC (3C) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 86.12 | 0.65 | 0.86 | 0.86 | 83.28 | 0.57 | 0.83 | 0.83 | 84.77 | 0.61 | 0.84 | 0.85 |
2 | 86.86 | 0.67 | 0.87 | 0.87 | 83.43 | 0.57 | 0.83 | 0.83 | 85.37 | 0.63 | 0.85 | 0.85 |
3 | 86.41 | 0.66 | 0.86 | 0.86 | 83.13 | 0.57 | 0.83 | 0.83 | 85.52 | 0.63 | 0.85 | 0.86 |
4 | 86.86 | 0.67 | 0.86 | 0.87 | 82.53 | 0.55 | 0.82 | 0.83 | 84.77 | 0.61 | 0.84 | 0.85 |
5 | 87.46 | 0.68 | 0.87 | 0.88 | 83.43 | 0.56 | 0.83 | 0.83 | 84.62 | 0.61 | 0.84 | 0.84 |
6 | 87.31 | 0.68 | 0.87 | 0.87 | 84.47 | 0.6 | 0.84 | 0.85 | 85.22 | 0.63 | 0.85 | 0.85 |
7 | 87.46 | 0.68 | 0.87 | 0.88 | 83.43 | 0.58 | 0.83 | 0.83 | 85.07 | 0.62 | 0.85 | 0.85 |
8 | 85.97 | 0.65 | 0.86 | 0.86 | 82.98 | 0.57 | 0.83 | 0.83 | 85.67 | 0.64 | 0.85 | 0.86 |
9 | 87.31 | 0.68 | 0.87 | 0.87 | 82.23 | 0.54 | 0.82 | 0.82 | 84.62 | 0.61 | 0.84 | 0.85 |
10 | 87.61 | 0.69 | 0.87 | 0.88 | 82.23 | 0.54 | 0.82 | 0.82 | 85.07 | 0.62 | 0.85 | 0.85 |
Avg. | | | | | 83.00 | 0.57 | 0.83 | 0.83 | 85.00 | 0.62 | 0.85 | 0.85 |
Based on the results shown in Table
In this section, two experiments are presented in order to evaluate the risk of a random correlation both in the final descriptor subset chosen (Subset_1C) and in the final QSAR model inferred from these molecular descriptors. The first aspect to evaluate is whether the Subset_1C selected by MoDeSuS has a significantly higher predictive accuracy than other, randomly selected subsets of descriptors of the same cardinality. Then, in a second instance, the final QSAR model is evaluated in order to ensure that it is not classifying compounds randomly.
In the first instance, a feature selection randomization (fs-randomization) was carried out in the following way: a thousand combinations of fifteen descriptors were randomly selected from the initial set of 1480 molecular descriptors. Then, for each random subset, a new QSAR model was learned under the same experimental conditions as the final QSAR model, finally reporting the %CC and MCC values. Table
Statistical results for
Metric | Mode | Variance | Perc(99) | |
---|---|---|---|---|
%CC | 79.85 | 8.81 | 86.71 | 87 |
MCC | 0.53 | 0.01 | 0.66 | 0.67 |

Metric | Mode | Variance | Perc(99) | |
---|---|---|---|---|
%CC | 63.73 | 4.10 | 68.05 | 87 |
MCC | 0.03 | 0.001 | 0.09 | 0.67 |
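Summary statistics of this kind (mode, variance, and 99th percentile) can be computed from a distribution of randomization scores with a short sketch such as the one below. How the original values were computed is not stated, so several choices here are assumptions: the mode is taken after rounding to two decimals, the variance is the sample variance, and the percentile uses the nearest-rank definition. The function names and input values are hypothetical.

```python
import statistics


def percentile_nearest_rank(values, p):
    """p-th percentile (integer p in 1..100) using the nearest-rank definition."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def summarize_distribution(values, model_value, decimals=2):
    """Summarize randomization scores and compare them with the model's score."""
    perc99 = percentile_nearest_rank(values, 99)
    return {
        "mode": statistics.mode([round(v, decimals) for v in values]),
        "variance": statistics.variance(values),  # sample variance (assumption)
        "perc99": perc99,
        "model_exceeds_perc99": model_value > perc99,
    }


# Hypothetical %CC values from a handful of randomization rounds.
scores = [60.0, 62.0, 62.0, 64.0, 66.0]
summary = summarize_distribution(scores, model_value=87.0)
```

A model score above the 99th percentile of the randomized distribution indicates that at most about 1% of the random runs reached the same level of performance.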
As a next step, a y-randomization experiment was executed. This technique is arguably the most powerful form of validation for evaluating the risk of chance correlation in QSAR models [
In this paper, a novel software tool for the selection of molecular descriptor subsets in QSAR modelling is presented. This new feature selection tool, named MoDeSuS, was designed to address this task for regression and classification problems. The computational methodology behind MoDeSuS is organized as a two-phase procedure. The first phase makes use of a multiobjective evolutionary technique that identifies promising subsets of molecular descriptors following a wrapper technique. The second phase complements the first by refining the chosen subsets of descriptors and improving confidence in them through more complex machine learning methods, namely Random Forest and Random Committee. Additionally, several visualization modes for the different metrics reported for classification and regression modelling are included in the software.
The facilities and functionalities of MoDeSuS have been illustrated by using the tool in a case study that constitutes an example of classification QSAR modelling, where the estimated property corresponds to the ready biodegradation of chemical compounds. Comparisons with the performance achieved by other QSAR studies have been discussed, showing the potential and usefulness of this novel software. For that reason, we think that MoDeSuS can constitute a valuable tool for QSAR modelling practitioners, helping to reduce time and monetary costs in drug development projects.
As future work, we plan to extend our software tool to consider the applicability domain of the QSAR models derived from the different subsets of selected molecular descriptors recommended by MoDeSuS, as an additional performance metric. The estimation of the applicability domain is a key issue in QSAR modelling, because the generalizability of the models depends on it. This goal can be achieved by integrating, in the fitness function of the evolutionary algorithm, information about the applicability domain of the QSAR models generated by each subset of selected molecular descriptors explored during the first phase of MoDeSuS. In this way, the feature selection will not only produce accurate and interpretable QSAR models, but also ensure enhanced generalizability on new data, resulting in more reliable predictions.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This work is kindly supported by CONICET, Grant PIP 112-2012-0100471, and UNS, Grants PGI 24/N042 and PGI 24/ZM17.