Joint Decision-Making Model Based on Consensus Modeling Technology for the Prediction of Drug-Induced Liver Injury

Drug-induced liver injury (DILI) is the major cause of clinical trial failure and postmarketing withdrawals of approved drugs. It is very expensive and time-consuming to evaluate hepatotoxicity using animal or cell-based experiments in the early stage of drug development. In this study, an in silico model based on the joint decision-making strategy was developed for DILI assessment using a relatively large dataset of 2608 compounds. Five consensus models were developed with PaDEL descriptors and PubChem, Substructure, Estate, and Klekota–Roth fingerprints, respectively. Submodels for each consensus model were obtained through joint optimization. *e parameters and features of each submodel were optimized jointly based on the hybrid quantum particle swarm optimization (HQPSO) algorithm. *e application domain (AD) based on the frequency-weighted and distance (FWD)based method and Tanimoto similarity index showed the wide AD of the qualified consensus models. A joint decision-making model was integrated by the qualified consensus models, and the overwhelming majority principle was used to improve the performance of consensus models. *e application scope narrowing caused by the overwhelming majority principle was successfully solved by joint decision-making.*e proposed model successfully predicted 99.2% of the compounds in the test set, with an accuracy of 80.0%, a sensitivity of 83.9, and a specificity of 73.3%. For an external validation set containing 390 compounds collected from DILIrank, 98.2% of the compounds were successfully predicted with an accuracy of 79.9%, a sensitivity of 97.1%, and a specificity of 66.0%. Furthermore, 25 privileged substructures responsible for DILI were identified from Substructure, PubChem, and Klekota–Roth fingerprints. *ese privileged substructures can be regarded as structural alerts in hepatotoxicity evaluation. Compared with the main published studies, our method exhibits certain advantage in data size, transparency, and standardization of the modeling process and accuracy and credibility of prediction results. It is a promising tool for virtual screening in the early stage of drug development.


Introduction
Bringing a new drug to market is a time-consuming and expensive process. Lots of candidate drugs fail to become drugs or withdraw from market mainly because of their safety and lack of efficacy [1]. Drug toxicity evaluation is an essential process of drug development as it is reportedly responsible for the attrition of approximately 30% of drug candidates [2]. Human adverse effects of drugs (AEDs) cost upward of $3.6 billion each year and constitute one of the top assessment of the drug-induced liver injury (DILI) potential of drug candidates is important and very useful for improving the efficiency of drug development. Commonly used techniques for DILI evaluation, such as animal studies, reactive metabolites, various cell cultures, human liver microsomes, and recombinant enzymes, are time-consuming and expensive [9][10][11][12]. Due to the lack of efficient laboratory tests for DILI assessment during drug design, the hepatotoxicity of some drugs is often observed when the population is exposed to drug postmarketing [13]. For some patients with chronic diseases, drug-induced hepatotoxicity or other toxicity (e.g., nephrotoxicity) is likely to be the ultimate killer, rather than the disease itself [14]. erefore, the development of safe and effective drugs can better benefit mankind. Developing fast and accurate experimental and computational approaches to assess the risk of DILI, to understand the biological pathways of DILI, and to search novel biomarkers for liver injury has aroused a great deal of interest worldwide [15,16]. In silico techniques, especially (quantitative) structure-activity relationship ((Q)SAR), can directly employ hepatotoxicity data to infer the toxicity of other candidate drugs and avoid the limitations of traditional animal or cell-based experiments [4,8]. Compared with experimental methods, (Q)SAR models are applicable to virtual molecules even before they are isolated or synthesized, so as to conduct virtual screening in the early stage of drug development [17]. (Q)SAR can also identify key fragments of chemical structures that have an impact on hepatotoxicity, enabling pharmaceutical chemists to select a suitable structure for drug molecules. With low attrition of time and money, and a fast readout, (Q)SAR models have increasingly become good tools for predicting hepatotoxicity in the early stage of drug development [18,19].
In the past decades, DILI data have been systematically recorded in numerous public databases (such as LiverTox, Hepatox, and Liver Toxicity Knowledge Base) and some published studies [20][21][22]. ese data have become valuable resources in the study of DILI, and many classification models have been developed with these datasets. As far as we know, the reported models for DILI predictions were quite different in terms of modeling technologies, data size, data sources, and features. In terms of modeling technologies, various algorithms, such as support vector machine (SVM) [5], random forest (RF) [8], Bayesian [23], deep learning methods [24,25], AdaBoost decision trees [26], k-nearest neighbor [27], and artificial neural networks (ANNs) [28][29][30], have been used to develop DILI prediction models. In terms of data size, the smallest dataset consists of only 74 molecules [29], and the largest dataset consists of 3712 molecules [31]. In terms of data source, hepatotoxicity data from clinical trials [10,11,31] and cell-based [24,32] and animal (such as rodents and nonrodents [15]) experiments have been used to develop DILI prediction models. ese data are mainly from a public hepatotoxicity database or published papers or mixed sources. In terms of the types and numbers of features, there were various descriptors or fingerprints (such as PubChem fingerprint (PubchemFP), SiRMS, MDs, ECFP6, Estate fingerprint (EstateFP), CDK fingerprint, CDK extended fingerprint, Klekota-Roth fingerprint (KRFP), MACCS keys, MOE, PaDEL, and Dragon) used in these models [17,21,[32][33][34][35][36][37]. e minimum number of features selected in these models is 12, and the maximum number of selected features exceeds 1000. Although most of DILI prediction models have relatively reasonable accuracy, they still have some shortcomings, mainly in the following aspects: (1) Some studies used less than 500 compounds to develop their DILI models [23,28,29,37]. Although these models have relatively high prediction accuracy, some of them may have little practical value due to their small datasets. It is difficult to collect sufficient and diverse molecular information for them to develop a classification model with a wide AD. (2) During the modeling process, not all models conform to the principles of the Organisation for Economic Co-operation and Development (OECD). Some models were not cross-verified [9,30,38], some models lacked an AD [31,39], and some models lacked mechanistic interpretation [9,40]. In addition, the modeling process of some models is not transparent enough [11,[17][18][19][20][21]30], which leads to poor repeatability and reproducibility. (3) Imbalanced chemical data are very common in DILI studies. It is difficult to achieve a well-balanced sensitivity and specificity of a machine learning model trained on imbalanced chemical data, but only Bajželj [30] used technical means to balance DILI negatives and DILI positives. (4) Optimization technology has been widely used in (Q)SAR research to improve the performance of (Q)SAR models [41]. But as far as we know, only Mulliner et al. [31] applied genetic algorithm (GA) optimization technology to optimize their DILI models. ese shortcomings limit the widespread applications of these models in drug discovery. Taking into consideration the above-mentioned issues, we attempt to develop a joint decision-making model based on a relatively large and chemically diverse DILI dataset to achieve better performance and credibility, as well as a wide AD. To achieve this goal, a total of 2608 compounds (1643 DILI positives/965 DILI negatives) were first manually collected. en, molecular features of all compounds were characterized by PaDEL descriptors, Substructure fingerprint (SubFP), PubchemFP, EstateFP, and KRFP, respectively. Five consensus models were constructed based on support vector machine (SVM) and hybrid quantum particle swarm optimization (HQPSO) algorithm [42] and consensus modeling technology. Four qualified consensus models were integrated into a joint decision-making model to improve the accuracy and credibility of prediction results. Furthermore, the contributions of PaDEL descriptors to DILI were analyzed, and some special substructures associated with DILI were identified as privileged substructures for structural alerts. We hope the model could be a useful tool for DILI assessment of drug candidates in the early stage of drug discovery, and the privileged substructures will be helpful for pharmaceutical chemists to design an appropriate structure for drug molecules.

Modeling Overview.
A workflow for the modeling process is shown in Figure 1. In the modeling process, principal component analysis (PCA) and self-organizing mapping (SOM) neural network were used to split the dataset into diverse training and test sets, and then the training set was divided into a fivefold cross-validation set for subsequent study. In feature selection of PaDEL descriptors, the mean decrease impurity (MDI) method [43] was used to evaluate the importance of descriptors. Each consensus model is composed of 20 submodels, each of which is constructed by SVM and joint optimization strategy. e HQPSO algorithm and fivefold cross-validation were employed to jointly optimize the parameters and features of submodels. e joint decision-making model was integrated with the qualified consensus models whose accuracy of fivefold cross-validation (ACC cv5 ) is great than 0.7. Finally, the test set and independent external validation set were used to validate the performance of the proposed joint decision-making model. Specific details of each modeling step are provided in subsequent sections.

Assessment of Model Performance.
A clear performance evaluation index is the premise of (Q)SAR research. In this study, several statistical parameters were used for assessment of the performance of constructed models, including sensitivity (SE), specificity (SP), prediction accuracy (ACC), Matthews correlation coefficient (MCC), and geometric mean (GMEN), which are defined, respectively, as follows: where TP, TN, FP, and FN are the numbers of true DILI positives, true DILI negatives, false DILI positives, and false DILI negatives, respectively, and SE and SP stand for the prediction accuracy of DILI positives and DILI negatives, respectively. e MCC value is generally regarded as a balanced measure, which can be used even if the classes have very different sizes. It returns a value between −1 and 1. A coefficient of 1 represents a perfect prediction, 0 indicates no better than random prediction, and -1 indicates total disagreement between prediction and observation [5]. e GMEN can indicate the balance between specificity and sensitivity of a machine learning model [44].
Additionally, the receiver-operating characteristic (ROC) curve was employed to assess the performance of constructed models. e ROC curve is plotted by varying the threshold values of prediction probabilities of each consensus model to show the separation ability of these binary classification models. Above the threshold value, the output of the model is predicted as positive; otherwise, it is negative. e curve is a graphical plot of sensitivity against the falsepositive rate (1 − specificity). e values of the area under the ROC curve (AUC) were also computed to quantitatively describe the ROC curve. e AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. e AUC value ranges from 0.5 to 1.0. AUC � 1.0 means the classifier is perfect, and AUC � 0.5 indicates the classifier is useless and random [39].

Data Collection.
It is well known that the more the chemically diverse compounds used for (Q)SAR modeling, the wider the AD of the obtained model. Merging datasets is an effective way to increase the amount of data and expand the application domain of DILI models [9]. In our study, a relatively large DILI dataset was extracted from five publications [15,24,27,34,39]. To improve the quality and reliability of data, we coped with them as follows: (1) Inorganic compounds, mixtures, and compounds without SMILES string were removed. (2) Compounds with no clear DILI effect were removed.
(3) All confused DILI effects were checked carefully. For the same compound listed in different publications, if the number of DILI-positive results and DILInegative results is the same, we delete the compound. If the number of DILI-positive results is greater than the number of DILI-negative results, we label it as DILI positive, and vice versa. (4) Duplicates in the dataset were removed.

Description and Descriptor
Pruning. In this study, to realize the joint decision-making strategy, and construct a robust and convincing (Q)SAR model, commonly used PaDEL descriptors and four types of fingerprints were calculated. ese fingerprints include PubchemFP (881 bits), SubFP (307 bits), KRFP (4860 bits), and EstateFP (79 bits). e descriptors and fingerprints were calculated through the online website (http://www.scbdd.com).
Firstly, PaDEL descriptors were calculated. To construct a stable DILI model, we only calculated 1D and 2D PaDEL descriptors to avoid uncertainty caused by molecular structure optimization when calculating 3D descriptors. After that, we examined each compound carefully. If a compound is with null descriptor and the variance of this descriptor in other compounds is greater than 0.3, we removed this compound to avoid deleting important descriptors in subsequent steps. To simplify the structure of constructed models and reduce redundancy, one pretreatment was performed to delete some Journal of Chemistry 3 uninformative or redundant descriptors before further selection: descriptors with constant or null values were excluded and if two descriptors were found to be correlated pairwise (greater than 0.85), one of them was deleted randomly [43]. After that, 2608 compounds (1643 DILI positives/965 DILI negatives) and 298 descriptors were reserved. e 2608 compounds used in this study are listed in the Supplementary Materials (DILI_DataSet.pdf ).

Data Splitting Based on PaDEL Descriptors.
e real predictive ability of a DILI model must be evaluated by the predictive accuracy of compounds not used in model development. is type of assessment requires the use of a test set. To obtain chemically diverse training and test sets, and keep the balance of DILI positives and DILI negatives in the training and test sets, PCA and SOM neural network were employed to split the modeling data. A workflow for data splitting is shown in Figure 2.
Firstly, the PCA method was used to deal with the input variables (characterized by 298 PaDEL descriptors) of the DILI-positive and DILI-negative datasets, respectively. en, two SOM neural networks were constructed to split the DILI-positive and DILI-negative compounds, respectively. For each SOM neural network, the first K principal components whose cumulative contribution rate is more than 90% (for the DILI-positive dataset, K � 46, and for the DILI-negative dataset, K � 41) were selected as the input of the SOM neural network.
After that, each SOM divided the DILI positives and DILI negatives into nine groups, respectively, so that each group of data has structural similarity. e clustering results of DILI positives and DILI negatives are shown in Figures 3(a) and 3(b), respectively. en, for each group of data, we randomly selected 80% of the data to constitute the training set and assigned the remaining 20% to the test set. Finally, we obtained a training set containing 2080 compounds (1311 DILI positives/769 DILI negatives) and a test set containing 528 compounds (332 DILI positives/196 DILI negatives). A detailed classification is listed in the Supplementary Materials (DILI_DataSet.pdf ).
In addition, to facilitate optimization and obtain a robust DILI model in subsequent studies, a fivefold cross-validation dataset was constructed using the DILI-positive and DILInegative training sets for each kind of descriptor and fingerprint. e flow of constructing the fivefold cross-validation dataset is shown in Figure 4.
Firstly, DILI-positive and DILI-negative sets were clustered to nine SOM nodes by the PCA-SOM method, respectively. And then the compounds in each node were randomly divided into five subsets of the same size and assigned to onefold to fivefold. In this way, each fold in the fivefold cross-validation set was as balanced as possible in terms of chemical diversity and the ratio of DILI positives and DILI negatives.
2.6. Model Building. As can be seen from Figure 1, the joint decision-making model consists of some qualified consensus models based on descriptors and fingerprints. Consensus modeling technology has been proved to be effective in improving the overall performance of the (Q)SAR model [38,45,46]. In our study, each consensus model is composed of 20 submodels constructed by SVM. SVM is an excellent kernel-based tool for classification or regression introduced by Vapnik et al [47]. e algorithm employs a kernel function (e.g., linear, polynomial, and radial) to maximize the decision boundary between classes and find an optimal hyperplane able to best discriminate the classes [48]. It is an algorithm suitable for processing a large number of features  and is naturally good at processing classification tasks. SVM has been used for hepatotoxicity predictions in some publications and has achieved better performance than most other algorithms (such as k-nearest neighbor (kNN), naive Bayes (NB), decision tree, random forest (RF), AdaBoost decision trees, and artificial neural network (ANN)) [5,21,31,33,45,49]. So, SVM was also employed to carry out our research in this study. When using descriptors to construct the (Q)SAR model, it is necessary to remove the useless descriptors and keep the important ones according to the importance of descriptors, so as to simplify the structure, avoid overfitting, improve the performance, and facilitate mechanistic interpretation [40,43]. Molecular descriptors quantitatively characterize the physical and chemical properties, topological structure, and electronic properties of compounds. While molecular fingerprints do not use quantitative description, only 1 and 0 indicate the presence and absence of specific molecular fragments or substructures, so it is difficult to evaluate the importance of these fragments or substructures and delete some unimportant bits according to their importance. erefore, the modeling approach using descriptors is different from that using molecular fingerprints, which will be described separately in subsequent sections.

Modeling with PaDEL Descriptors
(1) Preliminary Selection of Descriptors. To delete the unimportant descriptors, we implemented the importance evaluation of descriptors based on the mean decrease impurity (MDI) method. e method has been successfully applied to evaluate the importance of descriptors in our previous study [43]. e basic idea is to scramble the values of each descriptor in turn and observe its influence on the model's accuracy. Variables that have a significant impact on the accuracy of the model are also important. e detailed implementation process of MDI method can be found in the literature [43]. In our study, SVM was employed to implement the classification model in MDI. We only changed the evaluation index Ir (i) to adapt to our DILI classification model. e changed Ir (i) is defined as where ACC 0 is the accuracy of the established SVM model with original data and ACC 1 (i) is the accuracy of the model with the values of the ith descriptor being scrambled.
To avoid the influence of SVM parameters on the importance evaluation of the descriptors, we randomly selected 100 sets of different parameters for the SVM model (C∈ [10 6 ,10 8 ] and g ∈ [1, 100]) and repeated the above evaluation method 100 times. e mean value of the 100 evaluations was used to judge the importance of each descriptor. After evaluating the importance of descriptors, the score of each descriptor is graphically shown in Figure 5.
As can be seen from Figure 5, when the number of descriptors is greater than 150, its importance score is less than 0.01. In other words, the impact of descriptors behind 150 descriptors on the accuracy of the model is less than 1%. In order to simplify the structure of the model to be constructed, we selected 150 relatively important descriptors for further research.
(2) PaDEL Submodels Based on Joint Optimization Strategy. An optimized submodel with carefully selected descriptors is more conducive to improving the performance of the consensus model. MDI is essentially a single factor analysis method. e obtained 150 descriptors may still have multicollinearity.
e MDI method only removes most unnecessary descriptors but does not fully achieve the optimization of descriptors. Only the joint optimization of the descriptors and model parameters can guarantee that the most appropriate molecular descriptors are selected under the optimal model parameters [43]. In this paper, the HQPSO algorithm was employed to implement the joint optimization strategy. Herein, Gaussian radial basis function was selected as the kernel function of SVM to process the high-dimensional nonlinear modeling data, and the SVC_Ctype [50] SVM classification model was implemented. In this case, the penalty factor C and the width of radial basis function g are the parameters that affect the performance of SVM.
When the parameters and input variables (descriptors) of SVM models are jointly optimized, the value of C (C is a big positive integer) ranges from 10 6 to 10 8 , and the value of g ranges from 1 to 100. e 150 descriptors are coded into 15 integers between 0 and 1023 according to the coding method of our previous studies [43], and each integer represents 10 descriptors. A value of 0 (expressed in binary as "0000000000") means that no descriptor was selected, and 1023 (expressed in binary as "1111111111") means that the 10 descriptors were all selected [43]. erefore, the vector to be optimized has 17 dimensions. e first two dimensions represent the parameters C and g, and the last 15 dimensions represent the descriptors. e parameters C, g and descriptors were sought by the HQPSO algorithm based on fivefold cross-validation. e parameters of the HQPSO algorithm are set as follows: the population size is 30, the number of maximum iterations is 1000, and the internal parameters are λ � 1 and L � 10 (the values of λ and L are selected according to the study [42]). e fitness function of the HQPSO algorithm used for the optimization of SVM is defined as where GMEN 0 and GMEN cv5 are defined by equation (5)  training set and fivefold cross-validation dataset, respectively. GMEN 0 and GMEN cv5 will help to strike a balance between sensitivity and specificity of the SVM models trained on an imbalanced chemical dataset. At the same time, the fitness function can effectively avoid overfitting or underfitting, that is, the model has strong classification ability on the training set, but the fivefold cross-validation accuracy is poor, or the fivefold cross-validation accuracy is good, but the model has poor classification ability on the training set. After joint optimization, 20 optimized SVM-based classification models for DILI prediction were obtained with the training set. e optimized parameters of each SVM model are listed in Table 1, and the performance of each SVM model on the training set, fivefold cross-validation set, and test set is also listed in Table 1, respectively. Descriptors selected for model no. 1 are sorted in descending order according to the importance evaluated by the MDI method under the optimized model parameters and shown in Figure 6. e descriptors selected for other models are graphically shown in the Supplementary Materials (Figures S1-S19 in Supplementary.pdf ).
Using structurally diverse (Q)SAR models as submodels to construct a consensus model will help to achieve a better performance [51]. It is easy to obtain structurally diverse (Q) SAR models with different parameters and descriptors. From Table 1, Figure 6, and Figures S1-S19, we can see that these models use different combinations of parameters and descriptors, and they have relatively good predictive performance. From Table 1, we can see that the accuracy of each model is all greater than 69% in the fivefold cross-validation set, greater than 71% in the test set, and greater than 92% in the training set. ese models reached a relatively good balance between sensitivity and specificity in the training set, fivefold validation set, and test set, respectively. e relatively high accuracy of fivefold cross-validation indicates that these models are robust. In a word, the joint optimization strategy and data balancing method are helpful to obtain structurally diverse classification models with good stability and predictive performance.

Modeling with Fingerprints.
In this paper, fingerprints are also employed to construct consensus models to implement a joint decision-making strategy. Due to the particularity characterization method of molecular fingerprints, it is difficult to evaluate the importance of substructures in fingerprints. erefore, we cannot delete unimportant fingerprints according to the importance of each molecular fingerprint. We only delete the substructures that appeared less than 10 times or more than 2598 times in the modeling dataset. e joint optimization of the SubFP-, PubchemFP-, EstateFP-, and KRFP-based submodels is the same as that of the above models constructed by PaDEL descriptors.
In (Q)SAR modeling, the rule of thumb condition (that is, Topliss ratio [52]) requires that the chemical number over the number of selected variables should be at least 5 to avoid overfitting. KRFP has 4860 bits, and the number of substructures is obviously more than that of compounds. So, when the KRFP-based models are optimized, we limit the number of selected variables to no more than 400 to fulfill the Topliss ratio. 20 submodels for each kind of fingerprint were constructed, and the performance of optimized models is listed in the Supplementary Materials (Tables S1-S4 in Supplementary.pdf ).

Consensus Modeling.
e idea of consensus modeling is to integrate several weak learners into a strong learner, which can improve the stability and generalization performance of the (Q)SAR model to some extent [53]. In this study, five consensus models were integrated with submodels constructed by using descriptors and fingerprints, respectively. To fairly compare the performance of these consensus models, we set the threshold T to 50% according to the commonly used relative majority principle [33,38].
at is to say, for each consensus model, if the number of DILI-positive outputs in its submodels is greater than that of DILI-negative outputs, we define the consensus output as DILI positive or otherwise DILI negative. e comparison results are listed in Table 2. It can be seen from Table 1, Tables Tables S1-S4 in the Supplementary Materials, and Table 2 that the overall performance of these consensus models is all better than that of their submodels. erefore, the consensus modeling strategy is beneficial to improving the performance of classification models. From Table 2, we can also see that the performance of consensus models constructed by different features is different in ACC, SE, SP, and MCC. e ROC curves on the test set in Figure 7 also show the difference. e values of the area under the ROC curve (AUC) are also listed in Table 2.
In Table 2, consensus modes based on PaDEL, Pub-chemFP, KRFP, and SubFP passed the criteria, i.e., ACC cv5 > 0.7. erefore, the four models are used to integrate our joint decision-making model.    [54]. erefore, the ADs of these consensus models were analyzed before constructing the joint decision-making model.
(1) AD of PaDEL Consensus Model. In the PaDEL consensus model, the frequency of each descriptor used in the submodels is different, which indicates that different descriptors have different contributions to the output of the model. So, the frequency-weighted and distance (FWD)based method proposed in our previous study [43] was employed to define the AD of the consensus model. e essential difference between the FWD method and the traditional leverage method [55] is that it considers the importance of descriptors, and it can more reasonably define the AD of the consensus model. e detailed information of definition and implementation steps of the FWD method can be found in [43]. Figure 8 shows the AD of the PaDEL consensus model defined by the FWD method.
In Figure 8, black circles represent compounds in the training set and blue crosses represent compounds in the test set. e vertical red line represents a warning h * � 3.0776. e compounds outside the AD are plotted on the right side of the vertical red line, and the black and blue numbers are the serial number of the compounds in the training set and test set, respectively.
(2) AD of Fingerprint-Based Consensus Model. Fingerprints are different from descriptors. e FWD method and traditional leverage method are not suitable for fingerprints. In our study, the AD defined method based on the Tanimoto similarity index [39] is proposed. e definition and implementation steps of the method are as follows: Step 1: calculate the mean value and standard deviation of Tanimoto similarity index of the training set characterized by fingerprints and record them as T av and δ.
Step 2: for the ith (i � 1,2,3, . . ., N, where N is the number of compounds in the training set) compound in the training set, calculate its Tanimoto similarity index T tr (i,j) (i≠j) with all other compounds and record the mean value as T trav (i). If T trav (i) < T av − 3δ, consider the compound is outside the AD. Otherwise, it is inside the AD. For the ith (i � 1,2,3, . . ., N 1 , where N 1 is the number of compounds in the test set) compound in the test set, calculate its Tanimoto similarity index T te (i) with each compound in the training set and record the mean value as T teav (i). If T teav (i) < T av − 3δ, consider the compound is outside the AD. Otherwise, it is inside the AD. Figure 9 shows the AD of the PubchemFP consensus model.
In Figure 9, black circles represent compounds in the training set and blue crosses represent compounds in the test set. e vertical red line represents a warning h * � 0.91722 (h * � 1−(T av − 3δ)). e compounds outside the AD are plotted on the right side of the vertical red line, and the black and blue numbers are the serial number of the compounds in the training set and test set, respectively. e AD plot of the SubFP and KRFP consensus models is graphically shown in the Supplementary Materials ( Figures S20 and S21 in Supplementary.pdf ).
e AD coverage of the PaDEL consensus model on training and test sets is 0.9731 and 0.9716, and the AD coverage of the PubchemFP consensus model on training and test sets is 0.9817 and 0.9811, respectively. e AD coverage of the KRFP consensus model on training and test sets is 0.9817 and 0.9867, and the AD coverage of the SubFP consensus model on training and test sets is 0.9880 and 0.9867, respectively. at is to say, for each consensus model, most of the compounds are inside its AD. So, these consensus models have a wide AD.
As can be seen from Figure 10, the molecular structure of these compounds is relatively simple, and all contain halogen elements. ey have poor structural similarity to most compounds in the training set. is indicates that the above AD defined methods can detect structurally abnormal compounds from the dataset effectively, although different features and AD definition methods are used. So, the AD defined methods based on the FWD method and Tanimoto similarity index are suitable for the above models.

Joint Decision-Making Model Based on Overwhelming Majority Principle.
As is known to all, in both business decisions and political elections, the adoption of  Journal of Chemistry overwhelming majority principle is all conducive to improving the credibility of decision-making, although sometimes the decision-making results may not be obtained.
at is to say, a decision that cannot be passed by the majority is invalid. To improve the credibility and accuracy of the prediction results of these consensus models, we adopt the overwhelming majority principle instead of the commonly used relative majority principle (threshold T � 50%) to determine the output of consensus models. To solve the problem of invalid decisions, the output of the consensus model is defined as DILI positive, DILI negative, and DILI uncertain. We set the values of threshold T to 70%, 80%, and 90%, respectively, to reveal the influence of overwhelming majority principle on the performance of    consensus models. For a compound, if T or more of the predicted results are DILI positive or DILI negative, we regard the prediction results of the consensus model as DILI positive or DILI negative or otherwise DILI uncertain. e performance of consensus models on the test set that increases with T is listed in Table 3. It can be seen from Table 3 that, with the increase of T, the performance of each consensus model has obviously improved, but the coverage (ratio of compounds that can give definite prediction results) is significantly reduced. In other words, the performance of the model is improved, but the application scope is narrowed. In this paper, the application scope narrowing caused by the overwhelming majority principle will be solved by the joint decisionmaking strategy. First, the four qualified consensus models are sorted in descending order according to the values of ACC cv5 (T � 50%), and then they are concatenated to form a joint decision-making model. For a compound to be predicted, the working flow of the joint decision-making model is shown in Figure 11.
As shown in Figure 11, for a compound to be predicted, if the PubchemFP consensus model cannot give a clear prediction result, the compound will enter the PaDEL consensus model for prediction, and so on. e final output of the joint decision-making model is DILI positive, DILI negative, or DILI uncertain.

Diversity Analysis of Modeling Dataset.
In practical application, the high diversity of the modeling data is helpful to obtain a (Q)SAR model with a wide AD. To illustrate the diversity of our dataset, we applied the radar chart to explore the chemical space of the 2608 compounds. e PCA method was employed to transform the modeling data characterized by 150 selected PaDEL descriptors, and then the first 10 principal components were used to draw the radar map. e radar chart is shown in Figure 12. It can be seen from Figure 12 that the 2608 compounds in the modeling dataset covered a sufficiently large chemical space.
To further explore the chemical diversity of the dataset, the Tanimoto similarity index of the modeling dataset was also calculated using PubchemFP, SubFP, and KRFP, respectively. e heat map of the Tanimoto similarity index based on PubchemFP is shown in Figure 13. e average Tanimoto similarity index calculated from PubchemFP, SubFP, and KRFP is 0.335, 0.372, and 0.165, respectively. A lower Tanimoto similarity index means a more diverse dataset [56]. So, the chemical diversity of our modeling set is significant.
In this study, the chemical space was analyzed using PCA to check the quality of data splitting on training and test sets [57]. As shown in the PCA plot of the compounds based on 150 selected PaDEL descriptors (Figure 14), the compounds in the test set were basically distributed within the chemical space of the training set. erefore, the use of the chemically diverse modeling set as well as the balanced training set and the test set in this study is conducive to the construction of DILI models with a wide AD and the achievement of fair test performance.

Performance Analysis of Joint Decision-Making Model.
In order to reveal the ability of the joint decision-making model in solving the problem of application scope narrowing caused by the overwhelming majority principle, and to obtain the performance of the joint decision-making model changing with the threshold T, we set T as 70%, 80%, and 90%, respectively. e performance of the joint decision-making model on the test set is listed in Table 4. As can be seen from Table 4, when T is 70%, the clear prediction results can be obtained for 99.24% of the compounds in the test set, and the ACC of the predicted results is 80.0%.
From Tables 2, 3, and 4, we can see that ACC with T � 70% is obviously better than the consensus model with T � 50%.
e AD with T � 70% is obviously better than consensus models with T � 70% and even better than consensus models with T � 50%. erefore, the combination of overwhelming majority principle and joint decision-making strategy can effectively improve the performance of the consensus models. However, with the increase of T, the accuracy of the joint decision-making model does not improve significantly, but the coverage of the model is significantly reduced. erefore, we believe that T � 70% is appropriate in this study.

Application to External Validation Set.
To further test the performance of the joint decision-making model, an independent external validation set containing 390 compounds was collected from DILIrank (the detailed information of the 390 compounds is listed in the Supplementary Materials: External_validation_Set.pdf ).
e DILIrank dataset was developed by Chen et al. [58], and it includes 1036 marketed drugs approved by the US FDA before 2010. In the dataset, the DILI risk of the drugs was marked as v No-DILI concern, v Less-DILI concern, v Most-DILI concern, and v Ambiguous-DILI concern. First, we removed the compounds overlapping with our training set or classified as v Ambiguous-DILI concern.
e v Most-DILI concern and v Less-DILI concern compounds were marked as DILI positive, and v No-DILI concern compounds were marked as DILI negative. Next, we screened the remaining 390 compounds (172 DILI positives/218 DILI negatives) using the joint decisionmaking model. Finally, for the independent external dataset, the clear prediction results are obtained for 98.20% of the compounds.
e joint decision-making model achieves a prediction accuracy with ACC � 79.9%, SE � 97.1%, SP � 66.0%, and MCC � 64.7%. e performance on the external validation dataset indicates that the model we constructed has good performance and a wide AD.

Contributions of Selected PaDEL Descriptors to DILI.
In order to reveal the association between PaDEL descriptors and DILI, we calculated the mean value of each descriptor in Journal of Chemistry 11 DILI positives and DILI negatives, respectively, and defined the difference index as where DI(i) is the difference index of the ith descriptor and V PM (i), V NM (i), and V M (i) are the average value of the ith descriptor in DILI positives, DILI negatives, and the whole modeling dataset, respectively. e top 10 descriptors with a significant difference between DILI positives and DILI negatives are listed in Table 5.
It can be seen from Table 5 that compounds with higher values of ALogP, MATS2c, MATS7s, AATSC6v, and ATSC4i are more likely to be hepatotoxic positive. Compounds with lower values of MATS3c, ATSC1p, MATS2i, AATSC2v, and AATSC1p are more likely to be hepatotoxic positive. Among these descriptors, only MATS3c, ALogP, and ATSC1p had significant differences between hepatotoxic-positive and -negative compounds (their DI values are greater than 1). Among them, MATS3c is related to the charge state of compound molecules, and the difference between hepatotoxic-negative and -positive compounds is the most significant. ALogP represents the lipid water partition coefficient of the compound. e greater its value, the stronger the lipophilicity of the compound and the easier it is to present hepatotoxicity. ATSC1p is related to the polarizability of compound molecules. e stronger the polarizability, the greater the toxicity of molecules.
We plot the values of "MATS3c" and "ALogP" in DILI positives and DILI negatives in Figures 15(a) and 15(b), respectively.
Although for the two descriptors, the average value of DILI positives and DILI negatives is significantly different, there is still no clear boundary between them. is indicates that these descriptors have a weak differentiating effect on DILI potential and the DILI potential cannot be simply classified by individual or several simple descriptors. It is well known that DILI has complicated mechanisms, and     most mechanisms involve chemical reactions with proteins and other macromolecules in liver cells [44]. erefore, it is necessary to adopt the nonlinear modeling method to establish the DILI model. In order to further reveal the relationship between important descriptors and hepatotoxicity, we plot the first 50 important descriptors and their importance scores among the 150 descriptors used in the PaDEL models in Figure 16. From Figure 16, we made the following inferences: (1) "Mv" (mean atomic van der Waals volumes (scaled on a carbon atom)) is the most important descriptor that affects the performance of the model. is shows that van der Waals volumes will affect the hepatotoxicity of the compound. (2) "AMR" is the second important descriptor that affects the performance of the model. "AMR" is molar refractivity. AMR can be used as a measure of electron polarizability in molecules. Previous studies have shown that [15,17], for aquatic organisms, the stronger the polarizability, the greater the toxicity of the molecules. As an important descriptor, "MP" is also related to polarizability. (3) "MLogP" is a relatively important descriptor in the model, which predicts the oil-water partition coefficient of compounds. e greater the MLogP value, the stronger the fat solubility of the compound and the more likely it is to cause hepatotoxicity.   (4) "MATS2i" is also a relatively important descriptor. It is related to the first ionization potential. e first ionization potential is the energy required for a gaseous atom in the ground state to lose one electron in its outermost layer. e larger the initial ionization energy, the harder it is for an atom to lose an electron    Descriptor Score Figure 16: e first 50 important descriptors and their importance scores used in the PaDEL models. and the less likely the compound will both react and produce toxicity.

Structural Alert Identification.
To reveal the chemical structures related to DILI, several privileged substructures responsible for DILI positives were identified by frequency analysis [39] from PubchemFP, SubFP, EstateFP, and KRFP. e DILI-positive and DILI-negative frequencies of each substructure in the modeling dataset were calculated by where F DP (i) and F DN (i) are the frequency of the ith substructure in DILI positives and DILI negatives, respectively; N SP (i) and N SN (i) are the number of DILI positives and DILI negatives containing the ith substructure, respectively; N T is the number of compounds in the modeling dataset; and N P and N N are the number of DILI positives and DILI negatives in the modeling dataset, respectively. A substructure appearing in DILI positives far more frequently than DILI negatives could be considered a privileged substructure responsible for DILI positives. In this study, only the substructures present in 10 or more DILI positives and F DN /F DP <0.2 were analyzed. Finally, 25 substructures were identified as privileged substructures and are listed in Table 6. e detailed information of these privileged substructures is listed in the Supplementary Materials (Table S5 in Supplementary.pdf ). e chemical structures of privileged substructures or representative drugs containing privileged substructures listed in Table 6 are graphically shown in Figure 17. All the representative drugs containing privileged substructures listed in Table 6 have certain hepatotoxicity. Halothane (No. 1) is a general inhalation anesthetic, and it is forbidden for patients with liver dysfunction and biliary tract diseases. Nimodipine (No. 7) is a peripheral vasodilator and hypotensive. It can lead to hepatitis and jaundice in individual patients and elevated alkaline phosphatase and alanine aminotransferase. Trovafloxacin (No. 12), a fluoroquinolone antibiotic, has high anti-Gram-positive bacteria activity, especially against Streptococcus pneumoniae. It has significant hepatotoxicity, which can lead to elevated transaminase, hepatitis symptoms, severe liver damage, and even liver failure. Due to severe hepatotoxicity, it has been withdrawn from the US market. Fluvastatin (No. 14) is the first synthetic cholesterol-lowering drug. It can lead to continuous increase of alanine aminotransferase (ALT) or aspartate aminotransferase (AST). Minocycline (No. 16) is a kind of tetracycline antibiotic with broadspectrum antibacterial property. It can cause nausea, vomiting, jaundice, fatty liver, elevated serum aminotransferase, hematemesis, hematochezia, etc. In severe cases, it can cause coma and death. Pazufloxacin (No. 17) is a fluoroquinolone antibiotic. It can cause liver dysfunction and jaundice. Mithramycin (No. 24) is the only indication for testicular embryonal cell carcinoma, but it has considerable toxicity. e liver and kidney function damage is more prominent, and the liver and kidney function should be checked frequently in the treatment. Information about these drugs can be found in the FDA database (https://www. fda.gov/Drugs) or online drug database (https://www. druges.com).
Form Table 6 and Figure 17, we can see that fluorine-or bromine-containing groups and amine or nitro derivatives were identified as privileged substructures. Some fluorinecontaining groups and amine or nitro derivatives were also identified as privileged substructures in the literature [5,39]. Although different datasets and identification indexes were used to identify the privileged substructures in this study and the literature [5], 10 identical KRFP substructures (marked with * in Table 5) were simultaneously identified as privileged substructures.
is indicates that our privileged substructure identification method is credible. To a large extent, these 10 privileged substructures can be regarded as structural alerts responsible for the prediction of DILIpositive compounds in drug discovery, and other newly discovery privileged substructures listed in Table 6 should also be alert in DILI prediction.
In these privileged substructures, six substructures are fluorine-containing groups. e carbon-fluorine bond is commonly present in about one-fifth of pharmaceuticals, and it is metabolically stable in general. Fluorine acts as a bioisostere of the hydrogen atom in pharmaceuticals, and the lipophilicity of drugs would increase extremely with the introduction of fluorine atoms, thus enriching the intracellular concentration of hepatotoxic drugs [39]. ree substructures are bromine-containing groups. Bromine and fluorine are halogen functional groups with similar chemical properties. Specific fragment structures containing bromine atoms may also increase the risk of hepatotoxicity. 15 substructures are amine or nitro derivatives, and this class of chemicals could bind to proteins through covalent bonds with cysteine residues via Michael addition reactions and result in hepatotoxicity [59].

Comparison with Other Published Models.
To better illustrate the superiority of the joint decision-making model, we developed a detailed comparative analysis between the proposed model and the main published DILI models. Although many models have been developed for hepatotoxicity prediction, our interest is only focused on models constructed with a dataset larger than half of our dataset (N m > 1304) because models with small datasets make it difficult to collect sufficient and diverse molecular information to develop a superior model for practical application.
e results of this comparison can be found in Table 7.
As can be seen from Table 7, different models use different types and quantities of descriptors. However, most models use descriptors related to lipophilicity and molecular polarizability, which shows that there is a great relationship between them and hepatotoxicity. e main purpose of comparative analysis does not aim to rank the performance of these models but to give readers a general overview of the advantages of the joint decisionmaking model because all the previous models were constructed under different conditions (dataset, algorithms, features, and so on).
First, we compare and analyze whether these models conform to OECD principles. e widely accepted OECD principles for ensuring the effectiveness of (Q)SAR model state that a compliant (Q)SAR model must meet the following five conditions: (1) a defined endpoint, (2) an unambiguous algorithm, (3) a definite AD of applicability, (4) appropriate measures of goodness-of-fit, robustness, and predictability of the model, and (5) a reasonable mechanism explanation, if possible [60]. In Table 6, the study [40] and our model are consistent with the principles of OECD, but there is no mechanism explanation in [40]. Among these models, only our model gives specific model parameters, which makes the modeling process more transparent. ere is no cross-validation in [9,31,39]. Although cross-validation was carried out in [5] for determining the parameters of their model, the accuracy of cross-validation is not given clearly. So, the robustness of these models cannot be well explained. Studies [5,9,31,39] do not give clear ADs, which limits the practical application of the model. Second, we compare the modeling dataset of these models. Compared with the modeling dataset, our dataset contains 2608 compounds, which is more than that in other models in Table 6. Although 3712 compounds were used in [31], the largest model constructed in [31] only used 2171 compounds. We confirmed the diversity of our modeling dataset through radar chart and Tanimoto similarity index, and the AD coverage in test and external validation sets shows that the model constructed by a diverse dataset really obtained a wide AD. e modeling set was dealt with the PCA-SOM method to obtain diverse and balanced training and test sets. erefore, fairer test accuracy can be obtained in our model. For other models in Table 6, the study [39] carried out data diversity analysis, but it did not balance the training and test sets with any technical means. In [39], the specificity on the test set and external validation set is 34.38% and 37.93%, respectively. Such specificity makes it difficult for practical application, although the model has good sensitivity.
Finally, we compare the performance of these models. e study [5] and joint decision-making model proposed in this paper have good performance on the test set and external validation set, and the performance of other models is obviously worse than them. Although the study [40] has a good sensitivity on the external validation set (88.9%), the sensitivity comes from an external validation set of only 18 compounds, and the specificity is not given. Meanwhile, a serious imbalance between sensitivity (39.2 ± 2.6%) and specificity (87.1 ± 2.6%) of cross-validation indicates that the model developed in [40] is difficult to be applied in practice. Although the study [5] has better prediction ability on test and external validation sets than our model, the accuracy of cross-validation, AD coverage, and parameters of the model are not given clearly. Furthermore, the number of compounds in the test set and external validation set used in [5] is 413 and 151, respectively, significantly less than that in this study. In the (Q)SAR model, a larger validation set is easier to obtain a reliable evaluation result, so we believe that the two models cannot be compared fairly. Considering the predictive ability and AD coverage of the model, as well as the standardization of the modeling process, we believe that the joint decision-making model proposed in this paper is superior to other models in Table 6.
Overall, there is compelling evidence that we built a normative and robust DILI model with a wide AD and got a better performance than most of comparative studies. e modeling process is normative and transparent, which is in line with OECD principles, and the model can be replicated by other researchers. However, our study still has some limitations, which also exist in the most of previous studies. For example, like most studies, we only constructed the twoclass DILI model to distinguish whether the inquiry compound has hepatotoxicity or not, and the effects of dose and  metabolites on DILI were not considered in our study. e reason is that we lack adequate, reliable DILI data with rich information, such as dose, severity, and type of hepatotoxicity. If there are more detailed data sources available, all the limitations will be broken, and we will be able to construct a dose-dependent and metabolite-related multiclass DILI model that takes into account the severity and type of hepatotoxicity with the method proposed in this study.

Conclusions
In this study, we developed a new joint decision-making model for DILI prediction under the OECD principle. A relatively large DILI dataset containing 2608 compounds was collected and characterized by PaDEL, PubchemFP, SubFP, EstateFP, and KRFP for modeling. Consensus models for joint decision-making were constructed with structurally diverse submodels optimized by the joint optimization method. e overwhelming majority principle obviously improved the accuracy and credibility of the consensus model's outputs. e application scope narrowing caused by the overwhelming majority principle was successfully solved by the joint decision-making strategy. e joint decision-making model performed well on both 528 test compounds and 390 independent external validation compounds. Furthermore, the contributions of top 10 PaDEL descriptors with a significant difference between DILI-positive and DILI-negative compounds were analyzed. 25 privileged substructures responsible for DILI positives were identified from SubFP, PubchemFP, and KRFP. We hope these structural alerts can provide some useful warning information for pharmaceutical chemists. e comparison results show that the joint decision-making model has certain advantages in data size, standardization, and transparency of the modeling process, coverage of the AD, and performance on test and external verification sets. As far as we know, the overwhelming majority principle combined with the joint decision-making strategy for (Q)SAR modeling has not been used in (Q)SAR study. Under the OECD principle, this is the best (Q)SAR model for DILI prediction with such a large data size. It will be a promising tool for virtual screening in the early stage of drug discovery.

Data Availability
e data used to support the findings of this study are available from this article and supplementary information file.

Conflicts of Interest
e authors declare that there are no conflicts of interest. SVM: support vector machine; GA-SVM: genetic algorithm-support vector machine; RF: random forest; a detailed information of these methods can be found in [40]; NB: naive Bayes; kNN: k-nearest neighbor; CT: classification tree; N m is the number of compounds in the modeling dataset; N de is the number of descriptors/fingerprints used in (Q)SAR models; CV: cross-validation; CV5: fivefold cross-validation; EV: external validation; TV: validation on the test set; CSM: consensus model; JDM: joint decision-making model. 18 Journal of Chemistry