Prediction and Screening Model for Products Based on Fusion Regression and XGBoost Classification

Performance prediction for candidates, and screening based on the predicted performance values, are at the core of product development. For example, the performance prediction and screening of equipment components and parts are an important guarantee of the reliability of equipment products, and the prediction and screening of drug bioactivity and related properties are key to pharmaceutical product development. The main reasons for failure in drug discovery are the low bioactivity of candidate compounds and deficiencies in their efficacy and safety, which are related to the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of the compounds. It is therefore necessary to predict bioactivity values and evaluate ADMET properties of candidate compounds quickly and systematically in the early stage of drug discovery. In this paper, a data-driven screening and prediction model for pharmaceutical products is proposed to screen drug candidates with higher bioactivity values and better ADMET properties. First, a quantitative prediction method for the bioactivity value is proposed using the fusion regression of the light gradient boosting machine (LGBM) and a neural network based on backpropagation (BP-NN). Then, a prediction method for ADMET properties is proposed using XGBoost. Based on the predicted bioactivity value and ADMET properties, the BVAP method is defined to screen the drug candidates. The screening model is validated on a data set of compounds that antagonize ERα activity, on which the mean square error (MSE) of the fusion regression is 1.1496 and the XGBoost prediction accuracies for the ADMET properties are 94.0% for Caco-2, 95.7% for CYP3A4, 89.4% for hERG, 88.6% for HOB, and 96.2% for MN. Compared with commonly used methods for ADMET properties such as SVM, RF, KNN, LDA, and NB, XGBoost achieves the highest prediction accuracy and AUC value in this paper, and can therefore better guide the screening of pharmaceutical product candidates with good bioactivity, pharmacokinetic properties, and safety.


Introduction
The inherent reliability of equipment depends on the reliable design of the product. Therefore, before electronic components are installed in a machine or piece of equipment, problematic components should be eliminated as far as possible, and components need to be screened based on their predicted performance values to improve the reliability of the equipment system. Similarly, drug screening inspects and tests substances that may become drug products and predicts their properties from the inspection and test values, in order to find medicinal value and clinical uses and to provide data support for the research and development of new drugs. Drug discovery is a high-cost, high-risk process. According to Pharmaceutical Research and Manufacturers of America (PhRMA) statistics, it takes an average of 10 to 15 years and costs $2.6 billion for each drug to go from early discovery to Food and Drug Administration (FDA) approval. Despite this, PhRMA statistics show that the US biopharmaceutical industry's investment in new drug discovery is still gradually rising, from $15.2 billion in 1995 to about $90 billion in 2016 [1]. Across the entire drug discovery process, the cost and time consumption of the clinical stage are huge, but the throughput is small [2]. In other words, stopping the development of a drug in the clinical stage brings a huge economic loss. Therefore, improving the success rate of drug discovery and identifying, at an early stage, drug candidates that are likely to fail is a problem that pharmaceutical companies have long been trying to solve [3-5].
The main reasons for the failure of contemporary drug research and development are the low bioactivity of the candidate compounds and the deficiencies in their efficacy and safety, which are related to the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of the compounds [6,7]. At present, in vitro or in vivo experiments are mainly used to test these properties. However, these methods are costly and time-consuming, and, because of species differences, their results are often difficult to extrapolate from in vitro to in vivo or from animals to humans. On the other hand, the current optimization of ADMET properties mainly relies on expert experience, which comes partly from knowledge of chemical biology and partly from summaries of previous experiments [8], and is ultimately limited. With the production of more and more experimental data and the development of computer technology, increasing attention is paid to finding regularities and building models that predict and optimize compounds in a data-driven way. Computational models are not only low in cost but, to some extent, may also be more accurate than experiments and smarter than humans [9,10]. At the same time, related technologies such as machine learning are increasingly used to predict and screen compounds with specific pharmacodynamics and ADMET properties, which has promoted drug discovery and evaluation [11].
A compound activity prediction model is usually established to screen potentially active compounds. The common method is to collect, for a disease-related target, a series of compounds that act on the target together with their bioactivity data, then use a series of molecular structure descriptors as independent variables and the bioactivity value as the dependent variable to construct a quantitative structure-activity relationship (QSAR) model [12], and finally use the model to predict new compound molecules with better bioactivity or to guide the structural optimization of existing active compounds [13]. In an actual QSAR model, the dependent value is experimentally measured and usually correlates positively with bioactivity, that is, the greater the value, the higher the bioactivity. In addition to good bioactivity, a compound must also have good pharmacokinetic properties and safety in the human body, collectively known as ADMET, to become a drug candidate [14]. Here, ADME mainly refers to the pharmacokinetic properties of the compound, which describe how the concentration of the compound in the organism changes over time, and T mainly refers to the toxic and side effects that the compound may produce in the human body. No matter how good a compound's bioactivity is, if its ADMET properties are poor (for example, it is difficult for the human body to absorb, it is metabolized too quickly, or it is toxic), it is still unlikely to become a drug candidate, so ADMET properties also need to be predicted and optimized. To facilitate modeling and prediction, each ADMET property is usually treated as a binary classification problem, where 1 means good properties and 0 means poor. However, the prediction accuracy of existing methods is low, so the screening of drug candidates remains inefficient and costly [15].
Therefore, we propose a prediction and screening model for drug candidates based on fusion regression and extreme gradient boosting (XGBoost) and verify it on a data set of compounds that antagonize estrogen receptor α (ERα). The fusion regression uses the light gradient boosting machine (LGBM) for further feature extraction and BP-NN to predict the bioactivity value, and XGBoost binary classification predicts the ADMET properties. The verification results show that XGBoost works better than the other methods compared. Based on the predicted bioactivity value and ADMET properties, the bioactivity value and ADMET properties (BVAP) method can be used to screen drug candidates quantitatively, which can improve the success rate of drug candidate screening and guide the drug screening process. The rest of this paper is organized as follows: Section 2 introduces the related work on the prediction of bioactivity value and ADMET properties. Section 3 introduces the prediction and screening model. Section 4 introduces the verification and experiment. Section 5 presents the conclusion, some limitations, and future extensions of the paper. Finally, some patents are declared, and relevant references are provided.

Related Work
Lei et al. [16] used six machine learning methods to establish the prediction model, including relevance vector machine (RVM), support vector machine (SVM), regularized random forest (RRF), extreme gradient boosting (XGBoost), naive Bayes (NB), and linear discriminant analysis (LDA). Erić et al. [17] explored artificial neural network (ANN) and SVM ensemble-based models, as well as knowledge-based approaches to descriptor selection. Jiang et al. [18] used seven machine learning methods, including a deep learning method, two ensemble learning methods, and four classical machine learning methods, to build classification models. Nayarisseri et al. [19] provided an overview of applications of machine learning based tools for drug identification, QSAR modeling, and ADMET analysis. Zhang et al. [20] summarized the history of machine learning and provided insight into recently developed deep learning approaches in rational drug discovery. Cheirdaris [21] provided an overview of the applications of ANNs. Yang et al. [22] developed PySmash to generate different types of representative substructures for safety evaluation. Hessler and Baringhaus [23] put forward ANNs such as recurrent neural networks (RNNs) for drug discovery. Raju et al. [24] integrated in silico approaches to identify selective inhibitors. Lei et al. [25] developed a series of QSAR models for predicting urinary tract toxicity. Dobchev et al. [26] gave an overview of the strategies and current progress in using machine learning methods for drug design. Hsiao et al. [27] applied machine learning methods for classification as well as regression analysis to a publicly available intravital data set to assess the intrinsic metabolic clearance in humans. These results suggest the usefulness of machine learning techniques for deriving robust and predictive models in the area of intravital ADMET modeling.
Their suggestions provided ideas for our research. Kovalishyn and Poda [28] reported a batch pruning algorithm for variable selection. They combined ANN ensemble learning and the Kohonen self-organizing map for clustering of descriptors. Gola et al. [29] considered advances in statistical modeling techniques for predictive ADMET models in drug discovery. Sun et al. [30] performed a QSAR and classification study based on a total of 134 base analogs related to their ED50 values. Li et al. [31] trained five machine learning classifiers, that is, K-nearest neighbor (KNN), SVM, random forest (RF), XGBoost, and DNN, on each feature set of histone deacetylase 3 to facilitate prospective screening for inhibitors. Zhang et al. [32] used a genetic algorithm to select important molecular descriptors and used NB for the in silico prediction model.
According to the related work, data-driven methods such as machine learning are increasingly applied to predict bioactivity value and ADMET properties. This work can be divided into two aspects.
On the one hand, when screening potentially active compounds, the main methods are ANN and other basic machine learning algorithms, such as GA, MLP, and RFSA. In this paper, we first use LGBM to mine further features inside the data and then use BP-NN, that is, ANN based on backpropagation, to predict the bioactivity value more accurately.
On the other hand, when predicting the ADMET properties, the main methods are SVM, RF, KNN, NB, LDA, and their variants. In this paper, we use XGBoost to predict ADMET properties. The advantages and disadvantages of the above methods are shown in Table 1.
However, few papers study the bioactivity value and the ADMET properties simultaneously and combine them into one model. The screening model we propose predicts both, to help screen pharmaceutical product candidates with good bioactivity, pharmacokinetic properties, and safety.

Model and Methods
The pharmaceutical product screening model is divided into four parts; its flowchart is shown in Figure 1.
As Figure 1 shows, in part 1 the data are preprocessed according to the preprocessing rules, and the result is the input for parts 2 and 3. In part 2, we use the fusion regression to predict the bioactivity value: in part 2.1, LGBM performs further feature extraction to obtain the molecular descriptors related to the bioactivity value, and in part 2.2, BP-NN uses these molecular descriptors to predict the bioactivity value. In part 3, we use XGBoost to predict ADMET properties. In part 4, we propose a new BVAP method to screen drug candidates. The model can screen drug candidates with better bioactivity values and ADMET properties, thereby effectively serving the screening and preparation of drug candidates.

Data Preprocessing.
A molecular descriptor is a quantitative description of a drug molecule's structure and physicochemical properties. Usually, many molecular descriptors carry little information, or there is a high linear correlation between molecular descriptors, so an appropriate subset should be extracted to give the model better predictive ability. Therefore, to remove low-information or redundant variables, the following steps are used:
(1) If the relative variance of a molecular descriptor is less than σ, delete the molecular descriptor.
(2) If the correlation coefficient of a pair of molecular descriptors is greater than C, delete one of the two molecular descriptors.
Here, σ and C are constants that need to be defined. Generally, the larger the relative variance, the more information the variable carries; the larger the correlation coefficient, the higher the redundancy between the two variables. Therefore, we delete molecular descriptors with a low relative variance and one descriptor from each pair with a high correlation coefficient.
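The two filtering rules can be sketched in plain Python as follows. This is a minimal illustration and not the code used in the paper: the function names are hypothetical, "relative variance" is assumed here to mean the variance of a descriptor after min-max scaling to [0, 1], the absolute value of the Pearson coefficient is compared with C, and for each highly correlated pair the later column is the one dropped.

```python
from math import sqrt
from statistics import mean, pvariance


def relative_variance(col):
    """Variance after min-max scaling to [0, 1] (an assumed definition)."""
    lo, hi = min(col), max(col)
    if hi == lo:
        return 0.0  # a constant column carries no information
    return pvariance([(v - lo) / (hi - lo) for v in col])


def pearson(a, b):
    """Pearson correlation coefficient of two equally long columns."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / sqrt(va * vb)


def filter_descriptors(table, sigma, c):
    """table maps descriptor name -> list of values; returns the kept names."""
    # Rule (1): drop descriptors whose relative variance is below sigma.
    kept = [name for name, col in table.items()
            if relative_variance(col) >= sigma]
    # Rule (2): keep a descriptor only if it is not highly correlated
    # with any descriptor that has already been selected.
    selected = []
    for name in kept:
        if all(abs(pearson(table[name], table[s])) <= c for s in selected):
            selected.append(name)
    return selected


# Toy data (illustrative values only).
demo = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [2.0, 4.0, 6.0, 8.1],  # almost a multiple of d1, so redundant
    "d3": [5.0, 5.0, 5.0, 5.0],  # constant, so low information
    "d4": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_descriptors(demo, sigma=0.05, c=0.95)
print(selected)  # ['d1', 'd4']
```

In this toy example, d3 is removed by rule (1) and d2 by rule (2). Which member of a correlated pair is dropped is a free choice in the text; keeping the first one encountered is only one possibility.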

Quantitative Prediction with Fusion Regression.
In this part, the fusion regression combines LGBM and BP-NN to make the quantitative prediction. LGBM is used to extract further features related to the bioactivity value and to reduce the dimension of the BP-NN input layer, thereby reducing the complexity of its training; BP-NN then uses the extracted features to predict the bioactivity value.

LGBM for Feature Extraction.
LGBM is a gradient boosting framework based on the classification and regression tree. The negative gradient of the loss function is used as the approximate residual value of the current subtree to fit the new subtree. Its advantage is that, while retaining the large-gradient samples, it randomly retains some small-gradient samples and at the same time amplifies the information gain brought by the small-gradient samples.
In terms of feature extraction, LGBM optimizes the support for categorical features. Categorical features can be input directly without additional expansion.
LGBM uses the basic idea of the gradient boosting decision tree and measures the importance of a feature by the total number of times the feature is used to split in all decision trees [33]. The features are then sorted in descending order of importance, and the search starts from the complete set of sample features. According to the accuracy of the result, it is judged whether to remove the feature with the lowest importance, and so on, to realize the feature selection. The flowchart of LGBM's feature extraction is shown in Figure 2.
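The elimination loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the split counts would come from a trained LGBM model but are passed here as a plain dictionary, `score_fn` is a hypothetical callback that returns a validation score (higher is better) for a feature subset, and the loop is assumed to stop as soon as removing the least important feature lowers the score.

```python
def backward_select(split_counts, score_fn):
    """Drop the least important feature while the score does not get worse."""
    # Sort features by split-count importance, most important first.
    features = sorted(split_counts, key=split_counts.get, reverse=True)
    best = score_fn(features)
    while len(features) > 1:
        candidate = features[:-1]  # remove the least important feature
        score = score_fn(candidate)
        if score < best:
            break  # removal hurts, so keep the current subset
        features, best = candidate, score
    return features


# Toy setup (hypothetical numbers): only "a" and "b" are truly useful,
# and every extra feature costs a small penalty.
counts = {"a": 40, "b": 25, "c": 3, "d": 1}
useful = {"a", "b"}


def toy_score(feats):
    return sum(1.0 for f in feats if f in useful) - 0.1 * len(feats)


chosen = backward_select(counts, toy_score)
print(chosen)  # ['a', 'b']
```

In practice `score_fn` would retrain or re-evaluate the model on the candidate subset, which is the expensive step; the loop itself is cheap.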
Here, we choose LGBM to reduce the input dimension of BP-NN. This not only reduces the time for BP-NN training but also retains the features most correlated with the bioactivity value.

Table 1: Advantages and disadvantages of the methods.
Method | References | Advantages | Disadvantages
SVM | [16-19, 25, 29, 31] | SVM avoids the complexity of high-dimensional space and directly uses the kernel function of this space | SVM is difficult to implement for large training samples, and it is difficult to determine the kernel function
RF | [16, 19, 25-27, 31] | RF can handle very high-dimensional data without feature selection | RF may overfit on some noisy classification or regression problems
KNN | [31] | The training time complexity of KNN is lower than that of the support vector machine (SVM); compared with naive Bayes (NB), it makes no assumptions about the data, has high accuracy, and is insensitive to outliers | The amount of calculation is large; when the sample is unbalanced, the prediction accuracy of rare categories is low
NB | [16, 32] | NB performs well on small-scale data, and the algorithm is relatively simple | The posterior probability is determined by the prior and the data and is then used to determine the classification, so there is a certain error rate in the classification decision
LDA | [16, 26] | LDA works better when the sample classification information depends on the mean rather than the variance | LDA is not suitable for dimensionality reduction of samples from non-Gaussian distributions and may overfit
XGBoost | [31] | Regularization is added to the loss function to prevent overfitting; parallel computing makes the algorithm more efficient; memory optimization | The split gain of many leaf nodes at the same level is low, and it is unnecessary to perform further splits, which may bring unnecessary overhead

BP-NN.
(1) BP-NN. First, compute the weighted sum of the input data $x_1, x_2, x_3, \ldots, x_n$, and then pass this feedforward result as the argument $v$ of the layer's activation function $\varphi(v) = \mathrm{ReLU}(v)$; the output value can be expressed as follows:

$$\hat{y} = \varphi(v) = \mathrm{ReLU}\left(\sum_{i=1}^{n} w_i x_i\right). \tag{1}$$

Then, BP-NN continuously adjusts the weight parameter $w$ through feedforward and backpropagation to complete the learning process, until the output is consistent with the actual value of the training sample. The weight adjustment formula is

$$w_j^{(k+1)} = w_j^{(k)} + \beta\left(y_i - \hat{y}_i^{(k)}\right)x_{ij}, \tag{2}$$

where $w^{(k)}$ is the weight of the inputs after the $k$-th iteration, $x_{ij}$ is the value of the $j$-th attribute of sample $x_i$ in the training set, $y_i$ and $\hat{y}_i^{(k)}$ are the actual and predicted values for $x_i$, and the parameter $\beta$ is the learning rate. If the predicted value is the same as the actual value, the current weights are kept; if they differ, the weights are recalculated according to (2) and the parameters are modified. The BP-NN method consists of several perceptron layers. The output of each perceptron is transformed by the ReLU function. The input dimension k of the first layer is determined by the number of molecular descriptors selected in Section 3.2.1. The input and output dimensions of the hidden layer are set to a fixed value, and the output dimension of the last layer is set to 1. The structure diagram of the BP-NN prediction method is shown in Figure 3.
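As a small worked example, the forward pass and one weight adjustment for a single ReLU unit can be written as below. The sketch assumes that the weight adjustment rule has the form w_j ← w_j + β(y − ŷ)x_j, which is how the symbol definitions above are read here; there is no bias term, and all numbers and function names are illustrative only.

```python
def relu(v):
    return max(0.0, v)


def forward(weights, x):
    """Weighted sum of the inputs followed by ReLU."""
    return relu(sum(w * xi for w, xi in zip(weights, x)))


def update(weights, x, y_true, beta):
    """One weight adjustment: w_j <- w_j + beta * (y - y_hat) * x_j."""
    y_hat = forward(weights, x)
    return [w + beta * (y_true - y_hat) * xi for w, xi in zip(weights, x)]


w = [0.5, 0.25]   # initial weights (illustrative)
x = [2.0, 4.0]    # one training sample
target = 3.0      # its actual value

before = forward(w, x)                    # 0.5*2 + 0.25*4 = 2.0
w_new = update(w, x, target, beta=0.025)  # approximately [0.55, 0.35]
after = forward(w_new, x)                 # approximately 2.5, closer to 3.0
print(before, w_new, after)
```

A full BP-NN repeats this kind of correction for every layer by propagating the error backwards; the single-unit case only shows the direction and size of one update.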
(2) BP-NN Evaluation Index. The loss function of BP-NN is the $L_1$ loss, which is the average distance between the predicted value $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$ and the true value $y = (y_1, y_2, \ldots, y_n)$; it can be calculated as follows:

$$L_1(\hat{y}, y) = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|. \tag{3}$$

We use the mean square error (MSE) to measure the prediction accuracy of BP-NN. The MSE can be calculated as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \tag{4}$$

where $n$ is the number of test data, $\hat{y}_i$ is the predicted value, and $y_i$ is the true value.
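For concreteness, the two quantities can be computed as in the following sketch, which uses the standard definitions of the mean absolute (L1) error and the mean square error. The values are made up for illustration and are not taken from the data set.

```python
def l1_loss(y_pred, y_true):
    """Mean absolute difference between predictions and true values."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def mse(y_pred, y_true):
    """Mean squared difference between predictions and true values."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [6.0, 7.5, 5.0, 8.0]  # illustrative pIC50-like values
y_pred = [6.5, 7.0, 6.0, 8.0]

print(l1_loss(y_pred, y_true))  # (0.5 + 0.5 + 1.0 + 0.0) / 4 = 0.5
print(mse(y_pred, y_true))      # (0.25 + 0.25 + 1.0 + 0.0) / 4 = 0.375
```

The single large error (1.0) contributes half of the L1 loss but two thirds of the MSE, which is why the MSE is the more sensitive of the two to occasional large deviations.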

Prediction of ADMET Properties
XGBoost. The extreme gradient boosting algorithm (XGBoost) is an ensemble machine learning algorithm based on decision trees. It uses a gradient boosting framework, is suitable for classification and regression problems, and is used to solve supervised learning problems. Ensemble learning constructs multiple weak classifiers to predict the data set and then uses a certain strategy to integrate their predictions into the final prediction result [34]. XGBoost improves on the traditional gradient boosting decision tree (GBDT) algorithm in terms of computing speed, generalization performance, and scalability.

Computational Intelligence and Neuroscience
Compared with gradient boosting, XGBoost introduces regularization into the loss function to establish the objective function:

$$\mathrm{Obj}(\theta) = L(\theta) + \Omega(\theta), \tag{5}$$

where

$$L(\theta) = \sum_{j=1}^{n} l(\hat{y}_j, y_j), \qquad \Omega(\theta) = \gamma T + \frac{1}{2}\lambda\lVert w\rVert^2.$$

As shown in (5), the objective function consists of two parts, $L(\theta)$ and $\Omega(\theta)$, where $\theta$ represents the parameters learned from the given data. $L(\theta)$ is a differentiable convex loss function used to calculate the difference between the predicted result $\hat{y}_j$ and the target result $y_j$. Generally, there are two commonly used loss functions, namely the mean square loss function $l(\hat{y}_j, y_j) = (\hat{y}_j - y_j)^2$ and the logistic loss function $l(\hat{y}_j, y_j) = y_j \ln(1 + e^{-\hat{y}_j}) + (1 - y_j)\ln(1 + e^{\hat{y}_j})$.
This paper uses $l(\hat{y}_j, y_j) = (\hat{y}_j - y_j)^2$ as the loss function [35]. $\Omega(\theta)$ is a regularization term, which penalizes the complexity of the method (i.e., the regression tree) [35]. Here, $T$ is the number of leaves of the tree; $\gamma$ is the penalty coefficient applied to each leaf; $\lambda$ is the regularization parameter; $w$ is the vector of leaf scores; and $w_i$ is the score of the $i$-th leaf. Compared with the traditional GBDT algorithm, XGBoost uses $\frac{1}{2}\lambda\lVert w\rVert^2$, which further avoids overfitting and strengthens the generalization ability of the method [36]. Given a data set with $n$ samples and $M$ features, $D = \{(\mathbf{x}_j, y_j)\}$, where $\mathbf{x}_j$ $(j = 1, 2, \ldots, n)$ represents a sample and $y_j$ is the corresponding label, the output of the method is

$$\hat{y}_j = \sum_{k=1}^{K} f_k(\mathbf{x}_j),$$

that is, $\hat{y}_j$ is the sum of the outputs of $K$ weak classifiers, where $f_k$ represents the $k$-th weak classifier. In addition, because (5) takes functions as parameters, it cannot be optimized with traditional methods in Euclidean space, so XGBoost is trained additively: it accumulates the regression trees and adds a new tree to be optimized in each iteration [37]. Therefore, at the $t$-th iteration, the objective function is defined as follows:

$$\mathrm{Obj}^{(t)} = \sum_{j=1}^{n} l\left(\hat{y}_j^{(t-1)} + f_t(\mathbf{x}_j),\, y_j\right) + \Omega(f_t).$$

In addition, XGBoost supports parallelization. It enumerates candidate split points in parallel when selecting the best split point, which greatly improves the efficiency of the algorithm and makes it suitable for medicine prediction and screening.
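To make the two parts of the objective concrete, the following sketch evaluates the squared loss plus a regularization term for a single tree. It assumes the standard XGBoost form of the regularizer, γT + ½λ‖w‖², where γ penalizes each leaf; the function names and all numbers are illustrative and are not taken from the paper's experiments.

```python
def squared_loss(y_hat, y):
    return (y_hat - y) ** 2


def regularizer(leaf_scores, gamma, lam):
    """gamma * (number of leaves) + 0.5 * lam * (sum of squared leaf scores)."""
    return gamma * len(leaf_scores) + 0.5 * lam * sum(w * w for w in leaf_scores)


def objective(y_hat, y, leaf_scores, gamma, lam):
    """Training loss over all samples plus the complexity penalty of one tree."""
    loss = sum(squared_loss(p, t) for p, t in zip(y_hat, y))
    return loss + regularizer(leaf_scores, gamma, lam)


y = [1.0, 0.0, 1.0, 0.0]      # labels
y_hat = [0.5, 0.5, 0.5, 0.5]  # current predictions
leaves = [0.5, -0.5]          # scores of a tree with T = 2 leaves

obj = objective(y_hat, y, leaves, gamma=1.0, lam=2.0)
print(obj)  # loss 1.0 + penalty (1.0*2 + 0.5*2.0*0.5) = 3.5
```

A tree with more leaves or larger leaf scores raises the penalty, so a split is only worthwhile if it lowers the loss by more than the penalty it adds.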

XGBoost Evaluation Index.
This paper uses prediction accuracy and the receiver operating characteristic (ROC) as evaluation indexes for XGBoost. The ROC is presented as the ROC curve.
For a binary classification problem, each instance is assigned to a positive class or a negative class, which gives four cases. If an instance is a positive class and is also predicted as a positive class, it is a true positive. If an instance is a negative class and is predicted as a positive class, it is a false positive. Correspondingly, if the instance is a negative class and is predicted as a negative class, it is a true negative, and if a positive class is predicted as a negative class, it is a false negative [38]. The contingency table of these four cases is shown in Table 2.
From the contingency table, the true positive rate (TPR) and the false positive rate (FPR) are introduced:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \tag{9}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN}. \tag{10}$$

The horizontal axis of the ROC curve is FPR, and the vertical axis is TPR. The ROC curve allows intermediate states: the test results can be divided into multiple ordered categories, and statistical analysis can then be performed. Therefore, the ROC curve evaluation method is widely used in bioinformatics. We introduce the area under the ROC curve (AUC) to characterize the performance of the classifier. The AUC value is equal to the area enclosed by the ROC curve and the horizontal axis, and usually lies between 0.5 and 1. The larger the AUC value, the better the comprehensive prediction performance.
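The sketch below computes TPR and FPR from the four cases of the contingency table, using the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN), and computes the AUC through its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. This pairwise form gives the same value as the area under the ROC curve. The labels, scores, and the 0.5 decision threshold are illustrative assumptions.

```python
def rates(y_true, y_pred):
    """Return (TPR, FPR) computed from the contingency-table counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]          # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5

tpr, fpr = rates(y_true, y_pred)
area = auc(y_true, scores)
print(tpr, fpr, area)  # 2/3, 1/3, 8/9
```

Moving the threshold changes the (FPR, TPR) point but not the AUC, which is why the AUC is used to compare classifiers independently of a particular threshold.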

BVAP-Pharmaceutical Product Candidates Screening.
Based on the predicted bioactivity value and ADMET properties in Sections 3.2 and 3.3, we define the BVAP method to evaluate drug candidates. It is a weighted scoring method: users or experimenters can adjust the weights of the bioactivity value and the ADMET properties according to actual discovery needs. BVAP is calculated as

$$\mathrm{BVAP} = \alpha\,\mathrm{BV} + \beta\,\mathrm{AP}, \tag{12}$$

where $\alpha$ is the weight of BV (i.e., bioactivity value) and $\beta$ is the weight of AP (i.e., ADMET properties). As mentioned in Section 1, the ADMET properties include five kinds of properties, so the above calculation can be detailed as follows:

$$\mathrm{BVAP} = \alpha\,\mathrm{BV} + \sum_{i=1}^{5}\beta_i\,\mathrm{AP}_i, \tag{13}$$

where $\alpha$ is the weight of BV and $\beta_i$, $i = 1, 2, 3, 4, 5$, is the weight of the $i$-th of the five kinds of AP. We can adjust the weights $\alpha$ and $\beta_i$ to evaluate the drug candidates and then sort their BVAP values to get the best drug candidate.
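A minimal sketch of BVAP scoring and ranking is given below. The compound names, bioactivity values, labels, and weights are hypothetical. In particular, the negative weights for hERG and MN are an assumption made here because a label of 1 denotes toxicity for those two properties; the text leaves the choice of weights to the user.

```python
def bvap(bv, ap, alpha, betas):
    """Weighted score: alpha * BV + sum of beta_i * AP_i."""
    return alpha * bv + sum(b * a for b, a in zip(betas, ap))


# AP order assumed here: Caco-2, CYP3A4, hERG, HOB, MN.
candidates = {
    "cmpd_A": (8.0, [1, 1, 0, 1, 0]),
    "cmpd_B": (9.0, [0, 1, 1, 0, 1]),
    "cmpd_C": (7.0, [1, 1, 0, 0, 0]),
}
alpha = 1.0
betas = [1.0, 1.0, -1.0, 1.0, -1.0]  # negative for the two toxicity labels

scores = {name: bvap(bv, ap, alpha, betas)
          for name, (bv, ap) in candidates.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)   # {'cmpd_A': 11.0, 'cmpd_B': 8.0, 'cmpd_C': 9.0}
print(ranking)  # ['cmpd_A', 'cmpd_C', 'cmpd_B']
```

With these weights, cmpd_B ranks last despite having the highest bioactivity value, because its predicted toxicity labels and poor absorption outweigh the gain in BV. A larger α would shift the ranking back toward bioactivity.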

Data Set Preprocessing.
The data set of compounds that antagonize the activity of ERα comes from Question D of the 18th China Postgraduate Mathematical Contest in Modeling. It contains 729 molecular descriptors of 1,974 compounds, as well as the bioactivity values and ADMET properties of the compounds. The 1,974 compounds are drug candidates that may act against breast cancer by antagonizing ERα. The 729 molecular descriptors are a series of parameters used to describe the structural and property characteristics of the compounds, including physicochemical properties (such as molecular weight and LogP), topological characteristics (such as the number of hydrogen bond donors and the number of hydrogen bond acceptors), and so on. The bioactivity value against ERα is usually expressed by pIC50, which is an experimental value. The larger the pIC50, the higher the bioactivity. ADMET properties refer to the pharmacokinetic properties and safety of the compound in vivo. The related description of the data set is shown in Table 3.
Table 3 has four columns: the first column is the molecular structure of the compound, the second column is the 729 molecular descriptors, and the third and last columns are the bioactivity value and the ADMET properties, respectively. The molecular descriptors are the features, and the bioactivity value and ADMET properties are the target values to be predicted.
To facilitate modeling and prediction, this paper considers only the five ADMET properties of the compounds in the data set, namely: (1) intestinal epithelial cell permeability (Caco-2), which measures the ability of the compound to be absorbed by the human body; (2) the cytochrome P450 (CYP) 3A4 subtype (CYP3A4), the main metabolic enzyme in the human body, which measures the metabolic stability of the compound; (3) the human ether-a-go-go-related gene (hERG), which is used to evaluate cardiac safety and measures the cardiotoxicity of the compound; (4) human oral bioavailability (HOB), which measures the proportion of the drug that is absorbed into the blood circulation after entering the human body; and (5) the micronucleus test (MN), which detects whether the compound is genotoxic [11,38]. The five ADMET properties are labeled as follows. Caco-2: "1" means that the compound has good small intestinal epithelial cell permeability, and "0" means that it has poor permeability. CYP3A4: "1" means that the compound can be metabolized by CYP3A4, and "0" means that it cannot. hERG: "1" means that the compound is cardiotoxic, and "0" means that it is not. HOB: "1" means that the oral bioavailability of the compound is good, and "0" means that it is poor. MN: "1" means that the compound is genotoxic, and "0" means that it is not.
We first explore the data and find that none of the molecular descriptors in the data set have null values, so there is no need to handle missing values. Then we preprocess the 729 molecular descriptors. Through multiple experiments and validations, we set the relative variance threshold σ = 0.05 and the correlation coefficient threshold C = 0.95. After preprocessing step (1) from Section 3.1, 361 molecular descriptors remain. The correlation heat map is shown in Figure 4.
After data preprocessing (2) in Section 3.1, the number of molecular descriptors is reduced to 123.

Validation for Fusion Regression.
In this part, we first split the data: 70% is used for training, 10% for validation to select suitable parameters, and 20% for testing. The evaluation index MSE is computed on the test data.
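The 70/10/20 split can be sketched as a random partition of the sample indices. The seed, the rounding of the subset sizes, and the function name are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random


def split_indices(n, train=0.7, val=0.1, seed=0):
    """Shuffle range(n) and cut it into train / validation / test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 1,974 compounds, as in the data set described in this paper.
train_idx, val_idx, test_idx = split_indices(1974)
print(len(train_idx), len(val_idx), len(test_idx))  # 1381 197 396
```

The three index lists are disjoint and together cover every compound, so no test sample is seen during training or hyperparameter selection.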

LGBM for Feature Extraction.
After the preprocessing in Section 4.1, further feature extraction and dimensionality reduction are performed to screen out the features related to the bioactivity value, which makes the BP-NN easier to train and more accurate.

BP-NN.
Based on the top 20 molecular descriptors in Section 4.2.1, we use the BP-NN to predict the bioactivity value.
Through training and validation, we choose the best parameters for BP-NN. The BP-NN method consists of 1 input layer, 2 fully connected layers, and 1 output layer. The output of each perceptron is transformed by the ReLU function. The input dimension k of the first layer is set to 20, corresponding to the top 20 molecular descriptors. The input and output dimensions of the hidden layer are set to 300, and the output dimension of the last layer is set to 1. Among the hyperparameters, when the batch size is set to 2, 4, 8, and 16, the changes in the loss value are as shown in Figure 6. It can be seen from Figure 6 that when the batch size is set to 2, the method converges to a lower loss after 30 rounds of training.
The other hyperparameters are filtered in a similar way, and the set of hyperparameters with the best final training effect is shown in Table 4.
The BP-NN is trained with the obtained hyperparameters. The fusion regression then combines LGBM feature extraction with BP-NN, and the MSE is obtained on the test data.
The MSE is 1.1496, which indicates that the fusion regression provides an effective method for the quantitative prediction of the bioactivity value.

Validation for XGBoost.
In this part, we first split the data: 70% is used for training, 10% for validation to select suitable parameters, and 20% for testing. The prediction accuracy and the ROC curve are both computed on the test data.

XGBoost.
This paper uses XGBoost to predict the ADMET properties based on the preprocessed data in Section 4.1. The prediction accuracy of ADMET properties is shown in Table 5.

Discussion.
For comparison, this paper also uses five other methods mentioned in the related work, namely SVM, RF, LDA, K-nearest neighbor (KNN), and naive Bayes (NB), to predict ADMET properties. Their prediction accuracies for the ADMET properties are compared with those of XGBoost in Table 6 and Figure 7.
It can be seen from Figure 7 that the prediction accuracy of the XGBoost used in this paper for the ADMET properties (Caco-2: 94.0%, CYP3A4: 95.7%, hERG: 89.4%, HOB: 88.6%, and MN: 96.2%) is higher than that of the other five methods: SVM, RF, KNN, LDA, and NB. We also use the ROC curve to compare the comprehensive performance of the six prediction methods. The comparison results are shown in Figure 8.
It can be seen from Figure 8 that the AUC values of the XGBoost method used in this paper are Caco-2: 0.933, CYP3A4: 0.954, hERG: 0.891, HOB: 0.839, and MN: 0.939, which are larger than the AUC values of the other five methods. XGBoost therefore has higher prediction accuracy and better overall performance.

Conclusions
To address drug discovery failures caused by low bioactivity and poor ADMET properties at an early stage of drug development, this paper proposes a screening model for pharmaceutical product candidates with better bioactivity values and ADMET properties and validates it on a data set of compounds that antagonize the activity of ERα. First, the data set is preprocessed for initial feature extraction; then the fusion regression is used to predict the bioactivity value, with LGBM for further feature extraction and the BP-NN method for bioactivity value prediction, and the MSE of the fusion regression is 1.1496. XGBoost is then used to predict the ADMET properties, and its prediction accuracies are as follows: Caco-2: 94.0%, CYP3A4: 95.7%, hERG: 89.4%, HOB: 88.6%, and MN: 96.2%. In summary, we improve the prediction accuracy of ADMET properties compared with other methods, which benefits the prediction and screening model for drug candidates. The prediction and screening model proposed in this paper has better comprehensive performance and can also provide prediction services for the bioactivity value and ADMET properties.
However, this paper also has some shortcomings. (1) The model proposed in this paper improves the prediction accuracy of ADMET properties but extracts only the important features, and some properties are not considered. The next step is to continue expanding the model.
(2) The value or range of each molecular descriptor needs to be further verified after the model is expanded, after which a recommended reference value or range can be given. Future work will address these shortcomings.
Data Availability
The data set comes from Question D of the 18th China Postgraduate Mathematical Contest in Modeling.

Disclosure
Jiaju Wu and Linggang Kong are the co-first authors.

Conflicts of Interest
The authors declare that there are no conflicts of interest.