Optimal Modeling of Anti-Breast Cancer Candidate Drugs Based on Graph Model Feature Selection

Breast cancer is one of the most widespread and fatal cancers in women. At present, anticancer drugs that inhibit the estrogen receptor α subtype (ERα) can greatly improve the cure rate for breast cancer patients, so the research and development of such drugs is urgent. In this paper, the problem of screening effective anticancer drugs is abstracted as an optimization problem. First, a graph model is used to extract low-dimensional features with strong distinguishing and descriptive ability from the many attributes of candidate compounds, and kernel functions are then used to map these features to a high-dimensional space. Next, a quantitative analysis model of ERα biological activity and a support vector machine classification model based on ADMET properties are constructed. Finally, sequential least square programming (SLSQP) is used to solve the ERα biological activity model. The experimental results show that, on the anticancer data set, the graph model constructed in this paper reduces the error relative to principal component analysis (PCA) by 6.4%, 15%, and 7.8% in mean absolute error (MAE), mean squared error (MSE), and root mean square error (RMSE), respectively. In classification prediction, the recall and precision of this method are 19.5% and 12.41% higher than those of PCA, respectively. An optimal biological activity value (IC50_nM) of 34.6 and a corresponding inhibitory biological activity value (pIC50) of 7.46 were obtained.


Introduction
Worldwide, cancer accounts for a growing share of the factors that harm human health year by year. Breast cancer is the most common cancer, with a mortality rate of 11.5% to 28.4% [1]. In studying the treatment of breast cancer, some scholars have found that the estrogen receptor α subtype (ERα) can serve as a key target for effective treatment, and compounds that antagonize the activity of ERα can be used as candidate drugs for the treatment of breast cancer.
There are many kinds of anticancer compounds, and extracting low-dimensional features with strong descriptive ability from the various properties of the compounds can greatly improve the efficiency of screening anticancer drug candidates. Many scholars have treated selection problems of this kind as feature extraction. For example, Tassenberg et al. [2] used the automatic feature extraction algorithm DenMap to detect single-crystal dendrite cores quickly, accurately, and repeatably, realizing automated feature extraction. Xue et al. [3] adopted an analytic hierarchy process entropy (HDE) feature extraction method based on the analytic hierarchy process to diagnose faults of rolling bearings effectively; the method eliminates redundant information between features and retains the fault-related information. Yang et al. [4] used nonlinear simulation feature extraction to complete speech detection and keyword location tasks in an inference sensor system with low power consumption. Xue et al. [5] proposed a feature extraction method based on an asymmetric probability distribution function to reconstruct the distillation curve in the industrial refining process, which benefits the modeling and optimization of oil refining. Liu et al. [6] used a feature extraction method to classify the biogenetic mechanisms of circular RNA, which confirmed the view that multiple biogenetic mechanisms coexist in different subsets of human circRNA. Zhu et al. [7] proposed a lightweight single image super-resolution network with an expectation-maximization attention mechanism (EMASRN) to extract feature maps of different sizes. Their experimental results demonstrate the superiority of EMASRN over state-of-the-art lightweight SISR methods in both quantitative metrics and visual quality.
On the other hand, to decide whether an anticancer compound can be selected as a drug candidate, the following five characteristics also need to be considered: (1) intestinal epithelial cell permeability, (2) cytochrome P450 enzyme, (3) cardiac safety evaluation of compounds, (4) human oral bioavailability, and (5) micronucleus test. These five characteristics are often referred to as ADMET properties [8][9][10].
To evaluate the screening of the anticancer compounds described above, the problem is abstracted as the classification and prediction of anticancer drugs based on ADMET properties. For example, Guo et al. [11] proposed a novel Relation Separation Network (RSNet), aiming to boost few-shot learning by improving similar-class recognition performance. Compared to PT+MAP, RSNet improves the accuracy of classification on the CUB data set by approximately 5% and that of similar-class classification by more than 10%. Wang et al. [12] used a scalable window waveform sampling method (SWWS) based on the classification pattern to classify the workload requested by all users and then reasonably predict the usage of user cloud resources to minimize the cost of use. Wang et al. [13] combined missing value analysis with the likelihood ratio test, introduced a weighted decay random forest model, realized ICU readmission classification based on sparse data, and greatly reduced patients' expenses. Zhang et al. [14] adopted a recursive partition classification method, established a classification prediction model of 58 between derivative inhibitors and du's amastigotes, and determined its molecular target and molecular mechanism. Steckenrider and Furukawa [15] adopted a highly random road crack perception network detection and classification method based on a probability formula, which allows features to be extracted from crack images and retains the uncertainty in the detection. Chen et al. [16] adopted a classification diagnosis method combining FTIR near-infrared spectroscopy (NIRS) and the support vector machine to differentiate malignant pleural effusion (MPE) from benign pleural effusion (BPE). Barth et al. [17] combined principal component analysis (PCA) with K-nearest neighbor (KNN) to address the problem of highly correlated variables in wine classification. Lamge et al. [18] adopted a skin disease detection and classification method based on the combination of image processing technology and a neural network to classify and evaluate patients' skin lesion images. Schultz et al. [19] used a recurrent neural network and a convolutional neural network to classify airport performance on the basis of weather data. This method quantifies the correlation between airport performance decline and weather severity, and the prediction accuracy for aircraft take-off can exceed 90%.
At present, the research and development of anticancer compounds in the medical field can be roughly divided into three steps. First, the properties of the compounds are examined; then, the activity model of the compound against cancer cells and the classification model of ADMET properties are constructed; finally, the characteristic values of the antagonistic activity of the compound against cancer cells are obtained by solving the model. Therefore, in this paper, the research and development of anti-breast cancer drugs is abstracted as an optimization problem, and an optimization method based on graph model feature extraction is constructed. The kernel function is used to map the features to a higher-dimensional space to construct a nonlinear quantitative prediction function of biological activity, and a classification prediction model of the ADMET properties of anticancer drugs based on the support vector machine (SVM) is then constructed as a constraint. Sequential least square programming (SLSQP) is used to solve efficiently for the characteristic variable values, the optimal biological activity value (IC50_nM), and the inhibitory biological activity value (pIC50). Specifically, we study the following four questions:
(i) Question 1: For a wide variety of compounds, can the graph model method designed in this paper extract low-dimensional features with stronger descriptive ability from many attributes than principal component analysis?
(ii) Question 2: Can the quantitative prediction model constructed by regression method accurately predict the biological activity of ERα?
(iii) Question 3: Compared with principal component analysis, can support vector machine predict ADMET properties more accurately in a shorter time?
(iv) Question 4: Can the improved sequential least square programming (SLSQP) solve the model faster and more accurately than traditional intelligent optimization algorithms?
By answering the above questions, we can efficiently and intelligently predict whether a compound can become a candidate for breast cancer treatment and assist doctors in accurately selecting effective anti-breast cancer drugs for breast cancer patients. The main contributions are as follows:
(1) A graph model method is proposed to extract low-dimensional features with strong descriptive ability and eliminate redundant features
(2) Through the graph model-SVM classification prediction method, five classification models for the ADMET properties are constructed to test whether the candidate drugs are suitable for patients
(3) The bioactivity function of the anticancer drug against ERα is constructed, and the candidate anticancer drug with optimal ERα bioactivity is obtained by the SLSQP algorithm as the best anticancer drug

Overview of Methods
The purpose of this paper is to select candidate drugs with superior efficacy from many anti-breast cancer compounds. First, low-dimensional features with strong descriptive ability are selected from the many attributes of the compound. Then, the activity of the compound against cancer cells is measured to determine whether the compound can be used as a candidate drug. The methodology framework and technical details are described below.

General Framework.
The overall framework of this approach is shown in Figure 1. The steps for screening anticancer drugs are as follows:
(i) Step 1: the graph model method is used to extract low-dimensional features with strong distinguishing ability from anti-breast cancer drug candidates
(ii) Step 2: the kernel function is used to map the extracted features to a high-dimensional space to construct a quantitative analysis model of ERα biological activity
(iii) Step 3: because anti-breast cancer drugs also need to satisfy the ADMET properties, SVM is used to build a classification prediction model
(iv) Step 4: taking the ERα biological activity function in Step 2 as the objective function and the classification prediction model in Step 3 as the constraint, sequential least square programming (SLSQP), an improved least squares method, is used to solve for the optimal ERα biological activity value

Graph Model-Minimum Spanning Tree (MST).
In view of the large number of attributes of anticancer drugs, we treat the attributes of compounds as nodes in a graph and the correlation between attributes as the distance between nodes, from which the adjacency matrix between attributes is established. Finally, a minimum spanning tree is generated over all nodes to extract low-dimensional features with strong descriptive ability. Constructing the adjacency matrix requires an appropriate threshold, so the crux of the problem is how to select this threshold.
Figure 1: The overall framework.
Computational and Mathematical Methods in Medicine
First, the $N$ characteristic attributes of the anticancer drugs are collected in a data matrix whose entry $T^{O}_{m,n}$ is the value of the $n$th feature in the $m$th sample, where $N = 729$ and $M$ is the number of samples. A correlation coefficient matrix is then computed for every pair of features $T_{n_1}$ and $T_{n_2}$ (the $n_1$th and $n_2$th features), and the similarity degree $D$ between features is defined from it. Next, the Kruskal algorithm is used to generate the minimum spanning tree from the distance matrix, and $H$ ($H < N$) important features (nodes) are selected according to the correlation coefficients between nodes. If these nodes are connected, the similarity between them is calculated, and the maximum distance value is selected as the threshold $DNI_{\min}$ of the adjacency matrix.
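The procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance between two features is assumed to be 1 − |r|, with r the Pearson correlation (the paper's exact definition of D is not available here), the tree is built with Kruskal's algorithm, and features are ranked by their degree in the tree. The function names `pearson`, `kruskal_mst`, and `rank_by_degree` and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)


def kruskal_mst(n_nodes, edges):
    """Kruskal's algorithm; edges are (weight, u, v). Returns the tree edges."""
    parent = list(range(n_nodes))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so no cycle is created
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def rank_by_degree(columns):
    """Rank feature indices by their degree in the MST, highest first."""
    n = len(columns)
    # Assumed distance: 1 - |r|, so strongly correlated features are close.
    edges = [(1 - abs(pearson(columns[i], columns[j])), i, j)
             for i, j in combinations(range(n), 2)]
    degree = [0] * n
    for _, u, v in kruskal_mst(n, edges):
        degree[u] += 1
        degree[v] += 1
    return sorted(range(n), key=lambda i: -degree[i]), degree


# Toy data: four features (one list per feature) observed on five samples.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.1, 2.3, 2.9, 4.2, 5.1],
    [5.0, 3.9, 3.1, 2.2, 0.8],
    [2.0, 1.0, 4.0, 3.0, 5.0],
]
order, degree = rank_by_degree(features)
```

Taking the first H entries of `order` would give the H highest-degree features; the threshold-based adjacency matrix step described in the text is not modeled in this sketch.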

Support Vector Machine (SVM).
The ADMET properties of anti-breast cancer compounds determine whether they can be used as candidate drugs, and judging whether each property is good or bad can be regarded as a multi-attribute binary classification problem.
In order to meet the requirements of Question 3, this paper takes N compound attributes of anticancer drugs as independent variables and ADMET properties as dependent variables and then constructs five classification prediction models, which are compared with the principal component analysis.
The attribute data of anticancer compounds are high-dimensional and contain a lot of redundancy. If a classification algorithm is applied to them directly, it is difficult to get satisfactory results in a short time. Therefore, this paper combines the SVM algorithm with the graph model for classification and prediction. The model framework is shown in Figure 2. The ideas are as follows:
(i) For the high-dimensional attribute data of compounds, the graph model is used to extract low-dimensional features with strong descriptive ability and to reduce redundant information
(ii) SVM is a good binary classifier, and satisfactory results can be obtained with fewer samples

Kernel Function and Sequential Least Square Programming (SLSQP).
In this paper, the low-dimensional features of compounds with strong descriptive ability are screened out by the graph model, and these features are mapped to a high-dimensional space by a kernel function to further strengthen their discriminating ability. Then, the best nonlinear ERα bioactivity model is fitted by the least squares method. Because the relationship between features is nonlinear and the ordinary least squares method cannot fit a nonlinear relationship effectively, we extend the least squares algorithm. For solving nonlinear programming problems, it matters a great deal whether the objective function and the constraints are continuous and smooth. If they are smooth, they are differentiable with respect to all decision variables, and the vector of partial derivatives of a multivariate function gives the gradient, the direction in which the function increases fastest. As the introduction of the ADMET properties increases the complexity of the problem, we add the Lagrange multiplier method to the least squares algorithm and transform the constrained optimization problem into an unconstrained problem by introducing additional variables.
nHCsatu      190   23  167    MDEN-11    142   33  109
nG12Ring     156    0  156    MLFER_BH   178   69  109
SHCsatu      167   16  151    maxdNH     140   31  109
nHBint3      166   29  137    mindNH     139   31  108
ETA_Shape_Y   70  207  137    minHdNH    138   31  107
maxHBd        38  167  129    maxaasC     73  179  106
SsNH2        158   29  129    nBondsD2   211  106  105
nsNH2        158

For this reason, we construct the sequential least square programming (SLSQP) optimization algorithm. SLSQP efficiently preserves the nonlinear relationship between features. When solving for the parameters, this method can take constraints into account together with the objective function, which meets the need of considering the ADMET properties of anticancer compounds in Question 3. The basic description of the programming problem is given in equation (4), where $F(x)$ is the objective function, $C_j(x) = 0$ is an equality constraint, $C_j(x) \ge 0$ is an inequality constraint, and $XL$ and $XU$ are the lower and upper bounds of the variable $x$. The solution process of the algorithm is as follows:
(i) Step 1: given the initial point $x_0$ and the convergence accuracy $\epsilon$, set $k = 0$
(ii) Step 2: the Lagrangian function formed from $F(x)$ and the constraints is expanded in a Taylor series at the current point $x_k$ (initially $x_0$), and the current optimal solution $s_k$ is calculated
(iii) Step 3: $s_k$ is taken as the search direction, and the next iteration point $x_{k+1}$ is obtained by a one-dimensional search of $F(x)$ subject to the constraints
(iv) Step 4: if $x_{k+1}$ satisfies the termination criterion for the given accuracy, $x_{k+1}$ is taken as the optimal solution and $F(x_{k+1})$ as the optimal value of the objective function, and the calculation terminates; otherwise, set $k = k + 1$ and return to Step 2
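For orientation, the problem form and the per-iteration subproblem referred to in the steps above are usually written as follows. This is the textbook formulation of sequential quadratic programming, given as a sketch consistent with the symbols $F$, $C_j$, $XL$, and $XU$ used in the text; the index split $m_e$ (number of equality constraints), the total number of constraints $m$, the multipliers $\lambda_j$, the Hessian approximation $B_k$, and the step length $\alpha_k$ are notation assumed here and may differ from the paper's equation (4).

```latex
% Nonlinear program (assumed standard form)
\begin{aligned}
&\min_{x}\ F(x) \\
&\text{s.t.}\quad C_j(x) = 0, \quad j = 1, \dots, m_e, \\
&\phantom{\text{s.t.}\quad} C_j(x) \ge 0, \quad j = m_e + 1, \dots, m, \\
&\phantom{\text{s.t.}\quad} XL \le x \le XU .
\end{aligned}

% Lagrangian function
L(x, \lambda) = F(x) - \sum_{j=1}^{m} \lambda_j C_j(x)

% Quadratic subproblem at the current point x_k; B_k approximates the
% Hessian of L with respect to x, and s_k denotes its solution
\begin{aligned}
&\min_{s}\ \tfrac{1}{2}\, s^{\top} B_k\, s + \nabla F(x_k)^{\top} s \\
&\text{s.t.}\quad \nabla C_j(x_k)^{\top} s + C_j(x_k) = 0, \quad j = 1, \dots, m_e, \\
&\phantom{\text{s.t.}\quad} \nabla C_j(x_k)^{\top} s + C_j(x_k) \ge 0, \quad j = m_e + 1, \dots, m .
\end{aligned}

% Line-search update
x_{k+1} = x_k + \alpha_k s_k
```

Step 2 corresponds to solving the quadratic subproblem for $s_k$, and Step 3 to choosing $\alpha_k$ by a one-dimensional search.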

Experimental Results and Analysis
In this section, we introduce the experimental environment, the source of the data set, and the specific experimental results. For each of the four research questions, we have carried out comparative experiments and analysis. The program is implemented in Python 3.6 and runs on a computer with a 2.40 GHz CPU and 8 GB of memory.
The data set used throughout this paper is the Problem D data set provided by the 18th Huawei Cup Mathematical Modeling Competition. It contains data on 729 attributes for a large number of anti-breast cancer compounds, together with the corresponding ADMET property data. Table 1. If the value between two nodes is greater than the threshold $DNI_{\min}$, the two nodes are connected, and the corresponding entry of the adjacency matrix is 1; otherwise, the entry is 0. Finally, the minimum spanning tree of all nodes is obtained, and the degrees of all nodes are calculated and arranged in descending order. The degree value is used as the weight of the feature.
First, the integrated data are standardized; then, the graph model is used to compute the weight coefficients of the 729 feature components; finally, the top 15 feature components are selected according to their weights, as shown in Table 2.
To make a quantitative comparison with the PCA algorithm, MAE (mean absolute error), MSE (mean squared error), and RMSE (root mean square error) are selected to evaluate the effect of feature extraction. MAE is the average of the absolute errors and reflects the actual size of the prediction error. MSE is the expected value of the squared difference between the estimated value and the true value. RMSE is the arithmetic square root of MSE. All three measure the deviation between predictions and observations; the smaller their values, the more accurately the prediction model describes the experimental data. The calculation formula of each index is as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}},$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.
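As an illustrative sketch (not the authors' code), the three error measures can be computed directly from their definitions. The function names and the sample values below are made up for the example.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (true - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return sqrt(mse(y_true, y_pred))


# Made-up activity values, for illustration only.
y_true = [6.0, 7.0, 8.0, 5.0]
y_pred = [6.5, 7.0, 7.0, 5.5]

print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Because MSE squares each error, it penalizes large deviations more heavily than MAE does, which is why the relative improvements reported for the two measures need not be equal.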

The comparison of the three indicators between the proposed algorithm and PCA on the data set of anti-breast cancer compounds is shown in Table 3.
It can be seen in Table 3 that every error index of the graph model is smaller than that of principal component analysis. The error of the graph model is 6.4% lower than that of PCA in MAE, 15% lower in MSE, and 7.8% lower in RMSE. This shows the superiority of the graph model in extracting essential feature indexes and provides good characteristic variables for constructing the quantitative analysis model of the biological activity of compounds against ERα.

Question 2: Quantitative Prediction of the Biological Activity of Anticancer Substances against ERα.
In this paper, the kernel function is used to map the low-dimensional features extracted in Question 1 to a high-dimensional space, and a nonlinear function is fitted on the mapped features. The fitting effect of the model is best when the degree of the mapping is 2 (for example, $x_i^2$); that is, 135 new features are constructed by the kernel function from the 15 feature variables. The model is compared with AdaBoost regression and Lasso regression. The fitting effect diagram and the model evaluation comparison table are shown in Figure 3 and Table 4.
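The count of 135 features is what a degree-2 polynomial expansion of 15 variables yields when the constant term is left out: 15 linear terms, 15 squares, and 105 pairwise products. The sketch below checks this arithmetic; it assumes that this is the mapping meant in the text, and the function name `poly2_features` and the placeholder input values are hypothetical.

```python
from itertools import combinations_with_replacement


def poly2_features(x):
    """Degree-1 and degree-2 monomials of x, without a constant term."""
    linear = list(x)
    # All products x_i * x_j with i <= j: squares and pairwise cross terms.
    quadratic = [a * b for a, b in combinations_with_replacement(x, 2)]
    return linear + quadratic


x = [float(i + 1) for i in range(15)]  # placeholder values for 15 features
expanded = poly2_features(x)
print(len(expanded))  # 15 linear + 120 quadratic terms = 135
```

A linear least squares fit on the expanded features is then a nonlinear (quadratic) model in the original 15 variables.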
As can be seen from Figure 3, the nonlinear function fitted after the kernel function maps the features to a high-dimensional space fits the test set well, and most of the predicted values are consistent with the test set data. The fitting degree score of the function is 0.6231.
As can be seen from Table 4, compared with AdaBoost and Lasso, the sequential least square programming model constructed in this paper reduces the error by 22.4% and 32.2% in MAE, 23.8% and 44.3% in MSE, and 12.7% and 25.4% in RMSE, and increases the fitting degree by 19.4% and 48.0%, respectively. This shows that the nonlinear ERα-inhibition bioactivity optimization model constructed by sequential least square programming has a good fitting effect and a small error.
Therefore, this paper constructs the objective function in equation (6), where $x_i$ is the characteristic variable, $k_i$ is the regression coefficient, and $b$ is the intercept of the function. The polynomial coefficients $c_{k_m}$ ($m = 1, 2, \dots, 135$) are shown in part in Table 5.
As can be seen from Table 5, the regression coefficients of these 135 features can be divided into three types: greater than 0, less than 0, and equal to 0. From the mathematical point of view, there is an inflection point in the model; that is, there is a local optimal solution. Features with a regression coefficient greater than 0 are positively correlated with the inhibition of cancer cell activity, features with a regression coefficient less than 0 are negatively correlated with it, and features with a regression coefficient equal to 0 have no effect on it.

Question 3: Classification Prediction Results of ADMET Properties of Anticancer Substances Based on Support Vector Machine (SVM).
Considering the high-dimensional attributes of anticancer compounds, PCA and the graph model are first used to extract features with strong distinguishing ability, and the SVM algorithm is then used to classify on these features. The resulting classification prediction model for compound ADMET properties, MST-SVM, is compared with the PCA-SVM model. The operation flow chart of Question 3 is shown in Figure 4.
In all the classification effect maps, blue dots and orange dots represent 0 and 1, respectively, that is, the ADMET property labels of the compound. From the classification effect maps of the five property classification models, we can see that the classification effect is obvious and that the positive and negative samples are well separated. This shows that the MST-SVM classification prediction model for the ADMET properties of compounds has a good classification effect. To make a more accurate quantitative analysis, we further introduce four indicators, namely, accuracy, precision, recall, and F1-score, to evaluate the classification effect, as shown in Table 6.
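The four indicators follow from the counts of true and false positives and negatives. The following sketch computes them for binary labels; it uses the standard definitions and made-up labels, and the function name `classification_scores` is hypothetical rather than taken from the paper.

```python
def classification_scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for one ADMET property, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
scores = classification_scores(y_true, y_pred)
```

High recall means few positive samples are missed (few false negatives), while high precision means few of the samples predicted as positive are wrong (few false positives).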
As shown in Table 6, the classification prediction models for the Caco-2, CYP3A4, and MN properties reach recall values of 0.8580, 0.9379, and 1.0000, respectively. This shows that these models produce few false negatives and identify positive samples well. The classification prediction models for the CYP3A4 and hERG properties reach precision values of 0.8947 and 0.8643, respectively. This shows that few of the samples predicted as positive are false positives. At the same time, all five classification models have high accuracy scores. This shows that the graph model-support vector machine classification prediction model proposed in this paper can accurately judge whether the candidate drugs conform to the ADMET properties.
On the basis of the above, the classification results for the ADMET properties of compounds obtained by this method and by PCA-SVM are compared in Table 7. It can be seen from Table 7 that the PCA-SVM classification prediction model scores better than the MST-SVM classification prediction model in accuracy and F1-score. However, the MST-SVM model is 19.5% and 12.41% higher in recall and precision, respectively, indicating that its classification effect is stable. Therefore, based on the above analysis, this paper chooses the MST-SVM classification prediction model, which is more stable.

Question 4: Sequential Least Square Programming (SLSQP) Is Used to Solve the Quantitative Prediction Model of the Bioactivity of Anticancer Substances against ERα.
In this paper, the compound is required to optimize the inhibition of ERα biological activity (the pIC50 value, the higher the better) under the premise of satisfying the ADMET properties (at least three properties), and the corresponding characteristic variables are obtained. Equation (6) is taken as the objective function, and the 15 important characteristic variables are numerically constrained. By solving equations (6) and (7), the optimal pIC50 value is obtained. Then, the ADMET properties of the compound are tested; that is, the SVM classification prediction model is used to make a classification prediction from the 15 variables $x_i$ obtained in this problem. If the variables satisfy at least three ADMET properties, the values of the variables $x_i$ and pIC50 are output directly as the final optimization scheme; if not, the new variable $\bar{x}_i$ is put into the constraint and judged again.
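The acceptance test in this procedure amounts to counting how many of the five predicted ADMET labels are favourable. The sketch below shows only that check, under loudly stated assumptions: the property names, the `FAVOURABLE` mapping of which label counts as good, and the function names are hypothetical placeholders for illustration, and the predictions would in practice come from the five SVM classifiers.

```python
# Hypothetical mapping from property name to the label counted as favourable.
# These labels are an assumption for illustration, not taken from the paper.
FAVOURABLE = {"Caco-2": 1, "CYP3A4": 1, "hERG": 0, "HOB": 1, "MN": 0}


def count_satisfied(predictions, favourable=FAVOURABLE):
    """Number of properties whose predicted label equals the favourable one."""
    return sum(1 for name, label in favourable.items()
               if predictions[name] == label)


def accept(predictions, min_satisfied=3):
    """True if enough ADMET properties are satisfied to keep the solution."""
    return count_satisfied(predictions) >= min_satisfied


# Made-up classifier outputs for two candidate solutions.
candidate = {"Caco-2": 1, "CYP3A4": 1, "hERG": 1, "HOB": 0, "MN": 0}
rejected = {"Caco-2": 0, "CYP3A4": 1, "hERG": 1, "HOB": 0, "MN": 0}

print(accept(candidate), accept(rejected))
```

A solution that fails the check would be sent back to the optimizer with the updated variables, as described in the text.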
The mathematical expression of the ADMET property constraints is given in equation (8). Finally, we obtained the optimal inhibitory activity value of ERα and the corresponding values of the characteristic variables $x_i$. The results are shown in Table 8.
According to Table 8, under the constraint of the ADMET properties of the compound, the values of the 15 characteristic variables, the optimal bioactivity value (IC50_nM), and the inhibitory bioactivity value (pIC50) were obtained. Characteristic variables with positive values are positively correlated with the biological activity of ERα, those with negative values are negatively correlated with it, and those with a value of 0 have no effect on it. Finally, the functional relationship between the inhibitory activity value and the biological activity value is presented in the following equation:
$$F_2 = -\log_{10}\left(F_1 \times 10^{-9}\right) = 9 - \log_{10} F_1 \tag{9}$$
In equation (9), $F_1$ is the value of biological activity (IC50_nM, in nM) and $F_2$ is the value of inhibitory activity (pIC50).
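The two reported values can be checked against each other with the usual definition of pIC50 as the negative base-10 logarithm of IC50 expressed in mol/L. The sketch below assumes that definition and that IC50_nM is given in nanomolar; the function names are hypothetical.

```python
from math import log10


def pic50_from_ic50_nm(ic50_nm):
    """pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nmol/L)."""
    return 9 - log10(ic50_nm)


def ic50_nm_from_pic50(pic50):
    """Inverse conversion: IC50 in nmol/L from pIC50."""
    return 10 ** (9 - pic50)


value = pic50_from_ic50_nm(34.6)
print(round(value, 2))  # 7.46
```

An IC50_nM of 34.6 therefore corresponds to a pIC50 of about 7.46, in agreement with the values reported in the paper; a lower IC50 gives a higher pIC50, which is why pIC50 is maximized.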
It is found that the 15 characteristic variables satisfy more than 3 ADMET properties when they are put into the constraint. This demonstrates that the low-dimensional features screened by the graph model not only have strong descriptive and distinguishing ability but also perform well on the ADMET properties. It can be seen that the quantitative analysis model of ERα biological activity and the support vector machine classification model based on ADMET properties can quickly and accurately screen effective compounds from anti-breast cancer drug candidates. The running time of the experimental program is 0.1369 s, and the number of iterations is 22. From the analysis of time complexity and iterative process, sequential least square programming (SLSQP) algorithm constructed in this paper is

Conclusions
In view of the increasing number of breast cancer patients, the wide variety of anti-breast cancer candidate drugs, and the great pressure on doctors in choosing among them, this paper addresses the problem of screening anti-breast cancer candidate drugs and proposes an optimal modeling method based on graph model feature extraction. Compared with traditional feature extraction methods (such as principal component analysis and random forest), the graph model feature extraction method proposed in this paper addresses the large error and low accuracy of existing methods on the evaluation indexes. At the same time, the classification prediction model constructed in this paper is used to detect effectively whether a drug will cause adverse reactions in the human body when screening candidate drugs. Therefore, with the method of this paper, we can efficiently and intelligently predict whether a compound can become a candidate drug for the treatment of breast cancer and assist doctors in accurately selecting effective anti-breast cancer drugs for breast cancer patients, which is of great significance for improving the cure rate of breast cancer.

Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.