Classification Prediction of Breast Cancer Based on Machine Learning

Breast cancer is one of the most common and deadly types of cancer in the world. Based on machine learning algorithms such as XGBoost, random forest, logistic regression, and K-nearest neighbor, this paper establishes different models to classify and predict breast cancer, so as to provide a reference for its early diagnosis. In medical diagnosis, recall indicates the proportion of malignant samples that are detected, which is of great significance for the classification of breast cancer, so this article takes recall as the primary evaluation index and also considers precision, accuracy, and F1-score to evaluate and compare the prediction effect of each model. To eliminate the influence of differing feature scales on the models, the data are standardized. To find the optimal feature subset and improve model accuracy, 15 features were selected as model input through the Pearson correlation test. The K-nearest neighbor model uses cross-validation with recall as the evaluation index to select the optimal k value. To address the imbalance between positive and negative samples, stratified sampling is used to draw the training set and test set proportionally from each category. The experimental results show that the prediction effect of the same model changes under different dataset divisions (8 : 2 and 7 : 3). Comparative analysis shows that the XGBoost model established in this paper (with the training set and test set divided 8 : 2) performs best, with recall, precision, accuracy, and F1-score of 1.00, 0.960, 0.974, and 0.980, respectively.


Introduction
In the past ten years, the incidence of breast cancer in China has increased by 47%, it continues to rise year by year, and patients are becoming gradually younger [1]. The pathogenesis of breast cancer is related to personal hormones, family history, marriage, and childbearing history [2]. Breast cancer is not easy to detect in the early stage and is characterized by an early age of onset but late presentation [3,4]. At present, the diagnosis of breast cancer is mainly based on three methods: puncture cytology [5], ultrasound scan [6], and mammogram X-ray [7]. The earlier breast cancer is caught, the more likely it is to be cured and the better the prognosis. Therefore, regular examination and early diagnosis are necessary for the prevention and timely detection of breast cancer.
In the medical field, models established through machine learning methods can assist doctors in improving the detection rate of cancer, so as to achieve early detection and early treatment. Machine learning methods have yielded good results in the diagnosis of cancer [8,9]. Wu et al. [10] observed cell morphology under the microscope and found that there were obvious differences between the parameters of breast cancer cells and those of normal healthy cells. This finding provides a theoretical basis for many studies. While many machine learning methods are currently applied to breast cancer cell classification, no single algorithm can be applied to all problems. Each type of machine learning algorithm has its own areas of strength, so the choice of algorithm differs from scenario to scenario.
Shen et al. [11] used the XGBoost model to classify and predict breast cancer, reaching an accuracy of 97.86% and a recall of 95.83%. Deng et al. [12] used the XGBoost algorithm to classify and predict breast cancer with an accuracy of 0.96 and a recall of 0.97. Monirujjaman Khan et al. [13] used multiple machine learning models to identify breast cancer; random forest, decision tree, K-nearest neighbor, and logistic regression were the algorithms with the higher F1-scores, at 96%, 95%, 90%, and 98%, respectively. Bhardwaj et al. [14] used multilayer perceptron (MLP), K-nearest neighbor (KNN), genetic algorithm (GP), and random forest (RF) to classify benign and malignant breast cancer cells, and the experimental results showed that the optimal classifier was RF with a classification accuracy of 96.24%. Dong and Ma [15] studied possible markers of triple-negative breast cancer and used machine learning algorithms to predict whether people had triple-negative breast cancer. The results show that the accuracy of the support vector machine (SVM) classification prediction model reaches 97.8%. To improve the accuracy of breast cancer identification methods and improve machine learning algorithms, Wang et al. [16] proposed an SVM-based weighted area under the receiver operating characteristic curve ensemble (WAUCE) learning model for breast cancer diagnosis, using C-SVM and V-SVM with 6 kernel functions to increase the diversity of the base model set and comparing different decision results with the WAUCE model. The results show that on the small dataset, the proposed WAUCE structure reduces the variance of the diagnostic accuracy by up to 69.23% and improves the accuracy by 0.94%. Zheng et al. [17] tested a hybrid of K-means and support vector machine on the Wisconsin Breast Cancer (WDBC) dataset to extract tumor features and diagnose breast cancer, and the results show that the hybrid algorithm improves the accuracy to 97.38%. Jia et al. [18] applied a population-based optimization algorithm, the Whale Optimization Algorithm (WOA), to adjust the parameters of the SVM model intelligently, and the experimental results show that the WOA-SVM model performs significantly better than the traditional breast cancer recognition model, with an accuracy of 97.5%. To solve the problem of overfitting of machine learning techniques in breast cancer classification, Singh et al. [19] proposed a functional link artificial neural network (FLANN) and found experimentally that the model has high accuracy for early diagnosis of breast cancer. Mahesh et al. [20] proposed an XGBoost ensemble technique for breast cancer prediction based on known feature patterns, first using the synthetic minority oversampling technique (SMOTE) to deal with data imbalance and noise and then combining a naïve Bayes classifier, a decision tree classifier, and a random forest, respectively, with XGBoost to classify the data. According to the experimental analysis, the XGBoost-Random Forest ensemble classifier has an accuracy of 98.20% in the early detection of breast cancer.
Based on XGBoost, random forest, logistic regression, K-nearest neighbor, and other machine learning methods, this paper establishes different models to classify and predict breast cancer, which provides a reference for early diagnosis of breast cancer. When applying machine learning models to breast cancer cell diagnosis, most studies evaluate model quality by precision, accuracy, and F1-score while ignoring the medical diagnostic significance of recall. Recall indicates the proportion of malignant breast cancer cells that are correctly predicted: the higher the recall, the greater the probability that malignant cells are detected. Therefore, this article takes recall as the primary index and also considers precision, accuracy, and F1-score to evaluate the models used.
In the modeling process, data preprocessing is a very important part, and the effect of the predictive model differs depending on the processing method. To eliminate the influence of differing feature scales on the models, the data are standardized. To find the optimal feature subset and improve model accuracy, features were selected according to the Pearson correlation coefficient between each feature variable and the target variable. To address the imbalance between positive and negative samples, stratified sampling is used to draw the training set and test set proportionally from each category. Considering that the prediction effect of machine learning models varies under different dataset divisions, this paper uses two dataset divisions (8 : 2 and 7 : 3) as two sets of experiments to observe the prediction effect of the models established here.

Data Introduction.
The data set used in this paper is the breast cancer data in the UCI data set, which was provided by Dr. William from the Clinical Medicine Research Institute of the University of Wisconsin [21]. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. The data set contains 569 experimental samples, including 357 benign samples and 212 malignant samples of breast cancer. For the cells extracted from each experimental object, the following ten features of the nucleus are collected: radius (mean of the distances from the center to points on the perimeter), perimeter, smoothness (local variation in radius lengths), area, compactness (perimeter^2 / area - 1.0), concavity (severity of concave portions of the contour), symmetry, texture (standard deviation of gray-scale values), concave points (number of concave portions of the contour), and fractal_dimension ("coastline approximation" - 1). The mean, standard error, and "worst" or largest value (mean of the three largest values) of these features were computed for each image, resulting in 30 features. The classification label represents the type of breast cancer. Therefore, the sample data set contains a total of 30 features and one sample label (malignant or benign).

Data Standardization.
Observing the value range of each feature shows that the data values of different features differ greatly. In some models, differences in feature scale have a great influence on the prediction effect. For example, the k-nearest neighbor algorithm, which classifies by distance, needs all features on a consistent scale, so the data need to be standardized before modeling. Other models, such as the random forest algorithm, are less affected by feature scale. To keep the experiments comparable, the data for all models are treated in the same way.
When sample features are measured on different scales, the commonly used way to make them dimensionless is data standardization, typically Min-max standardization or Z-score standardization. Z-score standardization is preferable when the data contain outliers beyond the expected value range or when the maximum and minimum values of some indicators are unknown.
In this paper, according to the characteristics of the WDBC breast cancer dataset, Z-score standardization was selected to process the data. After Z-score standardization [22], each feature has a mean of 0 and a variance of 1. The formula for Z-score standardization is as follows:
$$x^{*} = \frac{x - \mathrm{mean}}{\mathrm{std}},$$
where mean is the mean of the sample feature data and std is the standard deviation of the sample feature data. The data standardization results are shown in Table 1.
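As an illustration of the Z-score step described above, the following minimal sketch standardizes a single feature column in plain Python. The function name `z_score`, the use of the population standard deviation, and the toy values are assumptions made for this example; the paper does not state which variant of the standard deviation was used.

```python
from statistics import mean, pstdev


def z_score(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu = mean(column)
    # Assumption: population standard deviation (divide by n).
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Toy values for illustration only, not taken from the WDBC data.
toy_column = [10.0, 12.0, 14.0, 16.0, 18.0]
scaled = z_score(toy_column)
```

After the call, `scaled` has mean 0 and standard deviation 1, so a distance-based model such as K-nearest neighbor no longer favors features with large raw ranges.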

Feature Selection.
Feature selection is an important part of data preprocessing: it seeks the optimal feature subset, reducing redundant and useless features to improve the accuracy of the model. Feature selection methods are generally divided into the filter method, the wrapper method, and the embedded method. The filter method is independent of the algorithm used later in the study and has high computational efficiency and strong generalization ability [23], so this paper uses the filter method; the common filter methods are summarized in Table 2.
The breast cancer data after zero-mean normalization meet the requirements of the Pearson correlation coefficient test in the filter method, so in this paper Pearson's correlation coefficient [24] is used to test the correlation between each feature and the target variable. Pearson's correlation coefficient formula is as follows:
$$\rho_{X_1 X_2} = \frac{\mathrm{Cov}(X_1, X_2)}{\sqrt{D X_1}\,\sqrt{D X_2}} = \frac{E(X_1 X_2) - E X_1 \, E X_2}{\sqrt{D X_1}\,\sqrt{D X_2}},$$
where $\rho_{X_1 X_2}$ represents the correlation coefficient between two variables, $\mathrm{Cov}(X_1, X_2)$ represents the covariance between the two variables, $E X_1$ represents the expectation of a variable, and $D X_1$ represents the variance of a variable. According to the Pearson correlation coefficient, there are 15 features whose correlation coefficient with the target variable has an absolute value greater than or equal to 0.5. These 15 feature variables are used for model construction, and the 15 features and the target variable are shown in Table 3.
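The filtering rule described above (keep a feature when the absolute Pearson correlation with the target is at least 0.5) can be sketched as follows. The helper names `pearson` and `select_features` and the toy columns are hypothetical; only the 0.5 threshold comes from the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / sqrt(var_x * var_y)


def select_features(features, target, threshold=0.5):
    """Return the names of features with |rho| >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]


# Toy data for illustration only (1 = one class, 0 = the other).
toy_features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # rises with the target
    "feature_b": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0],  # weakly related
}
toy_target = [0, 0, 0, 1, 1, 1]
selected = select_features(toy_features, toy_target)
```

In this toy case only `feature_a` passes the threshold (its correlation is about 0.88, while `feature_b` has about -0.33).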

Model Construction
In this paper, the categories of breast cancer are predicted with XGBoost, random forest, logistic regression, and K-nearest neighbor models, respectively. Malignant breast cancer is regarded as the positive sample, while benign breast cancer is regarded as the negative sample.
To address sample imbalance, this paper uses a stratified sampling method [25] to draw the training set and test set in proportion from each class of sample data. Stratified sampling is also called type sampling. The sample population is divided into mutually independent subpopulations, and random sampling is carried out proportionally within each subpopulation. Stratified sampling draws a more representative sample and is more suitable for unbalanced data.
Different data set partitions may lead to different model effects. Therefore, this paper carries out two groups of experiments with different divisions of the sample data set. The first group divided the data set into a training set and test set in a ratio of 8 : 2, and the second group in a ratio of 7 : 3. The performance of the four algorithms is then observed under both partitions.
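The stratified split described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `stratified_split`, the fixed seed, and per-class rounding are choices made for the example, so the resulting set sizes may differ by a sample from those of a library routine.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio, seed=0):
    """Split sample indices so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: round the test count separately within each class.
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts from the data description: 357 benign, 212 malignant.
labels = ["B"] * 357 + ["M"] * 212
train_idx, test_idx = stratified_split(labels, test_ratio=0.2)
malignant_in_test = sum(1 for i in test_idx if labels[i] == "M")
```

With these counts, the 8 : 2 sketch places 42 of the 212 malignant samples and 71 of the 357 benign samples in the test set, preserving the class ratio of the full data set.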

Evaluation Indicators.
In this study, accuracy, precision, recall, and F1-score [26,27] were used to evaluate the prediction effect of the model. Given the particular demands of medical diagnosis, it is desirable that all malignant breast cancer be predicted. Therefore, recall is taken as an important evaluation index here. The higher the recall, the higher the proportion of malignant breast cancer that is predicted. The model classification results can be summarized in a confusion matrix [28], as shown in Table 4.
Here, TP is a true positive, indicating the number of positive samples predicted as positive. TN is a true negative, indicating the number of negative samples predicted as negative. FP is a false positive, indicating the number of negative samples predicted as positive, which is called a type 1 error. FN is a false negative, indicating the number of positive samples predicted as negative, which is called a type 2 error.
Precision, abbreviated as P, is defined on the predicted results: it represents how many of the samples predicted as positive are really positive. The formula is
$$\mathrm{Precision} = \frac{TP}{TP + FP}.$$
Recall, also known as the true positive rate, is defined on the original samples: it represents how many of the positive samples are predicted correctly. The formula is
$$\mathrm{Recall} = \frac{TP}{TP + FN}.$$
Accuracy, referred to as A, is the proportion of all correctly predicted samples (both positive and negative) in the total sample. The formula is
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
Computational Intelligence and Neuroscience
F1-score is the harmonic mean of precision and recall and is used because the two indexes tend to trade off against each other. It is a comprehensive evaluation index, and a higher value indicates more effective classification results. The formula of the F1-score is
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
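To make the four indicators concrete, the sketch below computes them from the cells of a confusion matrix. The function name and the counts are hypothetical: the paper does not publish its confusion matrices, and the counts here were merely chosen so that the rounded results coincide with the values reported for XGBoost under the 8 : 2 split.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts (assumption, not taken from the paper's tables).
metrics = classification_metrics(tp=72, fp=3, fn=0, tn=39)
rounded = {name: round(value, 3) for name, value in metrics.items()}
```

Because FN is 0 in this example, recall is exactly 1 even though three false positives keep precision at 0.96, which illustrates why the two indexes need to be read together.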

Prediction Model of Breast Cancer Based on XGBoost.
XGBoost, short for extreme gradient boosting, is a Boosting algorithm [29]. Both XGBoost and random forest are ensemble algorithms based on the decision tree. Unlike the Bagging algorithm, the Boosting algorithm builds weak learners one by one, accumulating multiple weak learners through continuous iteration [30]. The objective function is
$$\mathrm{Obj} = \sum_{i=1}^{m} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$
where i represents the ith sample, m represents the sample size corresponding to the kth decision tree, and K represents the number of weak learners established so far. The first part of the objective function is the loss function, which measures the difference between the predicted value and the true value. The second part represents the complexity of the model. To optimize the tree at the t-th iteration, a Taylor expansion is performed on the objective function. Then, the objective function can be converted into
$$\mathrm{Obj}^{(t)} \approx \sum_{i=1}^{m} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t),$$
where $g_i$ and $h_i$ are the first and second derivatives of the loss function $l(y_i, \hat{y}_i^{(t-1)})$ with respect to $\hat{y}_i^{(t-1)}$. The addition of the regular term reduces the variance of the model and makes the model learned from the training set simpler, so as to prevent overfitting.
The XGBoost algorithm is used to train the model on the training data set. During training, the parameters are adjusted to obtain a better parameter set, and finally the optimal prediction model is obtained. The model was then used to predict breast cancer categories on the test set.
When the data set was divided into a training set and test set by 8 : 2, the accuracy, precision, recall, and F1-score of the XGBoost model were 0.974, 0.960, 1.00, and 0.980, respectively. When the data set was divided into a training set and test set by 7 : 3, the accuracy, precision, recall, and F1-score of the XGBoost model were 0.959, 0.946, 0.991, and 0.968, respectively. The results show that the XGBoost model has better prediction performance when the data set is divided by 8 : 2. The recall of 1 indicates that the XGBoost model correctly predicted all malignant breast cancers in the test sample, which is very important for medical diagnosis.
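The role of the first and second derivatives in the expanded objective can be illustrated with a small sketch. It rests on assumptions the paper does not state: a logistic loss on the raw score, an L2 penalty `reg_lambda` on the leaf weights, and the closed-form leaf weight -G / (H + lambda) from the standard XGBoost derivation. All function names are hypothetical.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def grad_hess_logloss(y_true, raw_score):
    """First and second derivative of the logistic loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def optimal_leaf_weight(grads, hessians, reg_lambda=1.0):
    """Leaf weight minimizing sum(g*w + 0.5*h*w^2) + 0.5*reg_lambda*w^2."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Two positive samples whose current raw score is 0 (probability 0.5).
pairs = [grad_hess_logloss(1, 0.0) for _ in range(2)]
grads = [g for g, _ in pairs]
hessians = [h for _, h in pairs]
weight = optimal_leaf_weight(grads, hessians)
```

Here each sample contributes g = -0.5 and h = 0.25, so the leaf weight is 1.0 / 1.5, a positive correction that pushes the prediction toward the positive class. A larger `reg_lambda` shrinks this correction, which is the variance-reducing effect of the regular term.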

Prediction Model of Breast Cancer Based on Random Forest.
Random forest is a supervised learning algorithm that integrates multiple trees through the Bagging idea [31][32][33]. The bootstrap method is used to draw training sample sets from the original sample data, and a decision tree model is trained on each training set. Finally, all base classifiers vote, and the category with the most votes is the final category.
When the data set was divided into a training set and test set by 8 : 2, the accuracy, precision, recall, and F1-score of the random forest model were 0.965, 0.947, 1.00, and 0.973, respectively. When the data set was divided into a training set and test set by 7 : 3, the accuracy, precision, recall, and F1-score were 0.953, 0.946, 0.981, and 0.963, respectively. The results show that the random forest model has better prediction performance when the data set is divided by 8 : 2. The recall of this model was also 1 under that division, indicating that the random forest model also correctly predicted all malignant breast cancer.
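The two Bagging ingredients named above, bootstrap sampling and majority voting, can be sketched independently of any tree learner. The function names and the toy votes are assumptions for illustration; training the individual decision trees is left out.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement from range(n_samples)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def majority_vote(predictions):
    """Return the class predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
sample = bootstrap_sample(10, rng)  # indices for one tree's training set

# Hypothetical votes of five trees for one test sample.
vote = majority_vote(["M", "B", "M", "M", "B"])
```

Because indices are drawn with replacement, some samples typically appear several times in `sample` while others are left out, which is what makes the individual trees differ from one another.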

Prediction Model of Breast Cancer Based on Logistic Regression.
Logistic regression (LR) is one of the most widely used methods in medical data analysis [34,35]. Logistic regression applies a sigmoid function to the output of a multiple linear regression model. The basic form is
$$h_\theta(x) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 x_1 + \cdots + \theta_k x_k)}},$$
where $\theta_0, \theta_1, \ldots, \theta_k$ are similar to the regression coefficients in multiple linear regression.
When the data set was divided into a training set and test set by 8 : 2, the accuracy, precision, recall, and F1-score of the logistic regression model were 0.947, 0.923, 1.00, and 0.960, respectively. When the data set was divided into a training set and test set by 7 : 3, the accuracy, precision, recall, and F1-score were 0.947, 0.922, 1.00, and 0.960, respectively. The results show that the prediction effect of the logistic regression model is essentially the same under the two partitions. The recall was also 1, indicating that all malignant breast cancer was correctly predicted by the logistic regression model.
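The sigmoid form of logistic regression described above can be written out directly. The following sketch only evaluates a model with given coefficients; the coefficient values, the 0.5 decision threshold, and the function names are assumptions for illustration, and fitting the coefficients is not shown.

```python
from math import exp


def predict_proba(theta, x):
    """Sigmoid of theta[0] + theta[1]*x[0] + ... + theta[k]*x[k-1]."""
    z = theta[0] + sum(t * v for t, v in zip(theta[1:], x))
    return 1.0 / (1.0 + exp(-z))


def predict_label(theta, x, threshold=0.5):
    """Return 1 (positive) when the predicted probability reaches the threshold."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical coefficients: intercept 0, weights 1 and -1.
theta = [0.0, 1.0, -1.0]
p_balanced = predict_proba(theta, [2.0, 2.0])  # linear part is 0
label = predict_label(theta, [3.0, 1.0])       # linear part is 2
```

Lowering `threshold` below 0.5 would label more samples as positive, trading precision for recall, which is relevant when recall is the primary index.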

Prediction Model of Breast Cancer Based on K-Nearest Neighbor.
The K-nearest neighbor algorithm [36,37] projects samples into a higher dimensional space according to their variable values, where similar samples cluster together. Euclidean distance is commonly used to measure distances in k-nearest neighbors, and it is calculated as follows:
$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{n} \left( x_i^{l} - x_j^{l} \right)^{2}},$$
where $x_i$ and $x_j$ represent two different samples, $n$ is the number of attributes, $x_i^{l}$ represents the value of sample i on attribute l, and $x_j^{l}$ represents the value of sample j on attribute l.
The three basic elements of the k-nearest neighbor algorithm are the distance measurement, the k-value selection, and the classification decision rule.
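The three elements listed above map onto a very small implementation: a distance function, a value of k, and a majority-vote decision rule. The sketch below is illustrative only; the function names and the two-dimensional toy points are assumptions.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two samples with the same attributes."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_x, train_y, query, k=5):
    """Label the query by majority vote among its k nearest training samples."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy standardized points for illustration only.
train_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
train_y = ["B", "B", "B", "M", "M", "M"]
prediction = knn_predict(train_x, train_y, query=(0.95, 1.0), k=3)
```

The query lies next to the three "M" points, so all three nearest neighbors vote "M". Because every attribute enters the distance with equal weight, the features should be standardized first.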
To select the k value in the k-nearest neighbor algorithm, this paper uses tenfold cross-validation [38,39] with recall as the model evaluation index. The value of k ranges from 1 to 40, and tenfold cross-validation is performed for each k. The k value with the maximum recall is the optimal k value. The recall for different k values under tenfold cross-validation is shown in Figure 1.
It can be seen from Figure 1 that as the k value increases, the recall decreases. The recall is largest when k = 3 and k = 5. Because a k value that is too small easily leads to overfitting, k is set to 5 here.
When the data set was divided into a training set and test set by 8 : 2, the accuracy, precision, recall, and F1-score of the k-nearest neighbor model were 0.912, 0.888, 0.986, and 0.934, respectively. When the data set was divided into a training set and test set by 7 : 3, the accuracy, precision, recall, and F1-score were 0.930, 0.906, 0.991, and 0.946, respectively. The results show that the k-nearest neighbor model has better prediction performance when the data set is divided by 7 : 3.
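The selection rule described above (take the k with the highest cross-validated recall and, on a tie, prefer the larger k to limit overfitting) can be sketched as follows. The recall values are hypothetical placeholders, not readings from Figure 1, and `select_k` is a name introduced for this example.

```python
def select_k(recall_by_k):
    """Return the largest k among those with the maximum mean recall."""
    best_recall = max(recall_by_k.values())
    return max(k for k, recall in recall_by_k.items() if recall == best_recall)


# Hypothetical mean recall per k from tenfold cross-validation.
recall_by_k = {1: 0.970, 3: 0.980, 5: 0.980, 7: 0.975, 9: 0.970}
chosen_k = select_k(recall_by_k)
```

With these placeholder values, k = 3 and k = 5 tie for the highest recall, and the rule returns 5.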

Comparison and Analysis
To better understand the performance of the models established in this paper, the experiments were carried out in the Python 3.9.7 development environment on the breast cancer data provided by Dr. William of the University of Wisconsin Clinical Medical Research Institute.
The experimental environment is the Windows 11 operating system, the processor is an Intel(R) Core(TM) i5-1155G7 @ 2.50 GHz, and the memory is 8.00 GB.
The experimental parameters of each model are shown in Table 5, and the following three comparative analyses are carried out: (1) performance comparison of each model in this paper when the data set is divided into a training set and test set in 8 : 2; (2) performance comparison of each model in this paper when the data set is divided into a training set and test set in 7 : 3; (3) comparison with some models in the literature [11][12][13][14].
(1) When the data set is divided into a training set and test set by 8 : 2, the performance of each model is shown in Table 6. As can be seen from Table 6, when the training set and test set are divided in a ratio of 8 : 2, the accuracy of all four machine learning methods is above 0.9, and that of XGBoost and RF is above 0.95. The accuracy of XGBoost is the highest, at 0.974. The precision of the KNN model is the lowest, at only 0.888, while XGBoost has the highest precision, 0.960. As for recall, XGBoost, random forest, and logistic regression all reach 1. The K-nearest neighbor algorithm has the lowest recall, but it is also above 0.95. For this study, recall represents the proportion of malignant breast cancer samples that were correctly diagnosed. In medicine, it is very important that an existing disease be diagnosed. The consequence of a missed diagnosis is delayed treatment, which may cause patients to miss the best time for treatment. This is much more serious than being diagnosed with a disease without having one. Therefore, recall is a very important indicator in the field of disease diagnosis. Here, the recall of XGBoost, random forest, and logistic regression is 1, indicating that all malignant breast cancers in the test samples were diagnosed. The F1-scores of XGBoost, random forest, and logistic regression are all above 0.95. The F1-score of the XGBoost algorithm is the highest, reaching 0.980, and the K-nearest neighbor model has the lowest F1-score, 0.934. Taking the four indicators together, when the data set is divided into a training set and test set by 8 : 2, the overall effect of the XGBoost model is better than that of the other three models. XGBoost not only achieved a recall of 1 but also achieved values of 0.95 or more on the other three metrics.
(2) When the data set is divided into a training set and test set in 7 : 3, the performance of each model is shown in Table 7.
(3) As can be seen from Table 8, the better performing of the five models are the logistic regression model of the literature [13] and the XGBoost model established in this paper. The recall and accuracy of the model in the literature [13] are 0.99 and 0.98, respectively, and the recall and accuracy of the model in this paper are 1.00 and 0.974, respectively. Compared with the literature [13], the model in this paper has the higher recall. Because recall in medical diagnosis indicates the probability of detecting malignant cancer cells, which is of great significance for the classification of breast cancer cells, the XGBoost model established in this paper has the better prediction effect and can be used as a medical tool to assist doctors in making treatment plans for breast cancer patients.

Conclusion
This paper predicted the categories of breast cancer, proceeding from data preprocessing to feature selection and then to the establishment of the models. Finally, the prediction results were compared and analyzed from several aspects. In this paper, recall is taken as an important index so that malignant breast cancer samples are predicted as completely as possible. The original data set contained 30 features, and 15 features were selected as the input of the model through the Pearson correlation test. Before model construction, the data were standardized to eliminate the impact of differing feature scales on the models. To address the imbalance between positive and negative samples, the stratified sampling method is used to draw the training set and test set proportionally from each category of data. When selecting the optimal k value in the k-nearest neighbor model, recall is used as the model evaluation index, so that the k value with the highest recall is the optimal value.
The models are compared and analyzed from three aspects. The results are as follows: (1) When the training set and the test set are divided by 8 : 2, the recall of XGBoost, random forest, and logistic regression is 1, meaning that all malignant breast cancer is predicted, and the recall of K-nearest neighbor is slightly lower, at 0.986. The accuracy, precision, and F1-score of the XGBoost model, at 0.974, 0.96, and 0.98, respectively, are better than those of random forest and logistic regression, so the XGBoost model is selected as the final prediction model under the 8 : 2 division. (2) When the training set and the test set are divided by 7 : 3, the values of the four evaluation indicators for XGBoost and random forest decreased, while those for the K-nearest neighbor model improved. Only the logistic regression model reached a recall of 1, while the other models were above 0.98, so logistic regression gave the best prediction effect under this division, with accuracy, precision, and F1-score of 0.947, 0.922, and 0.96, respectively. (3) The experiments show that the prediction effect of a model changes under different divisions. Comparing the optimal models of the two sets of experiments, both have a recall of 1, but the accuracy, precision, and F1-score of the XGBoost model (8 : 2 division) are higher than those of the logistic regression model (7 : 3 division), so the XGBoost model with the 8 : 2 division works best among the models established in this paper.
In addition, compared with the models in the literature [11][12][13][14], the XGBoost model established in this paper has a better effect and can accurately identify malignant breast cancer cells. However, this research is limited to numerical datasets, and in the future, we will try to use deep learning algorithms and apply various feature extraction techniques to image data (such as X-ray images) to obtain better classification results.

Data Availability
The data that support the findings of this study are openly available in the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.

Conflicts of Interest
The authors declare that they have no conflicts of interest.