Medical and Health Data Classification Method Based on Machine Learning

This paper studies the information contained in medical and health data using machine learning algorithms. Random forest and related algorithms are used to train and fit models on health data. The results show that the algorithm proposed in the paper can improve health data classification and can provide technical support for the improvement of medical data classification.


Introduction
With the development of China's medical industry, the medical market has become increasingly complicated. Establishing a sound medical credit system is one of the important means of regulating the medical market. The lack of standard measures for participants in China's medical market has led to frequent breaches of trust, such as registration breaches [1]. This seriously wastes limited medical resources. This article studies dishonest behaviour in medical treatment. The purpose of the research is to strengthen the medical industry's management of market participants and to raise the market access threshold and management level. In recent years, advances in computer technology have driven great progress in data mining and machine learning. Using machine learning algorithms to exploit massive amounts of data effectively can enhance the value of the data. When the algorithm is applied in the medical field, the historical behaviour data of medical market participants can be used to predict whether there is a risk of dishonesty. This assists medical market managers in making decisions.
This paper studies the decision tree and random forest algorithms. Because the random forest algorithm is based on the ensemble learning idea, it is less sensitive to noise in the training data set, which reduces the risk of overfitting. The simulation results also show that this method performs better than logistic regression and the k-nearest neighbor algorithm in identifying dishonest behaviours [2]. This has important reference significance for the establishment of the medical credit system.

Decision Tree Algorithm.
In recent years, decision trees have been one of the most widely used algorithms in machine learning. Compared with neural networks, decision trees are flexible and highly interpretable. Flexibility means that the algorithm designer can prune the tree structure as desired to improve the algorithm's performance. Interpretability means that each internal node of the decision tree makes its decision according to an explicit decision-making standard. The decision tree is composed of three parts: directed edges, internal nodes, and leaf nodes. The basic process is shown in Figure 1.
It can be seen from Figure 1 that a classification tree divides a data set into different classes $C_i$ using different feature dimensions A. Different classification tree algorithms use different node-splitting criteria [3]. The tree in Figure 1 is based on information gain, which is defined as follows.
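First, define information entropy. The information entropy of data set D is defined as

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \tag{1}$$

where m is the number of classes and $p_i$ is the proportion of samples in D that belong to class $C_i$. A feature A that partitions D into subsets $D_1, \dots, D_v$ has conditional information entropy $\mathrm{Info}_A(D)$ for D:

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \tag{2}$$

The information gain of the feature is then

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{3}$$

Different types of decision trees use different node decision criteria. This article also uses the Gini coefficient as the decision criterion [4]. The decision tree in this case is called CART. For multiclassification problems with K different categories, let $p_k$ be the probability that the current sample belongs to category k. The Gini coefficient can then be defined according to the probability distribution:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2 \tag{4}$$

For a sample set D, the Gini coefficient of the sample can be written as

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left(\frac{|C_k|}{|D|}\right)^2 \tag{5}$$

where $C_k$ is the subset of D that belongs to category k. The sample set D is divided into two subsets $D_1$ and $D_2$ according to whether feature A takes a given value a:

$$D_1 = \{(x, y) \in D \mid A(x) = a\}, \qquad D_2 = D \setminus D_1 \tag{6}$$

The Gini coefficient of D divided according to feature A is

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2) \tag{7}$$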
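The node criteria described above (information entropy, information gain, and the Gini coefficient of a binary split) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names and the toy `feature` and `labels` data are hypothetical, and features are treated as categorical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information entropy Info(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """Gain(A) = Info(D) - Info_A(D) for a categorical feature A."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: size-weighted entropy of each subset D_j.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gini(labels):
    """Gini coefficient of a sample set: 1 - sum_k (|C_k| / |D|)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(feature_values, labels, a):
    """Gini coefficient of the binary split D1 = {A == a}, D2 = D - D1."""
    d1 = [y for v, y in zip(feature_values, labels) if v == a]
    d2 = [y for v, y in zip(feature_values, labels) if v != a]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


# Hypothetical toy data: one categorical feature and a binary label.
feature = ["late", "late", "ontime", "ontime"]
labels = [1, 1, 0, 0]

print(entropy(labels), info_gain(feature, labels))
print(gini(labels), gini_split(feature, labels, "late"))
```

In this toy data the feature separates the two classes perfectly, so the information gain equals the full entropy of the labels and the Gini coefficient of the split is zero.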

Ensemble Learning and Random Forest.
In practical engineering applications, the ability of a single decision tree is limited. Noise and outliers in the training data can cause the decision tree to overfit, which seriously affects its accuracy in classifying unknown data. Therefore, pruning is required after the decision tree is generated. Random forest (RF) can also be used to reduce overfitting [5]. The random forest is built on the ensemble learning idea of bagging. Bagging is characterized by random sampling of training samples and the combination of multiple classifiers. It includes two steps: selecting samples for model training and classifying with the trained classifiers. The process is shown in Figure 2.
It can be seen from Figure 2 that the difficulty of ensemble learning lies in the random sampling of samples and the design of the combining strategy between different learners. This article uses bootstrapping to extract training samples from the training set T during random sampling. The extraction is divided into k rounds. After k rounds of extraction, k training sets are obtained [6]. When each round draws N samples with replacement from N original samples, the probability that a given sample in the original data is not selected is

$$\left(1 - \frac{1}{N}\right)^{N} \approx \frac{1}{e} \approx 0.368$$

After the training sets are randomly generated, the decision trees are generated. The k training sets can train k decision trees. During the training process, the decision trees do not need to be pruned. After the training is completed, k weak classifiers are obtained. Then, the same weight is assigned to each classifier to obtain a strong classifier [7]. To measure the classification performance of the random forest, we need to define the margin function of the random forest. For the classifier $h(X, \Theta)$, where $\Theta$ is the random vector used to grow a tree, the margin function is

$$mr(X, Y) = P_\Theta\big(h(X, \Theta) = Y\big) - \max_{j \neq Y} P_\Theta\big(h(X, \Theta) = j\big)$$

The variance of the margin function can be denoted as var(mr), and its upper bound is given by the following formula:

$$\mathrm{var}(mr) \le \bar{\rho}\,(1 - s^2)$$

Here s represents the classification strength of a single decision tree, and $\bar{\rho}$ represents the mean correlation between decision trees. The above formula shows that the variance of the margin function of the random forest has an upper limit. This upper limit is lowered when the strength of a single classification tree is increased and the correlation between trees is reduced.
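The bagging steps described above (drawing k bootstrap training sets and combining equally weighted classifiers by voting) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, each bootstrap set is assumed to have the same size as the original data, and the votes passed to `majority_vote` stand in for the outputs of trained trees.

```python
import random
from collections import Counter


def prob_never_selected(n):
    """Probability that one sample is missed by n draws with replacement."""
    return (1 - 1 / n) ** n


def bootstrap_sets(data, k, seed=0):
    """Draw k bootstrap training sets, each the same size as the data."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(k)]


def majority_vote(predictions):
    """Combine equally weighted class votes into one prediction."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical usage: ten sample indices, five bootstrap rounds.
samples = list(range(10))
training_sets = bootstrap_sets(samples, k=5)
print(len(training_sets), len(training_sets[0]))
print(round(prob_never_selected(1000), 4))
print(majority_vote([1, 0, 1]))
```

Because each round leaves out roughly a third of the original samples, the trees see different data, which is one source of the reduced correlation between trees that the bound above rewards.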

Data Input.
This article investigates three parties: patients, hospitals, and medical companies. From the patients' perspective, the random forest algorithm is used to identify dishonest behaviours in order to build a healthy medical platform. First, this article obtained data sets related to citizen credit from public data sets [8]. Because the privacy of medical patients is involved, the residents' credit data used for random forest training could only be obtained from the foreign platform Lending Club. Relevant medical record information was then added to each record to ensure that the data suit the application scenarios required in this article. After the data set is collected, the data are preprocessed according to the process shown in Figure 3 [9].
When dealing with outliers, data items that are seriously inconsistent with logic are eliminated. The method used in the article is Tukey's method. Based on the distribution of the data, this method defines values lying more than 1.5 times the interquartile range beyond the quartiles as outliers and eliminates them. The data are then normalized. After normalization, the magnitude of the data values themselves does not bias the evaluation result, which depends only on the influence of the data attributes. The final step of preprocessing is correlation analysis. We eliminate highly correlated attributes to reduce the dimensionality of the input data while preserving classification accuracy, thereby improving the efficiency of model training and classification.
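The three preprocessing steps described above can be sketched as follows. Several details are assumptions made for illustration rather than choices confirmed by the text: outliers are taken to be values more than 1.5 interquartile ranges outside the first and third quartiles (Tukey's fences), normalization is shown as min-max scaling because the normalization formula is not given here, and the 0.9 correlation threshold and all column names are hypothetical.

```python
from math import sqrt
from statistics import quantiles


def tukey_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max(values):
    """Scale values to [0, 1] (assumed normalization scheme)."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(columns, threshold=0.9):
    """Keep a column only if it is not strongly correlated with a kept one."""
    kept = {}
    for name, col in columns.items():
        if all(abs(pearson(col, other)) < threshold for other in kept.values()):
            kept[name] = col
    return kept


# Hypothetical data: one extreme value and one redundant column.
cleaned = tukey_filter([10, 12, 11, 13, 12, 11, 10, 12, 100])
scaled = min_max([2, 4, 6])
kept = drop_correlated({
    "income": [1, 2, 3, 4],
    "income_k": [2, 4, 6, 8],   # exact multiple of "income"
    "visits": [3, 1, 4, 1],
})
print(cleaned, scaled, list(kept))
```

The filter here drops the extreme value 100, and `drop_correlated` keeps the first column of each strongly correlated group, so the order of the columns decides which of two redundant attributes survives.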

Algorithm Training and Testing.
In the random forest, the relevant parameters of the algorithm must be set reasonably according to the feature vector dimension and the data dimension. This paper analyzes the error of the random forest under different parameters [10]. The error analysis results are shown in Figures 4 and 5, which show the impact of different ntree (the number of trees) and mtry (the number of features considered at each split) on the model's accuracy during training. For random forests, the value of ntree should be large enough to ensure that the model converges during training. Figure 4 shows the model error under different ntree when mtry takes the default value mtry = sqrt(M), where M is the feature dimension. It can be seen that when ntree reaches 1000, the model error drops to a stable level. Figure 5 shows the effect of different mtry on model accuracy when ntree = 1000. When mtry < 6, the model error decreases as mtry increases. When mtry > 6, the model error increases as mtry increases. Therefore, the optimal value of mtry is 6. In addition to mtry and ntree, the important parameters of random forests also include classes. In the subsequent model training and testing, the value of each parameter is shown in Table 1.
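A parameter sweep of the kind behind Figures 4 and 5 can be sketched with a deliberately small, self-contained forest. Everything here is an assumption made for illustration: each "tree" is a one-split decision stump rather than a full unpruned tree, the data are synthetic, and the names `fit_forest`, `ntree`, and `mtry` are hypothetical stand-ins for the number of trees and the number of candidate features per split. It does not reproduce the paper's experiment or its results.

```python
import random
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, feature_ids):
    """Best single-threshold split (lowest weighted Gini) among the features."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, ntree, mtry, seed=0):
    """Train ntree stumps, each on a bootstrap sample and mtry random features."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(m), mtry)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def error_rate(forest, X, y):
    """Fraction of rows misclassified by the equally weighted majority vote."""
    wrong = 0
    for row, label in zip(X, y):
        votes = Counter(predict_stump(s, row) for s in forest)
        wrong += votes.most_common(1)[0][0] != label
    return wrong / len(y)


# Synthetic data (assumption): 4 features, label depends on the first two.
rng = random.Random(1)
X = [[rng.random() for _ in range(4)] for _ in range(120)]
y = [1 if row[0] + row[1] > 1 else 0 for row in X]
X_train, y_train, X_test, y_test = X[:80], y[:80], X[80:], y[80:]

ntree_errors = {n: error_rate(fit_forest(X_train, y_train, n, 2), X_test, y_test)
                for n in (1, 10, 50)}
mtry_errors = {m: error_rate(fit_forest(X_train, y_train, 25, m), X_test, y_test)
               for m in (1, 2, 4)}
print(ntree_errors)
print(mtry_errors)
```

Plotting the two dictionaries against their keys gives curves of the same kind as the figures. In practice a library implementation with full trees would be used, and the error would usually be estimated on out-of-bag samples or by cross-validation rather than on a single held-out split.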
After setting the parameters of the model, we trained and tested the model. To better evaluate the effectiveness of the model in identifying medical dishonesty behavior, this paper uses logistic regression (LR) and the k-nearest neighbor algorithm (k-NN) for comparison. The recognition accuracy of each algorithm is shown in Table 2.
In Table 2, the true positive rate is the proportion of positive samples that the model predicts as positive. The false positive rate is the proportion of negative samples that the model predicts as positive. It can be seen from Table 2 that the true positive rate of the RF algorithm is 12.9%, which is an improvement over k-NN and LR. Its accuracy is 1.4% and 1.3% higher than that of k-NN and LR, respectively. The false positive rate also drops to a certain extent. It can be seen that the RF algorithm performs better when recognizing dishonest medical behaviours.
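The rates compared in Table 2 can be computed from the four cells of a confusion matrix. The sketch below uses the standard names true positive rate, false positive rate, and accuracy. The function name and the toy labels are hypothetical, and the numbers it prints are unrelated to the values in Table 2.

```python
def confusion_rates(y_true, y_pred, positive=1):
    """Return true positive rate, false positive rate, and accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        # Share of actual positives that the model flags as positive.
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        # Share of actual negatives that the model wrongly flags as positive.
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
        "accuracy": (tp + tn) / len(pairs),
    }


# Hypothetical labels: 1 = dishonest behaviour, 0 = honest behaviour.
rates = confusion_rates([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(rates)
```

Reporting the true positive and false positive rates alongside accuracy matters when dishonest cases are rare, because a model that labels every case as honest can still reach a high accuracy.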

Conclusion
This article applies machine learning and data mining to research on medical and health data. The article establishes a prevention and monitoring model based on the random forest algorithm. This article focuses on the input features used in related medical models. In addition to the historical medical information of medical participants, we also introduce the patient's social credit status, which can effectively complement medical records in identifying behavior and preventing dishonesty. The random forest algorithm used in this article can reduce overfitting in the training process and improve the prediction accuracy. The content of this article has certain practical significance for the behavioral norms of medical market participants.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.