Study of TCM Syndrome Identification Modes for Patients with Type 2 Diabetes Mellitus Based on Data Mining

Objective To establish the diagnosis model for syndromes of type 2 diabetes mellitus (T2-DM) and explore symptoms, the pulse and tongue signs, and laboratory indexes related to syndromes of T2-DM. Methods A syndromatologic and laboratory investigation was conducted in 554 T2-DM patients with 58 symptoms, 14 tongue signs, 6 pulse signs, and 12 laboratory indexes. The clinical data on the syndrome were collected and analyzed by using logistic regression analysis, decision tree, and K-nearest neighbor to establish a diagnostic model for effectively distinguishing the typical syndromes in T2-DM patients. Results The most typical syndromes revealed in T2-DM were stomach heat flourishing (SHF) syndrome (261 patients, accounting for 47.1%) and Qi-Yin deficiency (QYD) syndrome (293 patients, 52.9%). According to the clinical data of the patients with these two syndromes, variables including 6 symptoms and signs, 2 pulse signs, 1 tongue sign, and 2 laboratory indicators were introduced into the logistic regression model. All of them were statistically significant. Then, a diagnostic model constructed by QUEST and CHAID algorithms of the decision tree for identifying the two syndromes was proved to have an accurate diagnostic rate of 85.2%. It was found that the following sign and symptoms were effective to differentiate these two syndromes: odor in the mouth, polyphagia, vulnerability to starvation, burning sensation in the stomach, fatigue, limb weakness, slippery and replete pulse, weak pulse, pink tongue, oral glucose tolerance test, and hemoglobin A1C. A classification model constructed by the K-nearest neighbor method to identify the two syndromes showed an accurate diagnostic rate of 88.3%. Three major statistically significant predictors included in the model were slippery and replete pulse, polyphagia, and weak pulse (P < 0.05). 
Conclusion A model for distinguishing the two typical syndromes (SHF syndrome and QYD syndrome) in T2-DM patients was effectively established. This model could help to provide methodological support for the standardization of traditional Chinese medicine (TCM) syndrome differentiation methods.


Introduction
Type 2 diabetes mellitus (T2-DM) is a chronic metabolic disease that causes significant increases in morbidity and mortality [1]. The global prevalence of diabetes in adults has risen from 4.7% in 1980 to 12.8% in 2018 [2]. The prevalence of type 2 diabetes in China is as high as 10.4% and shows an upward trend [3,4]. Traditional Chinese medicine plays an increasingly important role in the differential diagnosis and treatment of T2-DM [5].
Traditional Chinese medicine (TCM) is characterized by syndrome differentiation and treatment, which has been valuable to individualized clinical diagnosis and treatment. However, it is difficult to provide an objective syndrome diagnosis from experience alone. Also, previous clinical studies are not enough to reflect the characteristics of TCM. Therefore, it is crucial to study the objective standards for type 2 diabetes syndrome.
T2-DM is characterized by polydipsia, polyuria, and emaciation. Stomach heat flourishing (SHF) syndrome and Qi-Yin deficiency (QYD) syndrome are typical syndromes of T2-DM in clinical research of traditional Chinese medicine. In the early stage, swift digestion with rapid hungering and polydipsia is the outstanding characteristic of QYD syndrome [6]. As stomach heat consumption and yin injury worsen, the symptoms of polydipsia, polyuria, and emaciation appear one after another. Stomach heat injures the kidney, resulting in kidney qi deficiency and kidney yin deficiency. Deficiency of kidney qi leaves the kidney unable to control urination, so the urine becomes clear and copious. Kidney yin deficiency leads to yang hyperactivity and deficiency fire flaming upward, that is, dizziness and tinnitus, insomnia, weakness of the waist and limbs, hot flashes, night sweats, and other yin deficiency syndromes. Yin damage then extends to yang deficiency, with kidney yang deficiency, impotence, edema, and other symptoms. Deficiency of qi, deficiency of yin, and deficiency of yang can lead to the obstruction of blood flow and blood stasis, and blood stasis can cause many complications. In short, stomach heat is the key link in the pathogenesis of diabetes: a variety of causes produce stomach heat, and stomach heat injures yin and consumes qi, so that both qi and yin are damaged [7]. The evolution of T2-DM can be summarized as beginning with the hyperactivity of stomach heat, followed by injury to both qi and yin (QYD syndrome). SHF syndrome is excess in superficiality, while QYD syndrome is the syndrome of deficiency in origin. Therefore, it is of particular significance to explore these two syndrome types in the clinic.
Furthermore, a growing number of statistical models and data mining methods have been used in medical research [8][9][10]. In recent years, a variety of data analysis methods have also been used in the quantitative and objective diagnosis of TCM syndromes [11,12]. In particular, various researchers have used data mining methods to extract the core attribute indicators of T2-DM syndromes through epidemiological investigation of routine clinical detection indicators of T2-DM. They formed a clear and intuitive judgment mode of the indicators that made up for the inadequacy of traditional statistical methods [13,14].
Our previous study found that the effective combination of clinical symptoms is helpful in the diagnosis of kidney yang deficiency syndrome and kidney yin deficiency syndrome, and that the specific mixture of symptoms can mirror the situation of patients with kidney deficiency [15]. Therefore, it is not only necessary but also possible to establish a data-based syndrome diagnosis model through the reasonable fusion of multiple data analysis methods [16][17][18]. Consequently, on the basis of a large number of clinical samples and a combination of methods, the typical syndrome characteristics and effective index combinations of T2-DM patients were analyzed, and a diagnosis model of two typical syndromes of T2-DM patients was established. In the present study, we established a diagnosis model of syndromes of T2-DM, explored the symptoms, pulse and tongue signs, and laboratory indexes related to SHF syndrome and QYD syndrome by using data mining methods, and compared the diagnostic power of three classification algorithms.

Diagnostic Criteria for T2-DM.
According to the "Guidelines for Prevention and Treatment of Type 2 Diabetes in China" (2017 Edition), China currently adopts the World Health Organization (WHO) (1999) diagnostic criteria for diabetes, and the epidemiological survey used fasting blood glucose or 2 h plasma glucose (2hPG) after a 75 g OGTT [4]. The conditions were as follows: (1) typical diabetes symptoms, such as polyuria, polyphagia, excessive drinking, and unexplained weight loss, together with a plasma glucose level ≥11.1 mmol/L (200 mg/dL); (2) fasting plasma glucose (FPG) ≥7.0 mmol/L; (3) 2 h plasma glucose ≥11.1 mmol/L (200 mg/dL) using a 75 g OGTT. If one of the above three conditions was met, the patient could be included, and the diagnosis could be confirmed by repeated tests.
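The three conditions above can be expressed as a small predicate. This is a minimal sketch rather than the screening procedure actually used in the study; the function name `meets_who_1999_criteria` and its parameter names are hypothetical, and all glucose values are assumed to be in mmol/L.

```python
from typing import Optional


def meets_who_1999_criteria(
    has_typical_symptoms: bool,
    plasma_glucose: Optional[float] = None,
    fasting_glucose: Optional[float] = None,
    ogtt_2h_glucose: Optional[float] = None,
) -> bool:
    """Return True if at least one of the three listed conditions holds.

    All glucose values are in mmol/L; None means "not measured".
    """
    # (1) typical symptoms together with plasma glucose >= 11.1 mmol/L
    cond1 = (
        has_typical_symptoms
        and plasma_glucose is not None
        and plasma_glucose >= 11.1
    )
    # (2) fasting plasma glucose >= 7.0 mmol/L
    cond2 = fasting_glucose is not None and fasting_glucose >= 7.0
    # (3) 2 h plasma glucose >= 11.1 mmol/L after a 75 g OGTT
    cond3 = ogtt_2h_glucose is not None and ogtt_2h_glucose >= 11.1
    return bool(cond1 or cond2 or cond3)
```

Missing measurements are treated as "condition not met" rather than as an error, which is a design choice of this sketch and not something the text specifies.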

TCM Diagnostic Criteria for T2-DM.
The criteria of TCM diagnosis and syndrome differentiation of T2-DM are based on the terms of clinical diagnosis and treatment of TCM syndrome [19], guideline for clinical research of new drugs of TCM [20], diagnostics of TCM [21,22], and internal medicine of TCM [23]. SHF syndrome includes dry mouth, dry throat, thirst, excessive drinking, frequent urination, constipation, polyphagia, vulnerability to starvation, obvious weight loss, yellow fur, and smooth and slippery pulse. Symptoms such as polyphagia, vulnerability to starvation, emaciation, yellow fur, and smooth and slippery pulse are regarded as indications of stomach heat flourishing syndrome. When three or more of the above symptoms are identified, accompanied by reddish tongue with yellowish fur and slippery and rapid pulse, SHF syndrome can be diagnosed. QYD syndrome includes dry mouth, dry throat, tired spirit, thirst with a desire to drink, poor appetite, spontaneous perspiration, thin body, reddish tongue, scanty fur, and weak pulse. When three or more of the above symptoms are identified, accompanied by reddish tongue and weak pulse, QYD syndrome can be diagnosed.

Inclusion Criteria.
The inclusion criteria were as follows: (1) patients aged from 18 to 70 who were able to complete the questionnaire survey; (2) patients who met the diagnostic criteria of T2-DM developed by the WHO in 1999 and were diagnosed with T2-DM at the time of the first diagnosis; (3) all patients had clear consciousness, had normal intelligence, and could accurately understand and answer the questions; (4) all patients signed informed consent; (5) all patients volunteered to participate in the investigation.

Exclusion Criteria.
(1) Patients with type 1 diabetes or other diseases that cause elevated blood glucose, such as gestational diabetes, drug-induced diabetes, and severe liver disease caused by diabetes, stress hyperglycemia, increased glucocorticoids, and severe acute complications of diabetes; (2) patients with diabetic nephropathy stage IV or V and diabetic foot; systolic blood pressure >160 mmHg or diastolic blood pressure >100 mmHg after uncontrolled or controlled blood pressure; (3) patients with complications such as severe heart, lung, liver, acute, and chronic glomerular diseases, renal failure, acute and severe cardiovascular and cerebrovascular diseases, or with other serious primary diseases such as thyroid diseases (onset time <1 month); (4) women who are pregnant or lactating; (5) allergic constitution; (6) psychotic patient; (7) patients suffering from acute metabolic disorders such as trauma, surgery, hyperosmolar coma, and diabetic ketoacidosis in the last month.

Clinical Observation and Test Indexes.
From the perspective of TCM and Western medicine, 58 symptoms and signs, 14 tongue signs, and 6 pulse signs were extracted by analyzing the literature on T2-DM in China National Knowledge Infrastructure (CNKI) and PubMed, and each patient was examined clinically. The general information of each patient included gender, nationality, marital status, age, occupation, education, past medical history, family history, height, weight, and blood pressure. The following laboratory indexes were investigated: fasting plasma glucose (FPG), 2-hour postprandial blood glucose, glycosylated hemoglobin A1c (HbA1C), oral glucose tolerance test (OGTT), total cholesterol (TC), triglyceride (TG), low-density lipoprotein (LDL), high-density lipoprotein (HDL), blood urea nitrogen (BUN), creatinine (CR), urinary glucose (UG), and urinary protein (UP).

Questionnaire Quality Control
(1) Before the clinical and epidemiological investigation, the investigators carefully read the research manual of clinical investigation under the guidance of physicians. Unified training was conducted and the analysis scheme was strictly implemented by an analyst, which reduced measurement bias; (2) at least 2 clinical investigators were assigned to coordinate and supervise the work, and the case data were checked and improved regularly; (3) the questionnaire of T2-DM syndrome epidemiological information was used uniformly.

Data Management and Statistical Analysis.
EpiData 3.1 was used for data management. The data were arranged in a column-wise format with each subject given a sequence identifier. Following the principle of independent double entry, the questionnaire data were keyed in by two people. In this study, continuous data with a normal distribution were expressed as mean ± standard deviation (SD), while continuous data without a normal distribution were expressed as median (lower quartile, upper quartile). Categorical variables were expressed as a number and a percentage. If continuous data met the assumptions of normal distribution and homogeneity of variance, continuous variables were compared between the two groups by unpaired t-test; otherwise, they were compared by the Wilcoxon rank-sum test. Categorical variables were compared between the two groups by the chi-square test or Fisher's exact test. Logistic regression analysis in combination with quick, unbiased, efficient, statistical tree (QUEST) algorithm analysis was used in the study. A total of 90 variables were analyzed in the multivariate analysis to compare the differences between SHF syndrome and QYD syndrome. These 90 variables were defined as independent variables and syndrome as the dependent variable. They were examined in a multivariate model by using forward stepwise maximum likelihood logistic regression to identify the symptoms (α = 0.05). Odds ratios (ORs) were estimated by multivariate logistic regression analysis. As shown in Table 1, the 90 variables were collected, and a complete clinical questionnaire of the TCM symptom set was constructed. All reported P values were those of two-sided tests. Statistical significance was set at P < 0.05.
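For reference, the relationship between a fitted logistic regression coefficient and its odds ratio can be written out. This is the standard textbook form, not a formula reproduced from the study; the coding of SHF syndrome as the outcome Y = 1 and the symbols x_j, β_j, and p are assumptions made for illustration.

```latex
\[
\log\frac{P(Y=1\mid x)}{1-P(Y=1\mid x)} = \beta_0 + \sum_{j=1}^{p}\beta_j x_j,
\qquad
\mathrm{OR}_j = e^{\beta_j}.
\]
```

Under this assumed coding, OR_j > 1 means that the presence of indicator x_j raises the odds of SHF syndrome relative to QYD syndrome, and OR_j < 1 means the opposite.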
The statistical algorithm that selected variables, together with the quick, unbiased, efficient, statistical tree (QUEST) and CHAID algorithm analyses, was used to develop the decision tree models. The QUEST decision tree is a nonparametric procedure that makes no assumptions about the underlying data. This algorithm determines how categorical independent variables can be combined without bias to predict a binary outcome based on "if-then" logic and builds accurate binary trees quickly and efficiently. T2-DM syndrome was considered as the dependent variable, and 90 biological parameters were the independent variables. We set "Parent Node" to 100 and "Child Node" to 50, allowing the tree model to grow sufficiently. Data were analyzed by using SPSS version 25.0 for the logistic regression and decision tree models. The K-nearest neighbor (KNN) method is an algorithm for classifying objects according to the closest training data in the feature space. KNN is an instance-based learning method and one of the simplest algorithms among data mining methods. This method considers the nearest neighbors of each object and assigns the object to a class accordingly [24,25]. In this paper, the 10-fold cross-validation method was employed; that is, the dataset was divided into 10 parts, of which nine were taken in turn as the training set and the remaining one as the test set, and the average value of the results was used as the evaluation value of the algorithm performance.
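The K-nearest neighbor vote and the 10-fold cross-validation described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the software used in the study: the distance metric and the value of k are not stated in the text, so Hamming distance on binary indicators and k = 3 are assumptions, and the function names and toy data are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count positions where two binary indicator vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k: int = 3) -> str:
    """Majority vote among the k training cases closest to `query`."""
    ranked = sorted(range(len(train_x)),
                    key=lambda i: hamming(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def k_fold_indices(n: int, folds: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists; every index is tested exactly once.

    Cases are assigned to folds round-robin without shuffling, which keeps
    the sketch deterministic; real analyses usually shuffle first.
    """
    splits = []
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        splits.append((train, test))
    return splits


def cross_validated_accuracy(x, y, k: int = 3, folds: int = 10) -> float:
    """Average the per-fold accuracy over all folds."""
    scores = []
    for train, test in k_fold_indices(len(x), folds):
        tx = [x[i] for i in train]
        ty = [y[i] for i in train]
        hits = sum(knn_predict(tx, ty, x[i], k) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


# Toy data invented for illustration only.
# Columns: [slippery_and_replete_pulse, polyphagia, weak_pulse]
toy_x = [[1, 1, 0], [0, 0, 1]] * 10
toy_y = ["SHF", "QYD"] * 10
```

Calling `cross_validated_accuracy(toy_x, toy_y)` on this perfectly separable toy set averages ten per-fold accuracies; on real questionnaire data the folds would normally be shuffled and, where possible, stratified by syndrome.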

Multivariate Logistic Regression Analysis of Relevant Symptoms in Patients with the SHF Syndrome and QYD Syndrome.
As shown in Table 2, logistic regression analysis showed that odor in the mouth, polyphagia, vulnerability to starvation, burning sensation in the stomach, slippery and replete pulse, OGTT, and hemoglobin A1C were relevant indicators for the SHF syndrome. In contrast, symptoms such as fatigue, limb weakness, weak pulse, and pink tongue were related to the QYD syndrome. The indicators that differed most significantly between SHF syndrome and QYD syndrome were odor in the mouth, polyphagia, vulnerability to starvation, burning sensation in the stomach, slippery and replete pulse, OGTT, hemoglobin A1C, fatigue, limb weakness, weak pulse, and pink tongue (P < 0.05). As shown in Table 3, the results were based on the symptoms of 554 patients. The logistic regression model classified 497 patients accurately. The diagnostic accuracy was 89.7%, the sensitivity was 89.3%, and the specificity was 90.1%. Among married patients with T2-DM, the proportion with SHF syndrome was higher than that with QYD syndrome.
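The accuracy, sensitivity, and specificity reported above are mutually consistent, which can be checked with a short calculation. This is a hedged worked check: it assumes SHF syndrome is the positive class (261 patients) and QYD syndrome the negative class (293 patients), and the four cell counts are back-calculated from the reported percentages rather than taken from Table 3.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, sensitivity and specificity from a 2x2 table."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Back-calculated counts (an assumption, see the lead-in):
# 233 + 28 = 261 SHF patients, 264 + 29 = 293 QYD patients,
# 233 + 264 = 497 correctly classified patients.
metrics = classification_metrics(tp=233, fn=28, tn=264, fp=29)
print({name: round(100 * value, 1) for name, value in metrics.items()})
# {'accuracy': 89.7, 'sensitivity': 89.3, 'specificity': 90.1}
```

If QYD syndrome were treated as the positive class instead, the sensitivity and specificity values would simply swap roles.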

QUEST Algorithm of Decision Tree Analysis: Establishment of the Identification Model on T2-DM Syndrome and Validation.
The identification models of T2-DM syndrome were constructed by the QUEST decision tree, one of the algorithms of decision tree analysis. T2-DM syndrome was considered as the dependent variable, whereas 11 attributes of TCM (including 6 symptoms, 2 pulse signs, 1 tongue sign, and 2 laboratory indexes) were labeled as independent variables. "Parent Node" was set to 100 and "Child Node" to 50, which allowed the tree model to develop sufficiently. Decision tree algorithms divide data into statistically significant subgroups that are mutually exclusive and exhaustive [24]. To increase the operability of clinical application, the number of branches of the decision tree was limited to 4. As shown in Figure 1, the analysis produced a 2-level QUEST decision tree with a total of 7 nodes, of which 4 were terminal nodes. The three major predictors that reached significance and were included in this model were slippery and replete pulse, polyphagia, and weak pulse. The other 57 symptoms, 14 tongue signs, 4 pulse signs, and 12 laboratory indexes were not significant in the model. As shown in Table 3, the diagnostic model used to differentiate these two types of T2-DM among 554 cases had an overall accurate diagnostic rate of 85.2%, with a sensitivity of 78.2% and a specificity of 91.5%. The first level of the QUEST decision tree was split into two initial branches on slippery and replete pulse. Slippery and replete pulse was the best symptom for identifying SHF syndrome, and the classification accuracy of SHF syndrome was 89.1%. In contrast, 82.5% of patients without slippery and replete pulse were identified as QYD syndrome. As seen in the second level of the QUEST decision tree, weak pulse was the next best predictor variable. The classification accuracy of QYD syndrome was 99.0% for patients with weak pulse. At the same time, the accurate diagnostic rate of SHF syndrome for patients with polyphagia was 100%.

Results of CHAID Algorithm of Decision Tree Analysis: External Validation Mode for T2-DM Syndrome.
With the CHAID decision tree method, an external validation mode for SHF syndrome and QYD syndrome in 554 T2-DM patients was built from three biological parameters. As shown in Figure 2, the number of nodes in this mode was 9, and the number of terminal nodes was 5. The mode was more complex, because these 3 parameters formed 8 identification paths for SHF syndrome and QYD syndrome. Slippery and replete pulse was the best predictor of SHF syndrome and QYD syndrome. The second-level variables were polyphagia and weak pulse, and the third-level variable was polyphagia. As shown in Table 3, the 10-fold cross-validation of the external validation mode for SHF syndrome and QYD syndrome gave a sensitivity of 78.2% and a specificity of 91.5%. The percentage of correct prediction was 85.2%.

Results of the K-Nearest Neighbor Method: Identification Model on T2-DM Syndrome.
As shown in Table 3, the classification model used to differentiate the two types of T2-DM among 554 cases had an overall accurate classification rate of 88.3%, with a sensitivity of 84.7% and a specificity of 91.5%. The three major predictors that were the nearest neighbors to the two types of T2-DM in the model were slippery and replete pulse, polyphagia, and weak pulse.

Comparison of the Area under the ROC Curve of Three Classification Methods.
The sensitivity, specificity, and AUC of the three classification methods are shown in Table 3. The area under the ROC curve (AUC) measures the entire two-dimensional area under the whole ROC curve, and a larger AUC is usually better. In addition to the classification sensitivity, specificity, and AUC, the receiver operating characteristic (ROC) curve of each approach is shown in Figure 3. According to these ROC curves, logistic regression had the largest AUC of the three methods, ahead of the decision tree and the K-nearest neighbor method.
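To make the AUC definition concrete, the area can be computed from predicted scores with the trapezoidal rule. This is a minimal sketch, not the procedure used to produce Figure 3; the function names and the toy scores are hypothetical, and label 1 is assumed to mark the positive class.

```python
from typing import List, Tuple


def roc_points(scores: List[float], labels: List[int]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    The decision threshold is lowered step by step; cases that share the
    same score are processed together so ties yield a single point.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def auc(points: List[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores invented for illustration: three positives, three negatives.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_labels))  # 8 of 9 pairs ranked correctly
```

The toy value equals the fraction of positive-negative pairs in which the positive case receives the higher score, which is the usual interpretation of the AUC.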

Discussion
T2-DM is a common endocrine and metabolic disease, belonging to the category of Xiao Ke disease in TCM. "Synopsis of the Golden Chamber" written by Zhang Zhongjing in the Eastern Han Dynasty is the earliest classical book of traditional Chinese medicine. It points out that "fighting between the hard and the weak" means that the stool is hard, and frequent urination is the key point of pathogenesis. It reflects that excessive lung and stomach heat leads to excessive drinking, and swift digestion with rapid hungering is excess in superficiality, while Qi-Yin deficiency due to unfavorable gasification is deficiency in origin. Water that is taken in cannot be retained in the body; it becomes body fluid and is excreted from the kidneys. The earliest syndrome types of T2-DM can be divided into two types: SHF syndrome and QYD syndrome. SHF syndrome is excess in superficiality, while QYD syndrome is the syndrome of deficiency in origin. Therefore, it is necessary to establish an identification model of the typical syndromes of T2-DM, SHF syndrome and QYD syndrome, by using data mining methods.
Several studies have investigated different data mining methods to identify Chinese medicine syndromes of different diseases such as metabolic syndrome, coronary heart disease, IgA nephropathy, chronic hepatitis B, and acute exacerbation of chronic obstructive pulmonary disease [9,11,12,26,27]. In the present study, the clinical manifestations of the 554 cases with T2-DM syndrome were complicated, and they are the observation basis of TCM syndrome differentiation and treatment. In the clinical investigation, SHF syndrome and QYD syndrome were the two main syndrome types of T2-DM. The diagnostic models of SHF syndrome and QYD syndrome in patients with T2-DM were established and compared by using logistic regression, decision tree, and the K-nearest neighbor method. Logistic regression analysis showed that odor in the mouth, polyphagia, vulnerability to starvation, burning sensation in the stomach, slippery and replete pulse, OGTT, hemoglobin A1C, fatigue, limb weakness, weak pulse, and pink tongue were the indicators that differed most significantly between SHF syndrome and QYD syndrome (P < 0.05). However, the decision tree and the K-nearest neighbor method were more consistent, as both dealt with the relationship among only three selected symptoms: slippery and replete pulse, polyphagia, and weak pulse. As shown in Figures 1 and 2, there were obvious differences in the occurrence rates of these three symptoms between patients with SHF syndrome and those with QYD syndrome. Therefore, it was reasonable that these three symptoms, namely slippery and replete pulse, polyphagia, and weak pulse, were selected and placed in the decision tree model and the K-nearest neighbor model. However, the occurrence rates of six indicators, odor in the mouth, vulnerability to starvation, burning sensation in the stomach, fatigue, OGTT, and hemoglobin A1C, were not the nearest to the two syndromes, so they were excluded from the decision tree model.
The combination of logistic regression and decision tree has also been proved to be effective in modern medical diagnosis [28,29]. In this study, the diagnostic accuracy of the logistic regression model and the decision tree model for the two syndromes was basically consistent (logistic regression model was 87.7%, and decision tree model was 80.3%). However, 11 indicators, namely odor in the mouth, polyphagia, vulnerability to starvation, burning sensation in the stomach, slippery and replete pulse, OGTT, hemoglobin A1C, fatigue, limb weakness, weak pulse, and pink tongue, were analyzed by using the logistic regression model, while only 3 indicators, namely slippery and replete pulse, polyphagia, and weak pulse, were analyzed by using the decision tree model. In addition, the decision tree is a nonparametric method [30], and the representation of its model is easier to understand and more practical, which is also convenient for the actual operation of clinical syndrome diagnosis.
LR, DT, and KNN suggest that symptoms and pulse diagnosis are of great significance in the differentiation of type 2 diabetes. At the same time, slippery and replete pulse, polyphagia, and weak pulse in different diagnostic models have a good effect on the differential diagnosis of SHF syndrome and QYD syndrome, which is consistent with the experience of TCM syndrome differentiation.
The study also showed that the combination of laboratory examination indexes and some specific symptoms of TCM in patients has value for syndrome differentiation [31]. It also showed that QYD syndrome was associated with hemoglobin A1C [32]. Therefore, it is a new idea to study syndromes objectively through modern laboratory indicators and to understand the combination relationship between symptoms of TCM and laboratory indicators. Applying data mining methods to T2-DM syndrome diagnosis may help practitioners enhance the quality of their clinical decisions.
As shown by the results, KNN has a lower area under the curve in comparison with DT and LR. The structure of the data may lead to the difference. In addition, KNN is a nonparametric learning method, which cannot reflect the influence of each independent variable on the dependent variable. Consequently, DT is better fitted to the data. It is well known that the fundamental difference between DT models and both LR and KNN is that DT models learn step functions. Therefore, identification models of T2-DM syndrome depend on the relationship among the variables. DT shows better identification than LR when a nonlinear correlation occurs between independent and dependent variables. When there is a step function correlation between the variables, the DT model is more suitable for the classification of T2-DM syndrome. DT therefore has obvious advantages in expressing the rules of syndrome differentiation, which makes it suitable as the main technical method of TCM syndrome diagnosis.
In conclusion, a model for distinguishing SHF syndrome from QYD syndrome in patients with T2-DM was initially constructed in the study in order to provide new methods and new ideas for diagnosing and treating SHF and QYD syndromes in T2-DM patients from the perspective of traditional Chinese medicine. The results show that the reasonable combination of some laboratory indexes and TCM syndromes has a certain significance for syndrome differentiation. This suggests that the combination of multiple statistical analysis models is a feasible method to improve the objectivity of syndrome diagnosis. In the future, it is necessary to test the reliability of the model in new clinical patients and to carry out a comprehensive investigation of large clinical samples and multiple syndromes to further improve the diagnostic accuracy and stability of the model and to establish a diagnostic model containing multiple syndromes. Moreover, the relationship between qualitative and quantitative indexes and syndrome differentiation, as well as their biological basis, remains to be studied.