Study on TCM Syndrome Identification Modes of Coronary Heart Disease Based on Data Mining

Coronary heart disease (CHD) is one of the most important types of heart disease because of its high incidence and high mortality. Traditional Chinese medicine (TCM) has played an important role in the treatment of CHD. Syndrome differentiation based on information from the four traditional diagnostic methods has been challenged and questioned with the rapid development and wide application of systems biology. In this paper, complex network analysis and the CHAID decision tree were applied to identify the core TCM syndromes of patients with CHD and to establish TCM syndrome identification modes of CHD based on biological parameters. External validation modes were also constructed to confirm the identification modes.


Introduction
Coronary heart disease (CHD) is one of the most important types of heart disease because of its high incidence and high mortality. With the improvement of living standards, the prevalence of CHD is rising and the number of young people suffering from CHD is growing. Coronary angiography is considered the "gold standard" in CHD diagnosis. In TCM, CHD is called "thoracic obstruction" and is characterized by a variety of etiological factors, various clinical manifestations, and complex syndromes [1]. Syndrome research has always been a focal and difficult topic in basic TCM studies. Syndrome differentiation based on information from the four traditional diagnostic methods has been challenged and questioned with the rapid development and wide application of systems biology. A "gold standard" for syndrome diagnosis has not yet been found. The large volume, complexity, and multilevel relationships of the information from the four diagnostic methods have constrained the accuracy of syndrome differentiation. Currently, the application of quantitative modes and data mining is developing rapidly [2,3], and these technologies provide approaches and methods for TCM syndrome differentiation. Our earlier study showed that the characteristics of information from the four diagnostic methods were in line with those of complex networks: the items not only share particular properties arising from their own evolutionary mechanisms but are also closely linked in their nature and structural features. We also found that biological parameters could be regarded as a reflection of pathological and physiological mechanisms and thus might also reflect TCM syndromes [4]. Moreover, we have established a mode built from four biological parameters that could distinguish CHD patients with blood stasis syndrome from those without it by means of the C5 decision tree [5].
This study indicated that core TCM syndromes could be identified by complex networks and that biological parameters could serve as syndrome identification modes in CHD patients by means of decision trees. four diagnostic methods above by means of distance-based mutual information model (DMIM) [10]. Then, we established 120 association relationships among 69 symptoms from the four diagnostic methods. The association data were consolidated into an adjacency matrix and then converted into the format required by Pajek software.
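The conversion from an adjacency matrix to a Pajek input file can be sketched as follows. This is a minimal illustration, assuming an undirected, unweighted network written in Pajek's plain-text .net layout (a `*Vertices` section followed by an `*Edges` section). The function name and the toy data are hypothetical and are not taken from the study.

```python
def to_pajek_net(labels, adjacency):
    """Serialize a symmetric 0/1 adjacency matrix as Pajek .net text."""
    n = len(labels)
    lines = [f"*Vertices {n}"]
    # Pajek numbers vertices from 1 and expects quoted labels.
    for i, label in enumerate(labels, start=1):
        lines.append(f'{i} "{label}"')
    lines.append("*Edges")
    # Undirected network: emit each pair once (upper triangle only).
    for i in range(n):
        for j in range(i + 1, n):
            if adjacency[i][j]:
                lines.append(f"{i + 1} {j + 1}")
    return "\n".join(lines) + "\n"


# Hypothetical toy data: three symptoms and two associations.
labels = ["short breath", "faint low voice", "weak pulse"]
adjacency = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
net_text = to_pajek_net(labels, adjacency)
print(net_text)
```

Writing `net_text` to a file with a `.net` extension would give a file of the kind Pajek reads; weighted associations would need a third column on each edge line.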

Measurement of Network Properties and Complex Network Mapping.
Pajek software 2.0 was used to analyze the node degrees and node core values of the four diagnostic information network. With the command "Layout-Energy-Kamada-Kawai-Separate Components," we drew the K-core network figures, distinguishing nodes by color and degree, and adjusted the node positions manually. Nodes and edges of the network could not be deleted. Then, we exported the network figures in Bitmap format.

Construction of Identification Modes and Validation.
Data standardization was applied to the case information from the different hospitals. Next, we established two identification modes of CHD core syndromes using the chi-square automatic interaction detection (CHAID) decision tree. "Qi deficiency" and "phlegm-blood stasis" were the dependent variables, and 90 biological parameters were the independent variables. We set "Parent Node" to 50 and "Child Node" to 25, allowing the tree model to grow sufficiently. 10-fold cross-validation was used to minimize the bias produced by random sampling of the training and test data.
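The 10-fold cross-validation scheme can be illustrated with a short sketch. It only shows how the 411 cases might be partitioned into training and test folds; the tree fitting itself is omitted, and the function name, the shuffling seed, and the interleaved fold assignment are assumptions made for illustration rather than the procedure of the software used in the study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold f takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx


# 411 cases, as in the identification modes described above.
splits = list(k_fold_indices(411, k=10))
for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold, len(train_idx), len(test_idx))
```

Each case appears in exactly one test fold, so every prediction used to compute the validation statistics comes from a tree that did not see that case during training.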

Construction of External Validation Modes.
212 patients were selected from the 411 cases of CHD to establish new decision tree modes for external validation. Similarly, "qi deficiency" and "phlegm-blood stasis" were the dependent variables. The 8 and 6 biological parameters obtained from the analysis above served as the independent variables. Because of the reduced number of independent variables, we set "Parent Node" to 2 and "Child Node" to 1 to allow the tree model to grow as fully as possible. 10-fold cross-validation was also used for validation in this section.

Results of Four Diagnostic Information Network Properties.
The network properties of the four diagnostic information showed that the degree values of the 69 nodes ranged from one to eleven. The degree values of subglossal collateral vessels engorgement, amnesia, faint low voice, white phlegm, heavy limbs, short breath, cough, anorexia, tastelessness in the mouth, and swollen tongue body were greater than six, and these nodes indicated the core syndromes of CHD. The core deficiency syndrome was qi deficiency, and the core excessive syndrome was phlegm-blood stasis. From the network core analysis, we found that 31 four diagnostic information nodes had a core value of three. Together these nodes formed a 3-core network (Table 1). The degree value of a node was defined as the number of other nodes connected to it. In a complex network, the greater the degree value of a node, the more significant its role. In qi deficiency syndrome, short breath, hypodynamia, faint low voice, weak pulse, and tastelessness in the mouth had the maximum degree values. Swollen tongue body, white phlegm, and cough played major roles in phlegm turbid syndrome. Subglossal collateral vessels engorgement, with a degree value of eleven, was the key node in blood stasis syndrome. In this network, the importance of a node can be determined by combining its degree value and its core value.
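The degree value used above is simply the number of other nodes linked to a node, which can be sketched in a few lines. The symptom labels below are borrowed from the text, but the links between them are hypothetical toy data, and the function name is an assumption made for illustration.

```python
def node_degrees(edges):
    """Count, for every node, how many distinct other nodes it is linked to."""
    neighbors = {}
    for a, b in edges:
        if a == b:
            continue  # a self-loop does not connect to another node
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {node: len(adj) for node, adj in neighbors.items()}


# Hypothetical toy edges, not the 120 associations of the study.
edges = [
    ("short breath", "faint low voice"),
    ("short breath", "weak pulse"),
    ("short breath", "hypodynamia"),
    ("faint low voice", "weak pulse"),
]
degrees = node_degrees(edges)
# Rank nodes from highest to lowest degree (ties broken alphabetically).
ranked = sorted(degrees.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

Ranking the nodes in this way mirrors how the nodes with the largest degree values were singled out as indicators of the core syndromes.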

Results of Identification Mode for Qi Deficiency Syndrome.
Using the CHAID decision tree, an identification mode of qi deficiency was built with eight biological parameters: urine crystal (X TAL), erythrocyte distribution width-CV (RDW-CV), potassium ion (K), thyroid stimulating hormone (TSH), monocyte (MONO), high-sensitivity C-reactive protein (hs-CRP), low-density lipoprotein (LDL), and A peak in echocardiography. This mode had 19 nodes, 11 of which were terminal nodes, and a tree depth of 3. X TAL was the best predictor of qi deficiency syndrome among the 8 parameters. However, X TAL alone could not completely distinguish qi deficiency from non-qi deficiency, so the mode fell back on the second-level variables, RDW-CV and MONO. The third-level variables were K, TSH, hs-CRP, LDL, and A peak (Figure 3).

Results of Identification Mode for Phlegm-Blood Stasis Syndrome.
The identification mode of phlegm-blood stasis syndrome was made up of six parameters: high-sensitivity C-reactive protein (hs-CRP), total bilirubin (TBIL), glutamyltranspeptidase (GGT), platelet (PLT), fasting blood glucose (FBG), and P-R interval. The depth of this mode was 3, with 14 nodes and 8 terminal nodes. The 6 parameters formed eight identification paths for phlegm-blood stasis syndrome. The best identification variable of the mode was hs-CRP, which was the only effective variable for identifying phlegm-blood stasis syndrome when its value was between 0.07143 and 0.10714. The second-level variables were TBIL and FBG. The third-level variables were GGT, PLT, and P-R interval (Figure 4).

Results of Validation for 411 Patients.
The 10-fold cross-validation showed that the qi deficiency syndrome mode predicted 302 cases correctly and misclassified the other 109. The sensitivity and specificity of this mode were 70.2% and 77.4%, and the percentage of correct prediction was 73.5%. The phlegm-blood stasis syndrome mode predicted 328 cases correctly and misclassified the other 83. The sensitivity and specificity of this mode were 72.5% and 81.3%, and the percentage of correct prediction was 79.8% (Table 2).

An external validation mode of qi deficiency for 211 CHD patients was made up of six biological parameters. Unfortunately, this mode lacked the parameters hs-CRP and RDW-CV, although we had allowed the tree model to grow as fully as possible. The mode had 18 nodes, 10 of which were terminal nodes. MONO was the best predictor of qi deficiency syndrome (Figure 5).

Results of External Validation Mode for Phlegm-Blood Stasis Syndrome.
The external validation mode of phlegm-blood stasis syndrome included the same six parameters as the identification mode above. There were 23 nodes and 14 terminal nodes in this mode. The mode was much more complex, as the 6 parameters formed 12 identification paths for phlegm-blood stasis syndrome. The best identification variable of the mode was still hs-CRP. The second-level variable was the P-R interval, and the third-level variables were the remaining four parameters (Figure 6).

Results of Validation for 212 Patients.
The 10-fold cross-validation showed that the external validation mode for qi deficiency syndrome had a sensitivity of 69.8% and a specificity of 73.3%, with 71.7% of cases predicted correctly. The external validation mode for phlegm-blood stasis syndrome had a sensitivity of 86.8% and a specificity of 75.9%, with 77.8% of cases predicted correctly (Table 3).
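The validation measures reported for the modes follow the usual confusion-matrix definitions, which can be sketched as below. The cell counts passed to the function are hypothetical, since the individual TP, FP, TN, and FN counts are not given in the text; only the totals of correctly predicted cases (302 and 328 of 411) come from the results above.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true syndrome cases detected
    specificity = tn / (tn + fp)  # share of non-syndrome cases correctly excluded
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts, chosen only to exercise the formulas.
sens, spec, acc = diagnostic_metrics(tp=40, fp=10, tn=30, fn=20)

# Accuracy recomputed from the totals reported for the 411 patients.
acc_qi = 302 / 411      # qi deficiency mode, about 73.5%
acc_phlegm = 328 / 411  # phlegm-blood stasis mode, about 79.8%
print(round(100 * acc_qi, 1), round(100 * acc_phlegm, 1))
```

The recomputed accuracies agree with the percentages of correct prediction reported for the two identification modes.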

Note: sensitivity = TP/(TP + FN); specificity = TN/(TN + FP); accuracy = (TP + TN)/(TP + FP + TN + FN).

Discussion
Data mining is a method of extracting previously unknown but potentially useful information from a database. It relies on computer programs that automatically scan the database and try to find modes or rules [11]. Complex networks can be used to describe social relations among persons, kinships, network connections among computers, semantic relations among words, cooperation between scientists, and so forth [12][13][14]. Since Watts and Strogatz proposed the small-world network concept in 1998 [15] and Barabási and Albert published their pioneering study on scaling in random networks [16], more and more researchers have used complex networks in the medical field, for example, in research on brain functional connectivity [17], the propagation of diseases [18], drug efficacy and drug targets [19], gene regulatory networks [20], and protein interactions [21]. Traditional approaches could not reveal the meaning of the four diagnostic information because its contents are numerous and the combination rules and relationships among the items are complex. TCM is a traditional medicine that captures the variations of disease based on the concept of holism. Studies have shown that disease symptom networks have the characteristics of TCM syndrome classification [22]. In complex networks, the classification features, the demands of each role in the network organization, and the relations of the elements in the process of organization constitute the potential force of the network [23]. The process of clinical diagnosis and treatment in TCM is also very similar to a complex system. In the analysis of relationships among syndromes, therapeutic methods, and Chinese herbal medicines, the main syndromes and monarch drugs are similar to the hubs of the network, and the therapeutic methods and therapeutic principles are an abstract summary of the complex relations [24]. As one of the data mining methods, complex networks provide new methods and ideas for the study of TCM and explain the integrity, nonlinearity, and dynamic character of TCM from another point of view.
Pajek is a software package that analyzes data quickly and effectively and serves as a simulation tool for complex networks. Unlike common network analysis software, Pajek can handle large-scale networks containing millions of nodes, overcoming the bottleneck that much network analysis software can process only small-scale data. It usually extracts small-scale networks from large-scale ones so that they can be studied in more detail with classical algorithms, and it displays the results through its powerful visualization capability [25,26]. In many complex networks, although the number of nodes is very large, the number of "core" nodes is still very small. Intuitively, the "core" refers to the nodes that play important roles in the complex network. If every node in a network has at least k neighbors that are also in this network, the network is called a "k-core network." Studying the core of a complex network means identifying the entire "k-core network" within it.
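The idea of a k-core can be made concrete with a small peeling procedure: nodes with too few remaining neighbors are removed repeatedly, and the level at which a node is removed is its core value. The sketch below is a plain-Python illustration on a hypothetical six-node network; it is not the routine implemented in Pajek, and the function name and node labels are assumptions.

```python
def core_numbers(adjacency):
    """Return the core value of each node by repeatedly peeling low-degree nodes.

    `adjacency` maps each node to the set of its neighbors and must be symmetric.
    """
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    core = {}
    k = 0
    while remaining:
        # Nodes with at most k remaining neighbors cannot belong to the (k+1)-core.
        peel = [n for n, nbrs in remaining.items() if len(nbrs) <= k]
        if not peel:
            k += 1
            continue
        for n in peel:
            core[n] = k
            for m in remaining.pop(n):
                remaining[m].discard(n)
    return core


# Hypothetical network: A-D are fully connected, E hangs on A and B, F on E.
adjacency = {
    "A": {"B", "C", "D", "E"},
    "B": {"A", "C", "D", "E"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C"},
    "E": {"A", "B", "F"},
    "F": {"E"},
}
core = core_numbers(adjacency)
three_core = sorted(n for n, c in core.items() if c >= 3)
print(core, three_core)
```

In this toy network only the four mutually connected nodes survive at k = 3, which is the same sense in which 31 of the 69 symptom nodes formed the 3-core network in this study.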
In our study, a complex network was employed to identify the core TCM syndromes of CHD patients. The core syndromes were qi deficiency, which belongs to the deficiency syndromes, and phlegm-blood stasis, which belongs to the excessive syndromes. There are two reasons for this conclusion. First, we obtained a 3-core network with 31 nodes at its center, and these nodes played a major role. Among them, 13 nodes reflected qi deficiency syndrome and 10 nodes reflected phlegm-blood stasis syndrome. The 8 remaining nodes represented yin deficiency, yang deficiency, qi stagnation, and heat syndromes. Second, the degree value is a simple but most important property of a complex network. The degree value of a node is defined as the number of other nodes connected to it. In a complex network, the greater the degree value of a node, the more significant the role it plays. In the network of four diagnostic information, the degree values of subglossal collateral vessels engorgement, amnesia, faint low voice, white phlegm, heavy limbs, short breath, cough, anorexia, tastelessness in the mouth, and swollen tongue body were higher than 6, and most of these nodes reflected the core syndromes mentioned above. The accurate identification of these core syndromes laid the foundation for constructing syndrome identification modes from biological parameters.
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is a way to display an algorithm. Decision trees are usually applied to cost-benefit studies, especially in decision-making analysis, to help identify the strategy most likely to reach a goal [27]. In many fields of clinical medicine, decision trees have been used successfully to solve complex and chaotic problems without mathematical models or a precise understanding of the mechanisms involved, such as genetic and molecular sequence analysis [28], hospital information system mining [29], and health care [30].
The chi-squared automatic interaction detector (CHAID) decision tree is a method put forward by Kass in 1980 for the analysis of categorical data [31]. It provides target selection, variable selection, and clustering. Its core idea is to split the cases optimally according to the response variable and the screened explanatory variables and to determine the grouping of multiway contingency tables automatically on the basis of the significance of chi-square tests. The classification process of the CHAID algorithm is as follows. First, the categorical response variable is selected and cross-classified with each explanatory variable, which yields a series of two-dimensional classification tables. The χ² value of each two-dimensional table is calculated and the P values are compared. The table with the minimum P value is taken as the best initial two-dimensional classification table. Explanatory variables then continue to be used to classify the response variable on the basis of the best two-dimensional classification table. The process is repeated until the P value is greater than the α value, at which point classification stops and the mode is formed [32]. Our previous results showed that the CHAID decision tree can analyze large and dormant data from clinical information, owing to the nonlinear relationships and interactions between blood stasis syndrome and biological parameters.
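The split-selection step described above can be sketched for the simplest case, in which every explanatory variable has been dichotomized so that each cross-classification is a 2x2 table with one degree of freedom. This is a simplified illustration, not a full CHAID implementation: the complete algorithm also merges categories and adjusts the P values, both of which are omitted here. The counts, the α value of 0.05, and the function names are assumptions made for illustration.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and P value (df = 1) for a 2x2 table.

    Assumes every row and column total is nonzero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # With one degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def best_split(tables, alpha=0.05):
    """Pick the variable with the smallest P value; return None if it is not below alpha."""
    scored = {name: chi_square_2x2(t)[1] for name, t in tables.items()}
    name = min(scored, key=scored.get)
    return (name, scored[name]) if scored[name] < alpha else (None, scored[name])


# Hypothetical counts: rows = parameter low/high, columns = syndrome absent/present.
tables = {
    "hs-CRP": [[40, 10], [15, 35]],
    "PLT": [[26, 24], [24, 26]],
}
chosen, p = best_split(tables)
print(chosen, p)
```

Applying `best_split` again to the cases inside each resulting child node, until no variable reaches the α threshold, corresponds to the repeated classification that forms the tree.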
Syndrome studies cannot be completed without the methods of modern medicine. Because of the complexity of syndromes, it is hard to find a "golden index" for syndrome identification. However, combinations of different biological parameters may demonstrate the characteristics of different syndromes. Data mining methods have addressed the problems mentioned above, making it possible to combine macroinformation and microinformation effectively. Using the CHAID decision tree, we established an identification mode for qi deficiency syndrome with eight biological parameters and another identification mode for phlegm-blood stasis syndrome with six biological parameters. Patients with or without qi deficiency could be identified by 11 paths, and phlegm-blood stasis syndrome could be identified by 8 paths.
Studies showed that hs-CRP was significantly increased in CHD patients and had a moderate predictive value for CHD. It correlated with phlegm-blood stasis syndrome and provided an objective basis for phlegm-blood syndrome differentiation [33]. Meanwhile, a significant positive correlation was observed between hs-CRP and qi deficiency syndrome [34]. TBIL has traditionally been viewed as a harmful metabolite in the body. In recent years, domestic and foreign research has shown that TBIL, as a physiological antioxidant, plays a role in antiarteriosclerosis. A low serum TBIL level is an independent risk factor for CHD [35]. The serum GGT value may be an index of oxidative stress in vivo. Elevated GGT can predict myocardial infarction and stroke and reflects cell damage caused by oxygen free radicals [36]. When activated platelets adhere to the vessel wall, platelet dusts (endothelial granules) are released, a process closely related to the occurrence of CHD [37]. A study on the relationship between CHD and FBG showed that CHD patients were more likely to have FBG abnormalities [38]. A correlation study on TCM syndromes and ventricular diastolic function showed that the E peak decreased and the A peak increased significantly in qi deficiency patients, suggesting dysfunction of early cardiac filling [39]. Some scholars believe that an elevated RDW suggests underlying inflammation in the body. Inflammation is one of the most important mechanisms of atherosclerosis, and an increase in RDW may be a predictor of CHD severity [40,41]. The physiological functions of CHD patients with qi deficiency syndrome are weakened. When the promoting effect of qi is weakened, the growth and development of the body are impaired, and the physiological functions of the meridians and viscera decline prematurely. A study showed that in CHD patients with qi deficiency syndrome, the ability of thyroxine (TH) to regulate the pituitary by feedback was decreased. Correlations were found between TSH and CHD with qi deficiency syndrome [42]. Compared with healthy people, CHD patients often have an increased mononuclear cell count. MONO may be involved in the pathogenesis of CHD, and an increase in MONO may indicate an earlier onset of CHD, especially in middle-aged people [43]. LDL reflects the severity of coronary artery lesions, and its level increases as coronary lesions worsen. Considering the prevention and treatment of CHD and the physiological need for LDL, some scholars have proposed a proper LDL level of 1.3-1.8 mmol/L [44].
In summary, the application of the CHAID decision tree may provide a broader basis of biological indicators for TCM syndrome differentiation, which may also pave the way for further research on TCM syndromes.
Evidence-Based Complementary and Alternative Medicine

Conclusion
Complex networks contributed substantially to the identification of the core TCM syndromes of CHD patients. We found that qi deficiency syndrome and phlegm-blood stasis syndrome were the basic syndromes of CHD patients in our study. Moreover, we established syndrome identification modes for CHD patients with or without the core syndromes using the CHAID decision tree. The qi deficiency identification mode included eight biological parameters: X TAL, RDW-CV, K, TSH, MONO, hs-CRP, LDL, and A peak. Its accuracy was 73.5%, its sensitivity was 70.2%, and its specificity was 77.4%. The identification mode of phlegm-blood stasis syndrome included six biological parameters: hs-CRP, TBIL, GGT, PLT, FBG, and P-R interval. Its accuracy was 79.8%, its sensitivity was 72.5%, and its specificity was 81.3%. The construction of the two external validation modes further improved the reliability of the identification modes.