Classification and Progression Based on CFS-GA and C5.0 Boost Decision Tree of TCM Zheng in Chronic Hepatitis B

Chronic hepatitis B (CHB) is a serious public health problem, and Traditional Chinese Medicine (TCM) plays an important role in the control and treatment for CHB. In the treatment of TCM, zheng discrimination is the most important step. In this paper, an approach based on CFS-GA (Correlation based Feature Selection and Genetic Algorithm) and C5.0 boost decision tree is used for zheng classification and progression in the TCM treatment of CHB. The CFS-GA performs better than the typical method of CFS. By CFS-GA, the acquired attribute subset is classified by C5.0 boost decision tree for TCM zheng classification of CHB, and C5.0 decision tree outperforms two typical decision trees of NBTree and REPTree on CFS-GA, CFS, and nonselection in comparison. Based on the critical indicators from C5.0 decision tree, important lab indicators in zheng progression are obtained by the method of stepwise discriminant analysis for expressing TCM zhengs in CHB, and alterations of the important indicators are also analyzed in zheng progression. In conclusion, all the three decision trees perform better on CFS-GA than on CFS and nonselection, and C5.0 decision tree outperforms the two typical decision trees both on attribute selection and nonselection.


Introduction
With a history of 2000 to 3000 years, Traditional Chinese Medicine (TCM) has formed a unique system to diagnose and cure illness. In TCM, the treatment of illness is based primarily on the diagnosis and differentiation of syndromes [1]. For the TCM theory of "treatment based on zheng differentiation," a different "zheng" stands for a different syndrome of the disease. A syndrome enables the doctor to determine the development and the location of the disease [2]. Syndrome differentiation is the method of recognizing and diagnosing diseases or body imbalances by analyzing clinical information based on TCM theories and the doctor's experience [3]. TCM has formed a systematized methodology of diagnosis and treatment based on the rich practical knowledge and experience of Chinese people in struggling against diseases. According to the clinical information, practitioners of TCM will perform diagnosis and draw conclusions about the patient's pathological conditions using the term of syndrome (i.e., zheng in Chinese). Chronic hepatitis B (CHB) is a worldwide public health problem, and TCM shows positive significance in its treatment and control, but the current standardization of TCM zhengs for CHB has not made a breakthrough. In 1992 and 2002 [4,5], the Medicine Committee for Liver Disease of the Chinese Medicine Institute and the State Drug Administration formulated the trial viral hepatitis TCM standards and the Chinese medical clinical research guidelines for CHB, respectively, but an objective standard for the specific TCM syndromes of CHB is still lacking. Physicians cannot apply the prescriptive methodology in a professional standard until or unless they have mastered the zheng differentiation process. This limits the clinical efficacy and wide acceptance of TCM to some extent.
Therefore, exploring the nature of the TCM zhengs in CHB and the relation between symptoms or lab indicators and a zheng becomes a necessary task for the establishment of the evaluation system for CHB. Modern medical datasets are complex and composed of a great many attributes (symptoms and lab indicators), so they tend to be analyzed through statistics, data mining, and other quantitative analysis methods [6][7][8]. In the attempt to achieve an effective and objective standard of zheng diagnosis, researchers have used the data mining approach to construct classifiers for TCM datasets, and some results for zheng classification of TCM have been obtained. Xu et al. [9][10][11] employed CCD devices to acquire tongue images and analyzed the tongue color, shape, texture, moisture, and so on. Then they established a tongue diagnosis system. Li et al. [12,13] defined a five-color classification scale in complexion diagnosis based on TCM theories and developed an automatic analysis system for assistance in facial complexion-based diagnosis, which standardizes the facial complexion-based diagnosis and avoids depending on the physician's experience and environment. Some scholars [14,15] have also developed various forms of pulse analysis instruments and studied single-point and multipoint sensors for obtaining a mechanical parameter which is helpful for the quantitative analysis of the pulse signals in TCM.
In this paper, the research is based on the clinical investigation of CHB samples. According to CHB lab indicators, it aims to establish the classification approach and explore the relevant indicators in the progression of the TCM zhengs for Damp Heat in the Liver and Gallbladder, Liver Qi Stagnation and Spleen Deficiency, and Yin Deficiency of Liver and Kidney. Each case has two or three records of zheng differentiation in the dataset, and the time interval between the records of a case is four weeks. The 550 records in the dataset were differentiated separately by physicians and by three TCM experts. The physicians and experts all agreed on the zheng classification for each of these records. These records are divided into three zhengs in TCM: 306 records are the Damp Heat in the Liver and Gallbladder Zheng (A zheng), 155 records are the Liver Qi Stagnation and Spleen Deficiency Zheng (B zheng), and 89 records are the Yin Deficiency of Liver and Kidney Zheng (C zheng), and the values of "A, B, C" correspond to "1, 2, 3," respectively. The 217 cases in the CHB dataset are chosen as the studied material for the changes of zheng type in their corresponding records. Items in the dataset include 83 lab indicators, such as routine blood tests, urine and liver function indicators (including ALT, AST, GGT, AKP, TBIL, PT, APTT, albumin, and globulin, etc.), viral indicators (HBsAg, HBsAb, HBeAg, HBeAb, HBcAb, HBV-DNA, etc.), immune indicators (CD3+, CD4+, CD8+, etc.), renal function (Cr, BUN), blood glucose, and lipids (TG, TC, etc.).

Material and Methods
In the dataset, each record has 84 attributes, including 83 clinical lab indicators of CHB and 1 TCM zheng label, and the attributes are encoded by the following rules. Missing values of the 83 attributes are less than 5%.

Diagnosis, Inclusion, and Exclusion Criteria for CHB and TCM Zhengs.
The diagnosis criteria for CHB follow "Prevention and treatment programs of viral hepatitis" [16] issued by the Chinese Liver Disease Association and the Society of Infectious Diseases. They include the following: (1) cases that are hepatitis B with positive HBeAg, HBsAg, and HBV-DNA, negative HBeAb, continuous or repeated elevation in ALT, or hepatitis alterations in liver histological examination, (2) cases that are hepatitis B with negative HBeAg, positive HBsAg and HBV-DNA, continuous or repeated elevation in ALT, or hepatitis alterations in liver histological examination.
The inclusion criteria for CHB include the following: (1) the case conforms to the diagnosis criteria of chronic hepatitis B, (2) the indicators of ALT or GGT are abnormal, (3) the age ranges from 18 to 65 years old.
The exclusion criteria include the following: (1) cases of hepatitis B combined with another hepatitis virus, (2) cases of chronic severe hepatitis and cirrhosis, (3) cases of pregnant or lactating women, (4) people who cannot express their feelings clearly.
The category criteria (i.e., the inclusion criteria for TCM zheng) for the three zhengs in CHB follow "TCM syndrome differentiation standards of viral hepatitis (Trial)" [17] issued by the Internal Medicine Department Committee of Liver Disease in Traditional Chinese Medicine Association. The inclusion criteria are based on clinical features of the three zhengs, and the cases that cannot be diagnosed as one of the three TCM zhengs will be excluded.
(A) Damp Heat in the Liver and Gallbladder. The major features are (1) yellow skin and eyes, and (2) a yellow, greasy tongue coat.
The inclusion criteria for Damp Heat in the Liver and Gallbladder are as follows: (1) cases that have all the major features, (2) cases that have the major feature (1) and two minor features, (3) cases that have the major feature (2) and the minor features (1) and (2).
The inclusion criteria for Liver Qi Stagnation and Spleen Deficiency include the following: (1) cases that have all the major features, (2) cases that have the major feature (1) and the minor features (2) and (3), (3) cases that have the major feature (2) and the minor feature (1).
(C) Yin Deficiency of Liver and Kidney Zheng. The major features are (1) dizziness and dry eyes, (2) soreness of loins and soft knees, and (3) a red and dry tongue. The minor features are (1) vexing heat in the chest, palms, and soles, (2) insomnia, (3) dull pain of lateral thorax, aggravated by labor, and (4) thready and rapid pulse. The inclusion criteria for Yin Deficiency of Liver and Kidney include the following: (1) cases that have all the three major features, (2) cases that have two major features and two minor features, (3) cases that have one major feature and three minor features, (4) cases that have all the four minor features.

Methods.
The logic process of this research can be summarized as a three-part architecture, shown in Figure 1: the CFS-GA algorithm for attribute selection, the C5.0 boost decision tree for classification, and stepwise discriminant analysis for progression.

Attribute Selection of CFS-GA.
Modern medical datasets inevitably contain plenty of redundant and irrelevant attributes. Redundant and irrelevant attributes can lower the efficacy of data mining algorithms, causing uninterpretable results, that is, the Hughes phenomenon [18]. An appropriate subset of attributes can yield an accurate and interpretable result by focusing objectively on the significant attributes in zheng differentiation. Therefore, attribute selection is a very important preprocessing step in data mining and analysis methods. To overcome the Hughes phenomenon, attribute subset selection has been used for data reduction in areas characterized by high dimensionality due to the large number of available attributes [19]. The CFS-GA algorithm is employed as the attribute selection part in this architecture. CFS (Correlation-based Feature Selection) is a classical filter algorithm of attribute selection; in this algorithm, the heuristic evaluation for a single feature corresponding to each category label is used to obtain the final feature subset, and the assessment method of CFS is as follows:

$$M_s = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}. \tag{1}$$

In (1), $M_s$ is the evaluation for an attribute subset $s$ including $k$ attribute items, $\overline{r_{cf}}$ is the mean correlation degree between attributes and the category label, and $\overline{r_{ff}}$ is the mean correlation degree among attributes. The evaluation of CFS is thus a correlation-based measure on attribute subsets: a larger $\overline{r_{cf}}$ or a smaller $\overline{r_{ff}}$ in the acquired subsets produces a higher evaluation value. In CFS, the correlation degree among attributes is calculated by information gain, and the formula of information gain is shown below. $Y$ is the category attribute, $y$ is any possible value of $Y$, the entropy of $Y$ is shown in (2), and for an attribute $X$, the entropy of category attribute $Y$ under the condition of $X$ is in (3):

$$H(Y) = -\sum_{y \in Y} p(y)\log_2 p(y), \tag{2}$$

$$H(Y \mid X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y \mid x)\log_2 p(y \mid x). \tag{3}$$

The difference $H(Y) - H(Y \mid X)$ (i.e., the entropy reduction of attribute $Y$) reflects the amount of information provided by attribute $X$ to attribute $Y$, and a bigger difference means a higher correlation degree between $X$ and $Y$. Information gain is a symmetrical evaluation method; it tends to select the attributes with more values. Therefore, it is necessary to normalize information gain to [0, 1] to keep the comparison among attributes equivalent, and (4) shows the calculating formula:

$$U_{XY} = 2\left[\frac{H(Y) - H(Y \mid X)}{H(X) + H(Y)}\right]. \tag{4}$$

As a filtering algorithm, CFS evaluates the correlation between attributes and category label, and the redundancy degree among attributes [20]. Although the algorithm performs well in dimension reduction, it cannot approach a global optimum result. The Genetic Algorithm (GA) is a wrapper algorithm in dimension reduction, valued for its global search capability [21][22][23]. In this paper, CFS and GA are combined to make the CFS-GA algorithm, which uses the correlation degree in CFS as the fitness function of GA to evaluate new individuals. The design of the CFS-GA algorithm mainly includes four parts: coding scheme, selection operator, crossover operator, and mutation operator.
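The subset evaluation in (1) and the normalized information gain in (4) can be illustrated with a minimal sketch. It assumes the standard CFS merit and symmetrical uncertainty, and it assumes that every attribute has already been discretized (continuous lab indicators would need binning first). The function names and the toy columns are illustrative only and do not come from the CHB dataset.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(Y), in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(y, x):
    """H(Y|X): entropy of y inside each group of equal x, weighted by group size."""
    n = len(y)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def symmetrical_uncertainty(x, y):
    """Information gain H(Y) - H(Y|X), normalized to [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hy - conditional_entropy(y, x)) / (hx + hy)

def cfs_merit(features, label):
    """CFS merit of a subset; `features` is a list of columns, `label` the class column."""
    k = len(features)
    if k == 0:
        return 0.0
    r_cf = sum(symmetrical_uncertainty(f, label) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = (sum(symmetrical_uncertainty(features[i], features[j]) for i, j in pairs)
            / len(pairs)) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy columns (hypothetical): f1 determines the label, f2 is irrelevant, f3 duplicates f1.
label = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
f3 = [0, 0, 1, 1]
merit_single = cfs_merit([f1], label)
merit_with_noise = cfs_merit([f1, f2], label)
merit_with_duplicate = cfs_merit([f1, f3], label)
```

In this toy case, adding the irrelevant column lowers the merit, while adding the redundant copy does not raise it, which is the behaviour that makes the merit usable as a fitness function.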
In the coding scheme, each individual is encoded with classical binary code. The roulette wheel method is employed as the selection operator. For the crossover operator, single-point crossover is used to produce new individuals by swapping the parts on one side of the crossover point. Basic bit mutation is used as the mutation operator in binary encoding, flipping a bit from 0 to 1 or from 1 to 0.
In the selection of the crossover rate and mutation rate, the aim is to produce more new individuals while avoiding too much damage to the better attribute subsets; commonly, the crossover rate ranges from 0.40 to 0.99 and the mutation rate from 0.0001 to 0.1. The description of the CFS-GA algorithm is shown in Algorithm 1.
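The four design parts described above (binary coding, roulette wheel selection, single-point crossover, and bit mutation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the population size, number of generations, default rates, the elitism step, and the stand-in `toy_fitness` function are all hypothetical; in CFS-GA the fitness would be the CFS evaluation of the attribute subset encoded by each individual.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running > pick:
            return individual
    return population[-1]

def single_point_crossover(a, b, rng):
    """Swap the tails of two parents after a random crossover point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]

def bit_mutation(individual, rate, rng):
    """Flip each bit (0 to 1 or 1 to 0) independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def cfs_ga(n_attributes, fitness, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Search binary attribute masks (1 = attribute kept); needs n_attributes >= 2."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_attributes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        next_gen = [best[:]]  # assumption: keep the best individual found so far
        while len(next_gen) < pop_size:
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                c1, c2 = single_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_gen.append(bit_mutation(c1, mutation_rate, rng))
            if len(next_gen) < pop_size:
                next_gen.append(bit_mutation(c2, mutation_rate, rng))
        population = next_gen
        best = max(population + [best], key=fitness)
    return best

def toy_fitness(bits):
    """Hypothetical stand-in for the CFS merit: reward attributes 0, 3, 5; penalize the rest."""
    hits = bits[0] + bits[3] + bits[5]
    extras = sum(bits) - hits
    return hits / (1.0 + extras)

selected_mask = cfs_ga(8, toy_fitness, seed=1)
```

The default rates used here fall inside the ranges quoted in the text (0.40 to 0.99 for crossover, 0.0001 to 0.1 for mutation).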

Classification of C5.0 Boost Decision Tree.
As a classification algorithm, decision trees are often praised for the comprehensibility of their knowledge representation and inference procedures [24]. Decision trees have been applied widely in classification, prediction, rule extraction, and other areas to solve the key issues of data classification as an indispensable technology in data mining; they are particularly suitable for the complex principles, processes, and relations found in TCM. Some decision tree algorithms based on information entropy, such as ID3 [25] and C4.5 [26], are representative and broadly applied.
The fundamental idea of a decision tree is to find the decisive attributes through a top-down recursive method, use proper attribute values to determine the branches below each node, and reach conclusions in the leaf nodes of the tree. The training set is partitioned recursively until all records of each subset belong to one class, or the predominant majority of each subset belong to one class. Each path from the root to a leaf node therefore corresponds to a conjunctive rule, and the whole decision tree corresponds to a group of extracted rules. The relevant algorithm is shown in Algorithm 2.
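The recursive partitioning and the path-to-rule correspondence described above can be sketched in a few lines. This is a generic entropy-based tree with binary threshold splits on continuous attributes, written only for illustration; it is not C5.0 and has no boosting or pruning. The function names, the depth limit, and the toy records are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, attribute index, threshold) of the best binary split, or None."""
    base, best = entropy(labels), None
    for a in range(len(rows[0])):
        values = sorted(set(r[a] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if best is None or gain > best[0]:
                best = (gain, a, t)
    return best

def build_tree(rows, labels, max_depth=5):
    """Partition recursively until a subset is pure or no split helps."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels)
    if len(set(labels)) == 1 or max_depth == 0 or split is None or split[0] <= 0:
        return majority  # leaf node: the predominant class of this subset
    _, a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    return {
        "attr": a, "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        "right": build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    }

def predict(tree, row):
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["attr"]] <= tree["threshold"] else tree["right"]
    return tree

def extract_rules(tree, conditions=()):
    """Each root-to-leaf path becomes one conjunctive rule: (conditions, class)."""
    if not isinstance(tree, dict):
        return [(list(conditions), tree)]
    a, t = tree["attr"], tree["threshold"]
    return (extract_rules(tree["left"], conditions + ((a, "<=", t),))
            + extract_rules(tree["right"], conditions + ((a, ">", t),)))

# Toy records with two made-up indicators; labels use the paper's zheng codes A and C.
rows = [(1.0, 5.0), (2.0, 4.0), (3.0, 7.0), (8.0, 5.0), (9.0, 6.0), (10.0, 4.0)]
labels = ["A", "A", "A", "C", "C", "C"]
tree = build_tree(rows, labels)
rules = extract_rules(tree)
```

On this toy data a single split on the first indicator separates the two classes, so the tree yields two rules, one per leaf.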
Considering the characteristics of the CHB dataset, the classification of zheng is based on the C5.0 boost decision tree, which is fit for both continuous and nominal attributes in datasets. As the commercial version of C4.5, the C5.0 boost decision tree improves rule generation and algorithm precision to achieve more accurate rules, faster speed, and a lower error rate; it is also more suitable for the classification of large datasets [27].

TCM Zheng Progression of CHB.
The zheng progression part of this architecture is based on discriminant analysis. Discriminant analysis [28] starts from a number of samples that have already been classified and are described by clear indicators gained through observation, and it provides a discriminant function based on these indicators for classification. New samples are then classified into two types, A and B, according to the discriminant function so that the misclassification rate is lowest. Methods of discriminant analysis can be divided into the following: Fisher, maximum likelihood, Bayes formula, and gradual selection or full model discriminant analysis [28]. Based on the above methods, in this paper, we use stepwise Fisher's discriminant analysis, and (5) shows the expression.

Algorithm 1: The description of GA-CFS.
(1) Initialize the population, and generate attribute subsets randomly;
(2) Evaluate the population and calculate the Fitness value of each individual in the population;
(3) While (the optimal result is not approached or the iteration count is less than the iteration number) {
    (1) Selection operator: according to Fitness value, select the optimal individual from the parent generation to the next;
    (2) Crossover operator: according to Fitness value, select attribute subsets from the parent generation, set the crossover point for each attribute subset, then swap the structures before or after the point to produce two new individuals;
    (3) Mutation operator: through the mutation rate and mutation operator, crossover subsets are mutated at random bits to produce two new individuals;
}

Algorithm 2: The general description of decision tree.
Input: node, Dataset, partitioning method;
Output: decision tree with the node as its root node, Dataset, and partitioning method;
(1) Procedure Build Tree
(2) Initialize the root node
(3) Find the decision feature according to the partitioning method on Dataset
(4) If the node meets the conditions of partition
(5)     According to the decision feature, partition Dataset into Dataset 1 and Dataset 2, and create two subnodes of the node, namely subnode 1 and subnode 2;
(6)     Build Tree (subnode 1, Dataset 1, partitioning method);
(7)     Build Tree (subnode 2, Dataset 2, partitioning method);
(8) End if
(9) End procedure

The 22 lab indicators selected by the CFS-GA algorithm are listed in Table 1.

3.2. TCM Zheng Classification of CHB.
Accuracy is an evaluation index for classification algorithms. It is calculated as the percentage of correctly classified samples over all the samples. The C5.0 boost decision tree is used to obtain 13 critical lab indicators in Table 2, and the decision tree induction leads to 12 decisive rules in Algorithm 3 for TCM zheng classification of CHB. An accuracy of �3.�2% has been achieved in zheng classification of the 550 CHB data records.
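As a small worked example of the accuracy definition above, the sketch below computes accuracy as a percentage and applies it to a majority-class baseline built from the record counts reported for the dataset (306 A, 155 B, and 89 C records). The baseline is added here only as a reference point for reading accuracy figures; it is not a result reported by the authors, and the function name is an assumption.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Percentage of records whose predicted zheng equals the recorded zheng."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

# Class distribution of the 550 CHB records (A, B, C encoded as 1, 2, 3).
labels = [1] * 306 + [2] * 155 + [3] * 89

# Reference baseline (assumption, not from the paper): always predict the most frequent zheng.
majority = Counter(labels).most_common(1)[0][0]
baseline = accuracy(labels, [majority] * len(labels))
```

Because A zheng covers 306 of the 550 records, a classifier that always answers A is correct on 306/550 of the records, about 55.6%, so any useful classifier should exceed that figure.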

Comparison.
The CFS-GA algorithm in attribute selection is compared with CFS and nonattribute selection on three classification methods, NBTree [29], REPTree [30], and the C5.0 boost decision tree, on this CHB dataset through tenfold cross-validation. Table 3 shows a comparison of REPTree, NBTree, and the C5.0 boost decision tree on the attribute subsets of CFS, CFS-GA, and nonattribute selection. This table shows that the CFS-GA algorithm generally outperforms CFS and nonattribute selection on the three classification methods, and the combination of CFS-GA and C5.0 boost decision tree performs better than the others.
A and B zheng are the former stages, and C zheng is the latter stage. That means the progression of CHB in TCM develops from A or B zheng to C zheng. In order to study the progression among TCM zhengs, we divide the three-category dataset into two-category datasets, B versus C and A versus C, for precise discrimination between syndromes. From this approach, three relevant lab indicators in zheng progression of CHB are filtered out: HBsAg, Eosinophil, and LDL-C. These indicators reveal a relatively close relationship in Algorithm 4, and on the three indicators, 57.4% and 55.8% of the records are discriminated correctly.
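The two-category discrimination described above (B versus C, A versus C) can be illustrated with a minimal two-class Fisher discriminant. The sketch assumes the usual linear form, with weights obtained from the pooled within-class scatter matrix and a cutoff halfway between the projected class means; the stepwise selection of indicators is not implemented. All function names and the toy records are hypothetical and are not taken from the CHB dataset.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, mean):
    """Sum of outer products of deviations from the class mean."""
    p = len(mean)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                s[i][j] += d[i] * d[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular within-class scatter matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def score(w, row):
    return sum(wi * vi for wi, vi in zip(w, row))

def fisher_discriminant(rows_1, rows_2):
    """Return (weights, cutoff); class 1 scores above the cutoff, class 2 below."""
    m1, m2 = mean_vector(rows_1), mean_vector(rows_2)
    s1, s2 = scatter_matrix(rows_1, m1), scatter_matrix(rows_2, m2)
    p = len(m1)
    sw = [[s1[i][j] + s2[i][j] for j in range(p)] for i in range(p)]
    w = solve(sw, [a - b for a, b in zip(m1, m2)])
    cutoff = (score(w, m1) + score(w, m2)) / 2
    return w, cutoff

def classify(w, cutoff, row):
    return 1 if score(w, row) > cutoff else 2

# Toy two-indicator records for an earlier-stage class (1) and a later-stage class (2).
class_1 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class_2 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
w, cutoff = fisher_discriminant(class_1, class_2)
```

A stepwise variant would add or remove one indicator at a time and refit, keeping only indicators that improve the separation between the two zhengs.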

Differences of Relevant Lab Indicators in TCM Zheng Progression.
In Section 3.4, we have obtained three relevant indicators in zheng alteration in CHB. To observe the differences of the indicators in zheng alteration, we use all 190 records of the 70 cases of C zheng, including 73 records of A zheng, 40 records of B zheng, and 77 records of C zheng. The differences of LDL-C, HBsAg, and Eosinophil are shown in Table 4.

Critical Indicators from CFS-GA and C5.0 Decision Tree.
As stated earlier, redundant and irrelevant attributes of datasets lower the efficacy and performance of data mining algorithms and cause incomprehensible results, so attribute selection is a very important step in the preprocessing for sample classification.
In Table 1, 22 lab indicators related to the three TCM zhengs in CHB are filtered out through the CFS-GA algorithm. Based on these indicators, records of the CHB dataset are classified by the C5.0 boost decision tree. The decision tree induction is generalized as 12 critical lab indicators in Table 2 and the decisive rules of Algorithm 3. From Tables 1 and 2, the correlations between some of the indicators and the TCM zhengs of A, B, and C in CHB have been supported by medical research in recent years, such as TBIL, HBsAg, HBcAb-IgM, IgG, albumin, and gamma globulin.
For instance, Zhang et al. [31] detected the liver function indicators of 119 chronic hepatitis B cases and found that TBIL was higher in Damp Heat in the Liver and Gallbladder than in the zhengs of Liver Qi Stagnation and Spleen Deficiency and Yin Deficiency of Liver and Kidney (P < 0.05), and it was verified by �iang et al. [32] and Ma and Liu [33] in later research. Zhang et al. [34]  together in Liver Qi Stagnation and Spleen Deficiency. In the research of Chang et al. [37], the index of albumin in Yin Deficiency of Liver and Kidney is obviously lower than those in other TCM zhengs ( < ), with a rise in gamma globulin. This was partly verified by the research of Shi [38]. Shi [38] also discovered that the IgG index in Damp Heat in the Liver and Gallbladder is higher than that in the zhengs of Liver Qi Stagnation and Spleen Deficiency and Yin Deficiency of Liver and Kidney with a significant difference ( < ).
In Table 3, attribute selection by the CFS-GA or CFS algorithm generally performs better than nonattribute selection in the three classification algorithms. This indicates that attribute selection is an important step before classification and that it can improve the accuracy of classification algorithms. Compared to CFS, the CFS-GA algorithm produces a more proper attribute dimension reduction on the CHB dataset, so CFS-GA performs better overall than CFS between the two attribute selection methods. Attribute selection can improve the accuracy of classification methods and reduce attribute dimensions. However, a low attribute dimension of data records carries less information, and this can also influence the efficacy or accuracy of classification methods. In Table 3, although CFS reduces the attribute dimension of the dataset, CFS yields a lower classification accuracy than nonattribute selection in the C5.0 decision tree because of the low attribute dimension.
zhengs are obtained for the progression of B to C zheng and A to C zheng in Algorithm 4, and according to the related research [33,34], HBsAg has been shown to be relevant in the three TCM zhengs of CHB. From Table 4, we can see that the mean value of HBsAg has a significant difference between B zheng and C zheng (*P < 0.05), the mean value of LDL-C has a significant difference between A zheng and C zheng (**P < 0.05), and there is no significant difference of Eosinophil between B zheng and C zheng. It shows that among the three lab indicators in Section 3.4, HBsAg and LDL-C are altered, respectively, in the progressions from B to C and A to C, and the differences of HBsAg in B zheng and C zheng were verified by former research [34][35][36].

Conclusions
The present study has classified clinical lab indicator records of TCM zheng in CHB through the attribute subset from the CFS-GA algorithm and the C5.0 boost decision tree, and stepwise discriminant analysis is used for TCM zheng progression of CHB. It reveals three lab indicators in the progression of the three TCM zhengs in CHB. Among the indicators, there are alterations of HBsAg and LDL-C in the progression of TCM zheng. HBsAg in Liver Qi Stagnation and Spleen Deficiency has a significant difference from that in Yin Deficiency of Liver and Kidney, and there is a significant difference of LDL-C between Damp Heat in the Liver and Gallbladder zheng and Yin Deficiency of Liver and Kidney zheng. The proposed approach is compared with two typical decision tree algorithms on attribute subsets from CFS, CFS-GA, and nonselection, respectively. In the comparison, CFS-GA performs better than CFS and nonselection in all the three decision tree methods, and the C5.0 boost decision tree performs better than REPTree and NBTree in classification on CFS, CFS-GA, and nonselection. In future research, we will devote ourselves to optimizing the proposed approach and constructing analysis based on more sample sets.

Conflict of Interests
The authors declare that they have no conflict of interests.