Intelligent ZHENG Classification of Hypertension Depending on ML-kNN and Information Fusion

Hypertension is one of the major causes of cardio- and cerebrovascular diseases. With a good accumulation of hypertension clinical data on hand, research on ZHENG differentiation for hypertension is an important and attractive topic, as Traditional Chinese Medicine (TCM) lies primarily in "treatment based on ZHENG differentiation." From the perspective of data mining, ZHENG differentiation can be modeled as a classification problem. In this paper, ML-kNN, a multilabel learning model, is used as the classification model for hypertension. Feature-level information fusion is also applied to make fuller use of all available information. Experimental results show that ML-kNN can model ZHENG differentiation for hypertension well and that information fusion helps improve model performance.


Introduction
Hypertension is one of the major causes of cardio- and cerebrovascular diseases. Worldwide, 25%-35% of adults have hypertension; there are over 972 million hypertension patients, of which 60%-70% are over 70 years old [1,2]. With the fast development of electronic medical record (EMR) systems, a good accumulation of clinical cases about hypertension now exists. As the diagnostic knowledge and herb formulas of Traditional Chinese Medicine (TCM) are mostly distilled from clinical practice, research on these clinical cases may help promote the understanding of TCM theory, advance the development of diagnostic technology, and also contribute to the objectification and modernization of TCM.
ZHENG, also translated as syndrome, means in TCM a characteristic profile of all clinical manifestations that can be identified by a TCM practitioner. TCM lies primarily in "treatment based on ZHENG differentiation" [3]; only after successful differentiation of ZHENG can effective TCM treatment be possible [4]. Traditionally, techniques of ZHENG differentiation are learned only by the successors of a particular TCM practitioner, and the learning outcome is confined by those successors' personal talents. With the unprecedented growth of clinical data, this approach is no longer adequate, as it makes it difficult to discover new knowledge from such a mountain of data. Data mining is a well-established technology for uncovering this underlying information. Many research works have been dedicated to TCM data mining [5][6][7], all of which indicate a promising future for automatic differentiation of ZHENG in TCM.
In the field of data mining, differentiation of ZHENG is modeled as a classification problem. In traditional classification methods, every instance has one and only one label. However, a TCM diagnostic result usually consists of several ZHENG; in other words, one patient can have more than one ZHENG. Such data are technically called multilabel data, the learning of which has recently become a hot topic in the fields of data mining and machine learning. International workshops on multilabel learning have been held in each of the recent three years to promote the development of this topic [8,9]. Multilabel learning has been applied to TCM by Liu et al. [7], who compared the performance of ML-kNN and kNN on a coronary heart disease dataset, and by Li [11], who worked to improve multilabel classification performance on a coronary heart disease dataset.
One characteristic of TCM ZHENG differentiation is the "fusion use of the four classical diagnostic methods": inspection, auscultation and olfaction, inquiry, and palpation. How to use information from these four diagnostic methods to achieve better ZHENG differentiation is an important research area in TCM. Some theories of TCM diagnosis even claim that only by using information from all four classical diagnostic methods can ZHENG be differentiated correctly [4], and "fusion use of the four classical diagnostic methods" is treated as an important direction in the computerization of TCM diagnosis [12]. In the field of data mining, this is called information fusion. Therefore, fusion of information from different sources should be considered seriously when building ZHENG classification models with multilabel learning techniques. To our knowledge, no researchers have yet tried to bring techniques of information fusion into the field of multilabel learning. Wang et al. have done some work on TCM information fusion using traditional single-label methods, focusing mainly on data acquisition and medical analysis of experimental results [12,13]. As described above, however, multilabel learning should be more appropriate for ZHENG classification, so more attention should be paid to information fusion for multilabel learning.
In this paper, we try to build TCM ZHENG classification models on hypertension data using multilabel learning and information fusion. The rest of the paper is arranged as follows: Section 2 describes materials and methods, including the data source, data preprocessing, feature-level information fusion, and ML-kNN; experimental results and discussions are presented in Section 3; finally, Section 4 draws conclusions.


Data Source.
The data used in this work come from the LEVIS Hypertension TCM Database, which contains 775 clinical cases. In total, 148 features, including 143 TCM symptoms from inspection, auscultation and olfaction, inquiry, and palpation, and 5 common indexes (gender, age, hypertension duration, SBPmax, and DBPmax), are investigated and collected in this database. It also stores the 13 labels (TCM ZHENG) of each case. Academic and noncommercial users may access it at http://levis.tongji.edu.cn/datasets/index_en.jsp.

Data Preprocessing.
Our research target is to compare the performance of a multilabel classification model on datasets containing information from a single diagnostic method (called single-diagnosis datasets below) with its performance on a dataset fusing information from all diagnostic methods (called the fusional-diagnosis dataset). Accordingly, guided by TCM theory and the characteristics of the LEVIS Hypertension TCM Database, five single-diagnosis datasets are retrieved from the database. The information contained in each dataset is shown in Tables 1, 2, 3, 4, and 5; it comes, respectively, from inspection diagnosis, tongue diagnosis, inquiry diagnosis, palpation diagnosis, and other diagnoses. Analyzing the 775 cases, 4 cases are found to have an empty value in one of the features listed in the five tables. These 4 cases are therefore removed from all five single-diagnosis datasets to ensure smooth progress of the subsequent tasks: information fusion and classification model building.
In the above datasets, we find that some labels appear rarely, which severely hurts the performance of classification methods. We therefore randomly subsample the data in this work. Firstly, labels are selected to decrease the degree of imbalance; here we choose labels 6, 10, and 12, as they have the largest numbers of positive cases and a multilabel method should predict at least 3 labels simultaneously. Secondly, the cases marked negative on all the selected labels form the pending removable set, so that all positive cases of every selected label are preserved. Finally, some cases are randomly removed from the pending removable set to decrease imbalance. Here, 500 cases are put into the pending removable set, and each time 100 cases are removed from it, the remaining cases forming one dataset. In this way we obtain five datasets, and the performance of our model is evaluated as the average over all five. The final datasets may be downloaded from http://levis.tongji.edu.cn/datasets/htn-ecam.zip.
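The three subsampling steps above can be sketched as follows. This is a minimal illustration in NumPy on a synthetic 771 x 13 case/label matrix; the label frequencies and random seed are assumptions for demonstration, not the real dataset:

```python
import numpy as np

# Synthetic stand-in for the 771 x 13 case/label matrix (illustrative only).
rng = np.random.default_rng(42)
Y = (rng.random((771, 13)) < 0.15).astype(int)

# Step 1: keep the 3 labels with the most positive cases.
keep = np.argsort(Y.sum(axis=0))[::-1][:3]

# Step 2: cases negative on ALL kept labels form the pending removable set,
# so every positive case of every kept label is preserved.
removable = np.where(Y[:, keep].sum(axis=1) == 0)[0]

# Step 3: randomly drop 100 removable cases to reduce imbalance.
drop = rng.choice(removable, size=100, replace=False)
mask = np.ones(len(Y), dtype=bool)
mask[drop] = False
Y_sub = Y[mask][:, keep]

# All positives survive; only all-negative cases can be removed.
print(Y_sub.shape, int(Y[:, keep].sum()) == int(Y_sub.sum()))
```

Repeating step 3 with different random draws yields the five datasets whose average performance is reported.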

Feature-Level Information Fusion.
In this work, we only discuss information fusion on the feature level [14,15]. Let A = {a_1, a_2, ..., a_n}, B = {b_1, b_2, ..., b_m}, C, D, E denote, respectively, the 5 feature vectors with different dimensions illustrated in Tables 1-5. The target is to combine these five feature sets to yield a new feature vector, Z, which better represents the individual and helps build a better classification model [14]. Specifically, information fusion is accomplished by simply augmenting the features obtained from the multiple diagnostic methods: Z is generated by appending B, C, D, and E to A one after the other, giving

Z = {a_1, ..., a_n, b_1, ..., b_m, ..., e_1, ..., e_l}.
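As a minimal sketch of this augmentation, the fused vector Z is just a per-case concatenation of the five feature matrices. The matrices below are synthetic and their dimensions are illustrative, not those of the real tables:

```python
import numpy as np

# Synthetic binary feature matrices for the five diagnostic sources;
# rows are cases, columns are features (dimensions are illustrative).
rng = np.random.default_rng(0)
n_cases = 4
A = rng.integers(0, 2, size=(n_cases, 3))  # inspection (n = 3)
B = rng.integers(0, 2, size=(n_cases, 5))  # tongue (m = 5)
C = rng.integers(0, 2, size=(n_cases, 4))  # inquiry
D = rng.integers(0, 2, size=(n_cases, 2))  # palpation
E = rng.integers(0, 2, size=(n_cases, 6))  # other diagnoses (l = 6)

# Feature-level fusion: append B, C, D, and E to A, case by case.
Z = np.concatenate([A, B, C, D, E], axis=1)
print(Z.shape)  # (4, 20): each fused vector has n + m + ... + l features
```

Any multilabel classifier can then be trained on Z exactly as on a single-diagnosis matrix.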

Multilabel Learning: ML-kNN.
As illustrated in Section 1, a multilabel learning model is believed to be a more suitable classification model for TCM clinical data. Specifically, in this study we modeled the relationship between symptoms and ZHENG by means of the multilabel k-nearest neighbor (ML-kNN) algorithm [16]. ML-kNN is a lazy multilabel learning algorithm developed on the basis of the kNN algorithm, which regards an instance as a point in feature space. The idea of kNN is to search for the k training instances nearest to a test instance and then predict the test instance's label according to those neighbors' labels. Compared with other algorithms, the advantages of kNN lie in its simple training process, good efficiency, and competitive performance. Like kNN, ML-kNN also finds the k nearest instances of each test instance; but rather than judging labels directly from the neighbors, ML-kNN uses the maximum a posteriori (MAP) principle to determine the label set based on statistical information derived from the label sets of the neighboring instances. The concrete steps are as follows [7]:

(1) calculate the prior and conditional probabilities associated with each label from the training instances;

(2) for each test instance x_i, compute its distances to the training instances and find its k nearest neighbors;

(3) according to the labels of the k neighbors and the probabilities associated with each label, estimate the posterior probability of each label for x_i and obtain the predicted label set (a threshold of 0.5 is taken here);

(4) evaluate the predictions according to multilabel evaluation criteria.


Experimental Results and Discussions.

Table 6 summarizes the experimental results on the five single-diagnosis datasets and the one fusional-diagnosis dataset. All seven evaluation criteria are configured so that bigger is better; for negative values, closer to zero is better. From Table 6, we can find the following.

(1) Among the 5 models built on single-diagnosis datasets, the model built on the inspection-diagnosis dataset performs best on all evaluation criteria, which suggests that inspection may be the most informative single diagnostic method for differentiating ZHENG in hypertension.

(2) On all evaluation criteria, the performance of the fusional-diagnosis model is the best, which strongly supports the TCM theory that "fusion use of the four classical diagnostic methods" is essential and helps improve the accuracy of ZHENG differentiation.
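To make steps (1)-(3) of ML-kNN concrete, here is a compact from-scratch sketch in NumPy. It assumes Euclidean distance, Laplace smoothing s = 1, and toy two-group data; it illustrates the MAP rule of [16] but is not the authors' implementation:

```python
import numpy as np

def ml_knn_fit(X, Y, k=2, s=1.0):
    """Step (1): estimate priors P(H1_l) and likelihoods P(C=j | H_l)
    from training features X (N, d) and binary labels Y (N, L)."""
    N, L = Y.shape
    prior1 = (s + Y.sum(axis=0)) / (2 * s + N)  # label present
    prior0 = 1.0 - prior1                       # label absent
    # Neighbor label counts for each training instance (excluding itself).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    counts = Y[np.argsort(dist, axis=1)[:, :k]].sum(axis=1)  # (N, L)
    c1 = np.zeros((k + 1, L)); c0 = np.zeros((k + 1, L))
    for j in range(k + 1):
        c1[j] = ((counts == j) & (Y == 1)).sum(axis=0)
        c0[j] = ((counts == j) & (Y == 0)).sum(axis=0)
    like1 = (s + c1) / (s * (k + 1) + c1.sum(axis=0))  # P(C=j | H1_l)
    like0 = (s + c0) / (s * (k + 1) + c0.sum(axis=0))  # P(C=j | H0_l)
    return prior1, prior0, like1, like0

def ml_knn_predict(Xtr, Ytr, Xte, model, k=2):
    """Steps (2)-(3): find the k nearest training instances of each test
    instance and apply the MAP rule with a 0.5 threshold."""
    prior1, prior0, like1, like0 = model
    L = Ytr.shape[1]
    preds = []
    for x in Xte:
        nbrs = np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]
        c = Ytr[nbrs].sum(axis=0)  # neighbor count per label
        p1 = prior1 * like1[c, np.arange(L)]
        p0 = prior0 * like0[c, np.arange(L)]
        preds.append((p1 / (p1 + p0) >= 0.5).astype(int))
    return np.array(preds)

# Toy data: two well-separated groups, each carrying its own label.
Xtr = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
Ytr = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
model = ml_knn_fit(Xtr, Ytr, k=2)
pred = ml_knn_predict(Xtr, Ytr, np.array([[0.2, 0.2], [5.5, 5.5]]), model, k=2)
print(pred.tolist())  # [[1, 0], [0, 1]]
```

Step (4) then scores such predicted label sets against the true ones with multilabel evaluation criteria.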

Conclusions
In this paper, we attempted to use a feature-level information fusion technique and the ML-kNN algorithm to improve the performance of intelligent ZHENG classification, a tough but essential task in TCM. Instead of traditional single-label learning methods, in accordance with the characteristics of TCM clinical cases, a popular multilabel learning method, ML-kNN, is used as the classification model. Information fusion, which properly combines information from different diagnostic methods, is used to improve classification performance; the results confirm the TCM theory of "comprehensive analysis of data gained by the four diagnostic methods." In future work, we will continue this study to address the imbalance in the dataset and to try model-level information fusion.