Pulse Diagnosis Signals Analysis of Fatty Liver Disease and Cirrhosis Patients by Using Machine Learning

Objective. To compare the pulse diagnosis signals of patients with fatty liver disease (FLD) and patients with cirrhosis. Methods. After collecting the pulse waves of patients with FLD, patients with cirrhosis, and healthy volunteers, we carried out pretreatment, parameter extraction based on harmonic fitting, modeling, and identification step by step, using unsupervised learning (Principal Component Analysis, PCA) and supervised learning (Least Squares Regression, LS, and Least Absolute Shrinkage and Selection Operator, LASSO) with cross-validation. Results. There is a significant difference between the pulse diagnosis signals of healthy volunteers and those of patients with FLD and cirrhosis, and the result was confirmed by all 3 analysis methods. The identification accuracy of the 1st principal component is about 75% by PCA without any classification information, and the accuracy of supervised learning (LS and LASSO) was more than 93% when 7 parameters were used and 84% when only 2 parameters were used. Conclusion. The method built in this study, which combines unsupervised learning (PCA) with supervised learning (LS and LASSO), might offer some confidence for the realization of computer-aided diagnosis by pulse diagnosis in TCM. In addition, this study might offer some important evidence for the science of pulse diagnosis in TCM clinical diagnosis.


Introduction
Pulse diagnosis has played an important role in clinical diagnosis and therapeutic evaluation in TCM for several thousand years. Modern research on pulse diagnosis based on modern technology, such as signal analysis, is very important for the development of TCM.
Machine learning builds empirical models from data for analysis and forecasting and has recently been used for TCM data analysis [1], especially for TCM diagnosis data. Research on modeling and symptom selection for multilabel data in the inquiry diagnosis of coronary heart disease (CHD) [1,2] and on multiclass support vector machines in lip diagnosis [3] has offered new methods for modern research on TCM. Pulse diagnosis is more difficult to standardize. Our team has done some work on it, including data analysis for pregnant women, animals with high blood pressure (HBP) and heart failure, rats after nephrectomy, and patients with CHD, HBP, and so forth [4][5][6][7]. The results are encouraging.
Supervised learning and unsupervised learning are the primary methods of machine learning. In research on pulse diagnosis in TCM, it is very difficult to collect a large number of high-quality samples. Accordingly, in this study we combined an unsupervised method (PCA) with supervised methods (LS and LASSO) to analyze the signals collected from patients with different diseases and from healthy volunteers, cross-referencing the methods to achieve reliable identification from pulse diagnosis signals in TCM.
Fatty liver disease (FLD) and cirrhosis are both common liver diseases with high incidence in clinical practice. FLD is generally described as the build-up of fat in the liver cells. The prevalence of nonalcoholic fatty liver disease (NAFLD) ranges from 9% to 36.9% of the population in different parts of the world [8][9][10]; even in the military, the incidence of NAFLD is about 17.1% among navy flight crew and submariners [11].
The Scientific World Journal
Cirrhosis is a result of advanced liver disease. It is characterized by replacement of liver tissue by fibrosis and regenerative nodules. Cirrhosis is most commonly caused by alcoholism, hepatitis B and hepatitis C, and fatty liver disease. Cirrhosis is a leading cause of death in the world. In Europe, 95,609 males and 53,123 females died of cirrhosis in 2002, with large differences in age-adjusted death rates among the different European geographical areas [12]. Complications such as ascites, esophageal variceal bleeding, hepatic encephalopathy, and hepatorenal syndrome are the main causes of death in patients with cirrhosis.
Traditional Chinese Medicine plays an important role in the treatment of these two diseases, and pulse diagnosis can help clinicians during diagnosis and treatment, including prescription and evaluation. Experienced doctors can feel the difference between patients and healthy people just by feeling the pulse; some can even distinguish cirrhosis patients from FLD patients, but objective evidence for this has been lacking.
In this study, we analyze the pulse signals collected by a pulse-collecting instrument from 3 groups of people: healthy volunteers, patients with FLD, and patients with cirrhosis. Both supervised learning and unsupervised learning are used, and the results are encouraging.

Objects
We collected the pulse waves of 100 healthy volunteers at the graduate schools of China Academy of Chinese Medical Sciences and Tsinghua University. Fifty patients with FLD were recruited at China Academy of Chinese Medical Sciences, and 50 patients with cirrhosis were recruited at Guang'anmen Hospital of China Academy of Chinese Medical Sciences, from December 2012 to July 2013. All the volunteers and patients were asked to fill in questionnaires. According to the quality of the signals and the completeness of the information, we chose 98 cases from the healthy volunteers, 38 cases from the patients with FLD, and 27 cases from the patients with cirrhosis.

Signal Collection.
Volunteers were first asked to sit quietly and regulate their breathing for 15 min; we then collected the pulse waves for 40 s at 3 places on the radial artery of both the left and right sides, called "cun," "guan," and "chi" in TCM (Figure 1 shows the details), using the instrument called "Collection and Analysis System of Pulse Diagnosis Signals in TCM" (patent number of pulse-collecting instrument: …).

This study was approved by the ethics committee of the Experimental Research Center of China Academy of Chinese Medical Sciences, and each of the patients and volunteers read and signed the informed consent.

Pretreatment and Parameters
Each pulse cycle is modeled by a harmonic series. When the fitting error of this model reaches its minimum, the difference between the model and the original signals is also at its minimum, and the coefficients $a_k$ and $b_k$ carry the information of all the cycles. For any cycle whose period length is $T$, we can use

$$f(t) = a_0 + \sum_{k=1}^{n}\left(a_k \cos\frac{2\pi k t}{T} + b_k \sin\frac{2\pi k t}{T}\right)$$

to be the model of pulse waves, so the amplitude and phase of each harmonic, obtained from $a_k$ and $b_k$, are the parameters we need. In fact we cannot reconstruct the waveform of a cycle even if we have all its characteristics, but we can reconstruct one from the harmonic parameters by the formula above (the fitting result is shown in Figures 5 and 6).
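The harmonic fitting described above can be sketched as an ordinary least squares fit of a truncated Fourier series to one pulse cycle. This is only an illustrative sketch: the function names `fit_harmonics`, `amplitude_phase`, and `reconstruct` are hypothetical, and the phase convention $a_k\cos x + b_k\sin x = C_k\cos(x - F_k)$ is an assumption, since the conversion formula used in the study is not recoverable from the text.

```python
import numpy as np

def fit_harmonics(t, y, period, n_harmonics=12):
    """Least squares fit of y(t) ~ a0 + sum_k (a_k cos(2*pi*k*t/T) + b_k sin(2*pi*k*t/T))."""
    k = np.arange(1, n_harmonics + 1)
    angles = 2.0 * np.pi * np.outer(t, k) / period      # shape (len(t), n_harmonics)
    design = np.hstack([np.ones((len(t), 1)), np.cos(angles), np.sin(angles)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    a0 = coef[0]
    a = coef[1:n_harmonics + 1]
    b = coef[n_harmonics + 1:]
    return a0, a, b

def amplitude_phase(a, b):
    """Assumed convention: a_k cos(x) + b_k sin(x) = C_k cos(x - F_k)."""
    return np.hypot(a, b), np.arctan2(b, a)

def reconstruct(t, period, a0, amp, phase):
    """Rebuild the waveform of one cycle from amplitudes and phases."""
    k = np.arange(1, len(amp) + 1)
    angles = 2.0 * np.pi * np.outer(t, k) / period
    return a0 + np.cos(angles - phase) @ amp
```

Under these assumptions, calling `fit_harmonics` on one sampled cycle and passing the result through `amplitude_phase` gives 12 amplitudes and 12 phases, and `reconstruct` redraws the fitted cycle from them.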

Parameters Extracting.
To build models for the pulse waves collected from 6 places (3 places on the radial artery of each side, called "cun," "guan," and "chi" in TCM), we extracted the amplitude and phase of 12 harmonics (C1-C12, F1-F12) and added 9 time domain parameters [13], such as $h_1$, $t_1$, $h_4/h_1$, and $h_5/h_1$ (Figure 7 shows the details). Because the period of the waves from each place is similar, it is counted only once, and we get 193 parameters from every volunteer: 32 parameters × 6 places + 1.

Principal Component Analysis (PCA).
PCA [14,15] is a method of multivariate statistical analysis that explains the variance of data by linear combinations of the parameters. In this study we extracted 193 parameters from the pulse diagnosis signals collected from the 6 places and decomposed all the data into orthogonal components by PCA, which means reducing the high-dimensional data to several mutually orthogonal one-dimensional arrays (Figure 8 shows the details). The first principal component reflects the direction of greatest variation.
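As a rough sketch of how a single principal component can be used for identification, the code below projects a samples-by-parameters matrix onto its first principal component and then scores the best single-threshold split against the known group labels. The labels are used only to measure accuracy, not to compute the component. The function names, the z-score standardisation, and the threshold rule are assumptions made for illustration; the text does not state how the reported accuracy was computed.

```python
import numpy as np

def first_principal_component(X, standardize=True):
    """Return scores on the first principal component, its direction,
    and the fraction of variance it explains."""
    Z = X - X.mean(axis=0)
    if standardize:                      # assumption: parameters have mixed units
        std = Z.std(axis=0)
        std[std == 0] = 1.0              # leave constant columns unscaled
        Z = Z / std
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)
    return Z @ Vt[0], Vt[0], explained[0]

def threshold_accuracy(scores, labels):
    """Best accuracy of a one-threshold rule on 1-D scores (labels are 0/1).
    Ties between equal scores are not treated specially."""
    order = np.argsort(scores)
    y = np.asarray(labels)[order]
    n = len(y)
    ones_left = np.concatenate([[0], np.cumsum(y)])   # label-1 count among the i smallest
    i = np.arange(n + 1)
    # Rule: the i smallest scores are class 0, the rest class 1
    correct = (i - ones_left) + (ones_left[-1] - ones_left)
    return np.maximum(correct, n - correct).max() / n  # allow either orientation
```

With a 163 × 193 matrix of extracted parameters, `threshold_accuracy(scores, labels)` would give a figure comparable in spirit to the single-component accuracies reported in the Results, under the assumptions stated above.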

Least Squares Regression (LS).
The purpose of classification and identification is to establish a method that distinguishes two or more groups of known data and can then be used to identify new data. We do it as follows: (a) design a one-dimensional vector as the optimal regressand in the sense of canonical correlation and then perform Least Squares Regression.

Figure 8: Diagram of data decomposition in PCA (axis labels $x_1$, $x_2$, $z_1$, $z_2$).
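A minimal sketch of classification by least squares regression, for the two-group case: the group label is coded as a one-dimensional regressand, regressed on the parameters, and a new sample is assigned by the sign of the fitted value. The ±1 coding, the zero threshold, and the function names are assumptions for illustration; the regressand actually designed in the study is not specified in the text.

```python
import numpy as np

def fit_ls_classifier(X, labels):
    """Regress a +1/-1 coded group label on the parameters (with intercept)."""
    y = np.where(np.asarray(labels) == 1, 1.0, -1.0)   # assumed coding of the regressand
    A = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict_ls(w, X):
    """Assign group 1 when the fitted value is positive, else group 0."""
    A = np.column_stack([np.ones(len(X)), X])
    return (A @ w > 0).astype(int)
```

For two groups this kind of indicator regression gives the same discriminant direction as Fisher's linear discriminant, which is one reason it is a common baseline.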

Least Absolute Shrinkage and Selection Operator (LASSO).
LASSO [16] is a variable selection method proposed by Tibshirani in 1996 [17] on the basis of Bridge Regression by Frank and Friedman [18] and the Nonnegative Garrote by Breiman [19]. The algorithm is summarized as follows. Suppose $\beta$ is the coefficient of the model, the corresponding loss function is $L(\beta)$, and $\beta$ is a $p$-dimensional vector. The penalized estimate of the parameters is

$$\hat{\beta} = \arg\min_{\beta}\Big\{L(\beta) + \lambda\sum_{j=1}^{p} P(|\beta_j|)\Big\}.$$

When $L(\beta) = (y - X\beta)^2$ and $P(|\beta_j|) = |\beta_j|^{\gamma}$, this is Bridge Regression. When $\gamma = 1$, it is the LASSO (Figure 9 shows the details). In this study, we use LASSO with cross-validation to choose the regressors. It usually selects fewer regressors and trades off bias against mean squared error, so it may increase the accuracy of the model on new data.
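To make the LASSO step concrete, here is a small sketch of LASSO fitted by cyclic coordinate descent, with the penalty weight chosen by k-fold cross-validation. It is not the implementation used in the study: the function names, the objective scaling $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \lambda\lVert\beta\rVert_1$, the fixed iteration count, and the candidate grid `lambdas` are all assumptions. For classification, `y` could be a coded group label.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/(2n))||y - Xb||^2 + lam*||b||_1 (X and y assumed centred)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b                          # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]            # remove feature j's contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]            # add the updated contribution back
    return b

def lasso_cv(X, y, lambdas, n_folds=5, seed=0):
    """Pick lam by k-fold cross-validated squared error, then refit on all data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_err = np.zeros(len(lambdas))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        for i, lam in enumerate(lambdas):
            b = lasso_cd(X[train] - mu_x, y[train] - mu_y, lam)
            pred = (X[test_idx] - mu_x) @ b + mu_y
            cv_err[i] += np.mean((y[test_idx] - pred) ** 2) / n_folds
    best = lambdas[int(np.argmin(cv_err))]
    return best, lasso_cd(X - X.mean(axis=0), y - y.mean(), best), cv_err
```

Larger values of `lam` set more coefficients exactly to zero, which is how the method ends up with only a few regressors.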

Comprehensive Comparison and Analysis.
We compare the results of the 3 methods. The result of PCA, the unsupervised method, shows whether there is an innate difference between the groups. The results of supervised learning help feature selection by mining the features that matter most for classification. By cross-checking the 3 methods against one another, a reliable conclusion can be drawn from the analysis of pulse diagnosis signals.

Results
We have used 163 samples in this research (98 healthy volunteers, 38 patients with FLD, and 27 patients with cirrhosis). The results were based on the 193 parameters extracted from the signals in 3 places of each side of the radial artery called "cun," "guan," and "chi" in TCM by PCA, LS, and LASSO.

Signal Analysis between Healthy Volunteers and Patients with FLD

Principal Component Analysis (PCA).
Using unsupervised learning without any information to guide the classification, we found an obvious difference between the pulse waves of healthy volunteers and patients with FLD. The accuracy of classifying the signals of the two groups by the 1st principal component alone is 83% (Figure 10 shows the details). The result suggests that it is feasible to separate these two groups without any supervision, owing to physiological changes.

Supervised Learning: LS and LASSO.
Clinical data typically have small samples with large dimensionality. To avoid false classification by the LS method, we built a program that tests our method on 23 simulated samples to decide the upper limit on the number of selected regressors according to the number of samples. In this study, we can use 7 regressors at most.
(a) Results by LS. Using LS, we found that the two groups of signals are easy to classify: the accuracy is 91% with 7 parameters and 82% with only 2 parameters. The most important parameters, mined by a method we built named EFBLS (Extended Forward Backward Least Square Regression), mainly appeared at zuocun, youguan, and youchi (Figure 11 and Table 1 show the details).
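The idea of capping the number of regressors can be illustrated with plain greedy forward selection. This is not EFBLS, whose details are not given in the text; it is a simpler stand-in that adds, at each step, the parameter that most reduces the residual sum of squares and stops at the cap. The function name `forward_select` and its arguments are hypothetical.

```python
import numpy as np

def forward_select(X, y, max_regressors=7):
    """Greedy forward selection of at most `max_regressors` columns of X
    (a stand-in for illustration, not the EFBLS method)."""
    n, p = X.shape
    selected = []
    for _ in range(min(max_regressors, p)):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            # Refit least squares with the candidate column added
            A = np.column_stack([np.ones(n), X[:, selected + [j]]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```

Calling `forward_select(X, y, max_regressors=2)` and `forward_select(X, y, max_regressors=7)` mirrors the 2-parameter and 7-parameter settings reported here, although the parameters chosen by EFBLS may differ.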

Comprehensive Comparison and Analysis.
Comparing the results of the 3 methods, which combine unsupervised and supervised learning, we conclude that there is an obvious difference between the pulse signals of healthy volunteers and FLD patients, with a classification accuracy of about 85%. The features we mined were mainly located at zuocun, youguan, and youchi. In the theory of pulse diagnosis in TCM, youguan represents digestive function: if a digestive problem occurs, doctors can feel that the pulse at youguan has changed. In this study, the result partly matched the theory of TCM. However, we need more data to confirm it.

Signal Analysis between Healthy Volunteers and Patients with Cirrhosis

Principal Component Analysis (PCA).
Using unsupervised learning without any information to guide the classification, we found an obvious difference between the pulse waves of healthy volunteers and patients with cirrhosis. The accuracy of classifying the signals of the two groups by the 1st principal component alone is 72% (Figure 13 shows the details).

Supervised Learning: LS and LASSO.
In this study, we can use up to 7 regressors.
(a) Results by LS. Using LS, we found that the two groups of signals are easy to classify: the accuracy is 93% with 7 parameters and 84% with only 2 parameters. The most important parameters, mined by a method we built named EFBLS (Extended Forward Backward Least Square Regression), mainly appeared at zuoguan, zuochi, and youguan (Figure 14 and Table 2 show the details).
(b) Results by LASSO. The accuracy of classification is 72% with 2 parameters, and the model is given by an equation (Figure 15 shows the details).

Figure 15: Classification by LASSO. The misclassification rate with 2 parameters is 0.28, which means that the accuracy of classifying these 2 groups is 72%; the green samples (healthy volunteers) mainly appear to the left of the red point and the blue ones (patients with cirrhosis) to the right.

Comprehensive Comparison and Analysis.
Comparing the results of the 3 methods, which combine unsupervised and supervised learning, we conclude that there are differences between the pulse signals of healthy volunteers and cirrhosis patients, with a classification accuracy of about 75%. The features we mined were mainly located at zuoguan. In the theory of pulse diagnosis in TCM, zuoguan represents the function of the liver and gallbladder: if a problem occurs in the liver or gallbladder, doctors can feel that the pulse at zuoguan has changed. In this study, the result matched the theory of TCM.
Signal Analysis between Patients with FLD and Patients with Cirrhosis

Principal Component Analysis (PCA).
Using unsupervised learning without any information to guide the classification, we found an obvious difference between the pulse waves of patients with FLD and patients with cirrhosis. The accuracy of classifying the signals of the two groups by the 4th principal component alone is 78% (Figure 16 shows the details).

Supervised Learning: LS and LASSO.
In this study, we can use up to 7 regressors.
(a) Results by LS. Using LS, we found that the two groups of signals are easy to classify: the accuracy is 91% with 7 parameters and 73% with only 2 parameters. The most important parameters, mined by a method we built named EFBLS (Extended Forward Backward Least Square Regression), mainly appeared at zuocun, youguan, and youchi (Figure 17 and Table 3 show the details).

Figure 17: Classification by LS. The misclassification rate with only 2 parameters is 0.27, which means that the accuracy of classifying these 2 groups is 73%; the green samples (patients with FLD) mainly appear to the left of the red point and the blue ones (patients with cirrhosis) to the right.

The misclassification rate with 2 parameters is 0.28, which means that the accuracy of classifying these 2 groups is 72%; the green samples (patients with FLD) mainly appear to the left of the red point and the blue ones (patients with cirrhosis) to the right.

Comprehensive Comparison and Analysis.
Cross-checking the results of the 3 methods, we conclude that there are differences between the pulse signals of FLD patients and cirrhosis patients, with a classification accuracy of about 70%. The features we mined were mainly located at youguan and youchi. This result does not simply match the theory of TCM mentioned in Section 4.2.3, in which zuoguan represents the function of the liver and gallbladder. FLD and cirrhosis are liver diseases at different stages, and in cirrhosis the pathologic changes are much more serious, affecting not only the liver but also the blood vessels and the digestive system. This may be why the features appear at points other than zuoguan.

Conclusions
There is a significant difference between the pulse diagnosis signals of healthy volunteers and those of patients with FLD and cirrhosis, and the result was confirmed by 3 analysis methods. The identification accuracy of the 1st principal component is about 75% by PCA without any classification information, and the accuracy of supervised learning (LS and LASSO) was more than 93% when 7 parameters were used and 84% when only 2 parameters were used. From the results, we can draw the following conclusions.
(1) The machine learning method we built, which combines unsupervised learning (PCA) with supervised learning (LS and LASSO), is feasible for analyzing pulse diagnosis signals. Moreover, according to the cross-reference of the 3 methods and the equation established by LASSO, pulse diagnosis signals in TCM can reliably distinguish the healthy volunteers from the patients. This method can help offer objective data to support the important role of pulse diagnosis in TCM.
(2) The features we mined by LS and LASSO to classify the healthy volunteers and the patients with FLD and cirrhosis appear at the specific places called "cun," "guan," and "chi" used when feeling the pulse. For example, the features that classify the healthy volunteers and patients with FLD mainly appear at zuocun, youguan, and youchi. The features that classify the healthy volunteers and patients with cirrhosis mainly appear at zuoguan and zuochi. However, youguan and youchi are the main places where the features distinguishing FLD patients from cirrhosis patients appear. This result is similar to the theory of pulse diagnosis in Traditional Chinese Medicine (TCM) and can support modern research on pulse diagnosis. But more research is needed to confirm the conclusion.
(3) FLD and cirrhosis are among the most common liver diseases, with high incidence among middle-aged men. The two diseases share some pathological features and differ in others. Pulse diagnosis in TCM diagnoses disease by feeling the pulse of the radial artery, but objective evidence for this has been lacking. In this study, the pulse waves collected at the places called "cun," "guan," and "chi" on the radial artery allowed people in different health conditions to be classified with high accuracy, even when the diseases affected the same organ, the liver. These results offer some important evidence for the science of pulse diagnosis in TCM clinical diagnosis. Of course, we need more data to confirm it.
In a word, the machine learning method built in this study, which combines unsupervised learning (PCA) with supervised learning (LS and LASSO), might offer some confidence for the realization of computer-aided diagnosis by pulse diagnosis in TCM. In addition, this study might offer some important evidence for the science of pulse diagnosis in TCM clinical diagnosis.