A New Method for Syndrome Classification of Non-Small-Cell Lung Cancer Based on Data of Tongue and Pulse with Machine Learning

Objective. To explore the tongue and pulse data characteristics of non-small-cell lung cancer patients with Qi deficiency syndrome and Yin deficiency syndrome, to establish syndrome classification models based on tongue and pulse data using machine learning methods, and to evaluate the feasibility of syndrome classification based on these data. Methods. We collected tongue and pulse data of non-small-cell lung cancer patients with Qi deficiency syndrome (n = 163), patients with Yin deficiency syndrome (n = 174), and healthy controls (n = 185) using an intelligent tongue diagnosis analysis instrument and a pulse diagnosis analysis instrument, respectively. We described the characteristics of the tongue and pulse data and examined their correlation. Four machine learning methods, namely, random forest, logistic regression, support vector machine, and neural network, were used to establish classification models based on symptom, tongue and pulse, and symptom and tongue and pulse, respectively. Results. The tongue diagnosis indices that differed significantly between Qi deficiency syndrome and Yin deficiency syndrome were TB-a, TB-S, TB-Cr, TC-a, TC-S, TC-Cr, perAll, and the tongue coating texture indices TC-CON, TC-ASM, TC-MEAN, and TC-ENT. The pulse diagnosis indices that differed significantly were t4 and t5. The classification performance of the models based on the different datasets was ordered as follows: tongue and pulse < symptom < symptom and tongue and pulse. The neural network model had the better classification performance for the symptom and tongue and pulse dataset, with an area under the ROC curve of 0.9401 and an accuracy of 0.8806. Conclusions. It is feasible to use tongue data and pulse data as an objective diagnostic basis for Qi deficiency syndrome and Yin deficiency syndrome of non-small-cell lung cancer.


Introduction
Lung cancer is a common malignant tumor of the lung and is a major cause of morbidity and mortality. It is estimated that deaths from lung cancer account for about 24% of all cancer deaths in the United States [1,2]. An organization report shows that lung cancer causes approximately 1.76 million deaths worldwide each year, accounting for 18.7% of all cancer deaths [3]. Non-small-cell lung cancer (NSCLC) is the most common histological type of lung cancer, accounting for more than 80% of primary lung cancers [4]. Sixty percent of NSCLC cases have metastasized at the time of diagnosis. The 5-year survival rate for advanced NSCLC is lower than 5%, and early diagnosis of lung cancer is an important opportunity to reduce mortality [5,6]. The current treatment methods for NSCLC mainly include surgery, radiotherapy, chemotherapy, and targeted therapy [7,8]. Chemotherapy is the most common treatment. However, patients in poor health often have a low tolerance to conventional treatment and a tendency toward drug resistance [9].
Traditional Chinese medicine (TCM) has a long history and rich experience in the treatment of lung cancer, which is one of the main methods of comprehensive treatment of lung cancer in China. Systematic evaluation of TCM shows that TCM combined with radiotherapy and chemotherapy and targeted therapy had certain advantages in alleviating symptoms, stabilizing tumors, improving life quality, and prolonging survival period [10]. TCM has been proved to be an effective method for the treatment of advanced lung cancer. On the basis of accurate syndrome differentiation, TCM plays an active role in each stage of the occurrence and development of lung cancer [11,12].
Syndrome differentiation and treatment is the basic principle of TCM to diagnose and deal with diseases. It is a process of comprehensive judgment on the four types of diagnostic information of patients based on the theory of TCM combined with the doctor's experience [13]. Accurate syndrome differentiation is able to provide a basis for the treatment of diseases and is the foundation of clinical efficacy. Traditional syndrome differentiation and treatment inevitably suffer from subjectivity and ambiguity, which actually hinders the development of TCM. Microsyndrome differentiation is a method of using modern advanced technology to go deep into the body's microcosmic level to understand and differentiate syndromes on the basis of macroscopic syndrome differentiation. Microsyndrome differentiation can be used to guide disease differentiation and syndrome differentiation, explore the cause and pathogenesis, and evaluate the efficacy and guide the prognosis of the disease [14]. Previous studies have verified that there is a close relationship between different syndromes and physical and chemical indices. A combination of microindex and macrosymptom can assist syndrome differentiation effectively.
With the rapid development of modern research on tongue and pulse diagnoses, a variety of tongue and pulse diagnosis instruments are widely used in clinical practice. This has generated a large amount of objective tongue and pulse data, which are also microscopic indices in a sense. In recent years, studies based on tongue and pulse data have been increasing, with many researchers applying machine learning and data mining methods from fields such as image recognition, target detection, and natural language processing [15][16][17][18]. In addition, studies have demonstrated that accurate detection, identification, and multidimensional quantitative analysis based on tongue data and pulse data are gradually being applied to disease diagnosis. Constructing the diagnostic relationship between tongue and pulse and health status not only saves medical resources but also greatly improves the efficiency of diagnosis and treatment [19][20][21][22]. Qi deficiency syndrome and Yin deficiency syndrome are the two main syndromes of NSCLC. When the symptoms are not obvious, traditional symptom-based syndrome differentiation cannot be carried out. Modern research on tongue and pulse diagnoses provides a good data basis for TCM syndrome differentiation.
Tongue data and pulse data are the most representative data of the four diagnoses of TCM. Data collected and analyzed under standardized conditions have a high level of stability, which provides reliable objective data for intelligent syndrome differentiation. Across all kinds of syndromes, tongue and pulse are related to some extent, but traditional syndrome differentiation cannot explain this relationship clearly because it lacks accurate data. With the development of diagnosis technology, the relationship between tongue and pulse can be analyzed and interpreted more clearly. In this study, two common syndromes of NSCLC were selected to explore the differences in tongue data and pulse data and to quantitatively analyze the correlation between them. Machine learning methods were used to establish syndrome classification models based on macrosymptoms, on objective tongue and pulse data, and on both combined, and to evaluate the contribution of the objective tongue and pulse data to syndrome differentiation.

Study Design and Subjects.
We selected a total of 337 patients from the oncology department of Yueyang Hospital of Integrated Traditional Chinese and Western Medicine from January 2018 to October 2020, including 163 patients with Qi deficiency syndrome and 174 patients with Yin deficiency syndrome. All patients were pathologically or cytologically confirmed to be NSCLC. We additionally selected a total of 184 healthy people from Shuguang Hospital of Shanghai University of Traditional Chinese Medicine from January 2018 to October 2020 as the healthy controls. The flowchart is shown in Figure 1.

Diagnostic Criteria.
Diagnostic criteria of Western medicine: according to the clinical practice guidelines for lung cancer screening issued by the National Comprehensive Cancer Network (NCCN) [23] and the fourth edition lung cancer histological classification standards of "Classification of Lung Tumors" [24,25] issued by the World Health Organization.
TCM Syndrome Differentiation Standard: according to the "Technical Guidelines for Clinical Research of New Drugs of Syndromes" [26] and the Syndrome Part of TCM Clinical Diagnosis and Treatment Terms [27] and textbooks of Common Diseases and Symptoms in Internal Medicine of Traditional Chinese Medicine.
The main manifestations of Qi deficiency syndrome are cough, white or foamy phlegm, small amount of hemoptysis, chest tightness, shortness of breath, low fever, spontaneous sweating, lack of energy, pale complexion, poor appetite, loose stools, pale red tongue with tooth marks, thin white coating, and thin pulse. The main manifestations of Yin deficiency syndrome are cough without phlegm, or less but sticky phlegm, phlegm with blood, shortness of breath and dull chest pain, low fever, dry mouth, night sweat, upset and insomnia, red tongue, little or bare without tongue coating, and thin and rapid pulse. The syndrome was determined by at least three senior physicians to ensure the consistency and authenticity of syndrome differentiation.

BioMed Research International

Inclusion and Exclusion Criteria.
The inclusion criteria were as follows: (1) meeting the above diagnostic criteria, (2) diagnosis confirmed by pathology or cytology, (3) no serious liver or kidney damage, and (4) having been informed of the study and having signed informed consent. The exclusion criteria were as follows: (1) not meeting the inclusion criteria for NSCLC, (2) Qi deficiency syndrome combined with Yin deficiency syndrome, (3) severe primary diseases of the cardiovascular, cerebrovascular, liver, kidney, or blood systems, (4) pregnancy or lactation, (5) mental illness, and (6) inability to cooperate with the research work for subjective or objective reasons, or poor compliance.

Collecting Clinical Data of Tongue and Pulse.
We used the TFDA-1 digital tongue diagnosis instrument and the PDA-1 digital pulse diagnosis instrument, developed under the National Key Research and Development Program, to collect the tongue and pulse diagnostic data of patients, respectively. We used the Information Record Form of TCM Clinical Four Diagnostics (Copyright No.: 2016Z11L025702) developed by our research group to record the symptoms of patients [28]. All tongue and pulse data collection and inquiry were completed by TCM or integrated TCM and Western medicine professionals who had received standardized training. Each patient was consulted by at least two professional researchers, and the syndromes of all patients were judged by three senior doctors to ensure the consistency and authenticity of data collection and interpretation and to minimize deviation.
The TFDA-1 digital tongue diagnosis instrument and the tongue diagnosis analysis system (TDAS v2.0) are shown in Figures 2 and 3. The tongue was imaged by a video camera (Nikon 1 J5) with a fixed-focal lens which has 12 megapixels, and the picture resolution is 5568 × 3712. The TFDA-1 digital tongue diagnosis instrument uses LED light sources, and a curved reflector is set in front of the light sources to ensure uniform illumination of all parts of the tongue when the image is collected. The color rendering index of the light source is 96, and the color temperature is around 5,000-6,500 K. Parameters of the TFDA-1 digital tongue diagnosis instrument are
The PDA-1 digital pulse diagnosis instrument and its corresponding sphygmogram are shown in Figure 4. The PDA-1 pulse diagnosis instrument uses a pressure sensor. The probe is placed at the guan position of the patient's left hand, the strap is fixed, and its tightness is adjusted so that the sphygmogram reaches its best peak (a main wave peak of 2 grids or more). Once the waveform is stable, 30 s of data are collected.
Tongue indices can be divided into two categories, tongue body (TB) indices and tongue coating (TC) indices, which mainly come from the three color spaces Lab, HSI, and YCrCb [29][30][31][32]. Each parameter of tongue diagnosis and pulse diagnosis has its corresponding medical significance [32][33][34]. The color indices are R (red), G (green), B (blue), H (hue), S (saturation), I (intensity), L (lightness), a (red-green axis), b (yellow-blue axis), Y (brightness), Cr (difference between the red signal and brightness), and Cb (difference between the blue signal and brightness). The texture indices are CON (contrast), ASM (angular second moment), ENT (entropy), and MEAN (mean). The tongue coating indices are perAll and perPart: perAll represents the ratio of coated tongue area to total tongue area, and perPart represents the ratio of coated tongue area to noncoated tongue area.
In the pulse indices, h1-h5 mainly represent amplitude heights. h1 is the main wave amplitude, h3 is the amplitude of the predicrotic (tidal) wave, h3/h1 is the ratio of the predicrotic wave amplitude to the main wave amplitude, h4 is the dicrotic notch amplitude, h4/h1 is the ratio of the dicrotic notch amplitude to the main wave amplitude, h5 is the dicrotic wave amplitude, and h5/h1 is the ratio of the dicrotic wave amplitude to the main wave amplitude. t represents a complete pulse cycle, and t1 is the time from the start point to the crest of the main wave on the sphygmogram. t4 is the time from the start point to the dicrotic notch on the sphygmogram, and t5 is the time from the dicrotic notch to the end point on the sphygmogram. w1 is the width at 1/3 of the main wave, and w2 is the width at 1/5 of the main wave. All tongue and pulse indices were extracted by dedicated tongue analysis software (TDAS v2.0) and pulse analysis software (PulseCol).
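As a rough illustration of how tongue color and coating indices of this kind can be derived from pixel data, the sketch below converts an RGB pixel to YCrCb and computes perAll and perPart from a labeled mask. It is not the TDAS v2.0 implementation, which the text does not describe. The conversion uses the common full-range ITU-R BT.601 weights as an assumption, and the function names and the mask encoding (1 = coated, 0 = noncoated, None = background) are hypothetical.

```python
def rgb_to_ycrcb(r, g, b):
    """Convert 8-bit RGB to full-range YCrCb (ITU-R BT.601 weights, assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # brightness
    cr = 128 + 0.713 * (r - y)              # red signal minus brightness
    cb = 128 + 0.564 * (b - y)              # blue signal minus brightness
    return y, cr, cb


def coating_ratios(mask):
    """Return (perAll, perPart) from a labeled tongue mask.

    mask is a 2D list: 1 = coated tongue pixel, 0 = noncoated tongue
    pixel, None = background (hypothetical encoding for this sketch).
    """
    coated = sum(px == 1 for row in mask for px in row)
    noncoated = sum(px == 0 for row in mask for px in row)
    per_all = coated / (coated + noncoated)          # coated / total tongue area
    per_part = coated / noncoated if noncoated else float("inf")
    return per_all, per_part


if __name__ == "__main__":
    # A redder pixel yields a larger Cr value than a paler one.
    print(rgb_to_ycrcb(200, 100, 100))
    print(rgb_to_ycrcb(150, 100, 100))
    print(coating_ratios([[None, 1, 1], [0, 1, 0], [None, 0, None]]))
```

In practice the TB and TC indices would be averages of such per-pixel values over the segmented tongue body and tongue coating regions, respectively.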

Statistical Analysis.
SPSS 26.0 was used for statistical analysis. Categorical variables were expressed as percentages (%). Continuous variables were expressed as mean ± standard deviation (SD) for those with normal distribution or median (interquartile range) for those with skewed distribution. Continuous variables were compared with analysis of variance (ANOVA) or rank-sum test (Kruskal-Wallis H test), and the correlation heat map was made by GraphPad Prism 8.0. A two-sided P value < 0.05 was considered statistically significant.
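For readers who want to see what the rank-sum comparison of three groups involves, here is a minimal pure-Python sketch of the Kruskal-Wallis H test with tie correction. It is illustrative only: the study used SPSS 26.0, and the function name and the sample values below are made up. The P value uses the chi-square approximation and is computed only for three groups, where the chi-square survival function with 2 degrees of freedom reduces to exp(-H/2).

```python
import math
from collections import Counter


def kruskal_wallis(*groups):
    """Return (H, p) for independent groups; p is given only for 3 groups."""
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    # Assign each distinct value the mean of the ranks it occupies.
    rank_of, i = {}, 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    # Tie correction (undefined if every observation is identical).
    ties = Counter(pooled)
    h /= 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    # With 3 groups, df = 2 and the chi-square tail is exp(-H / 2).
    p = math.exp(-h / 2) if len(groups) == 3 else None
    return h, p


if __name__ == "__main__":
    # Hypothetical index values for three groups (not study data).
    qi = [1.0, 2.0, 3.0]
    yin = [4.0, 5.0, 6.0]
    healthy = [7.0, 8.0, 9.0]
    print(kruskal_wallis(qi, yin, healthy))
```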
Classification by Machine Learning Approach.
We used four machine learning methods, namely, neural network, random forest, support vector machine (SVM), and logistic regression, in Orange (3.26.0) software, with the ratio of training set to test set fixed at 8 : 2. After adjusting the parameters of each model, we established classification and diagnosis models of Qi deficiency syndrome and Yin deficiency syndrome of NSCLC based on "symptom," "tongue and pulse," and "symptom and tongue and pulse," respectively. We used accuracy, precision, F1-score (F1), sensitivity, specificity, and area under the curve (AUC) as evaluation indices of predictive performance. AUC is the area under the ROC curve; the larger the value, the better the classification effect of the classifier. The calculation formula of each index was as follows:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.
\]

In the above formulas, True Positive (TP) is the number of positive samples predicted by the model as the positive category. True Negative (TN) is the number of negative samples predicted by the model as the negative category. False Positive (FP) is the number of negative samples predicted by the model as the positive category. False Negative (FN) is the number of positive samples predicted by the model as the negative category.
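The evaluation indices defined in terms of TP, TN, FP, and FN can be computed directly from the four counts of a confusion matrix. The sketch below uses the standard definitions; the function name and the counts in the example are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation indices from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # also called recall
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


if __name__ == "__main__":
    # Hypothetical counts for a two-class test set (not study results).
    for name, value in classification_metrics(tp=40, tn=45, fp=5, fn=10).items():
        print(f"{name}: {value:.4f}")
```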

Characteristics of Participants.
The basic statistical analysis result of the three groups is shown in Table 1.
The result showed that people with Qi deficiency syndrome and Yin deficiency syndrome were statistically significantly older than healthy controls. However, there was no difference in age between people with Qi deficiency syndrome and Yin deficiency syndrome.

Statistical Analysis of Tongue Data.
The statistical analysis result of tongue diagnosis data in the three groups is shown in Table 2.
The result showed that (1) compared with Qi deficiency syndrome, Yin deficiency syndrome showed more significant differences from the healthy controls. (2) Among the indices that differed significantly between Yin deficiency syndrome and healthy controls, except for the texture indices of the tongue coating, the changes in the tongue body indices of Yin deficiency syndrome were more significant than those in the tongue coating indices. (3) The tongue indices that differed significantly between Qi deficiency syndrome and Yin deficiency syndrome were TB-a, TB-S, TB-Cr, TC-a, TC-S, TC-Cr, perAll, TC-CON, TC-ASM, TC-MEAN, and TC-ENT; among them, TB-a, TB-Cr, TC-a, TC-S, TC-Cr, and TC-ASM of Yin deficiency syndrome were higher than those of Qi deficiency syndrome.

Statistical Analysis of Pulse Data.
The statistical analysis result of pulse diagnosis data in the three groups is shown in Table 3. The result showed that (1) the pulse parameters t1, t4, t5, h1, h3, h4, h5, h1/t1, h4/h1, t4/t5, w1/t, and w2/t of Qi deficiency syndrome and Yin deficiency syndrome differed statistically significantly from those of healthy controls. (2) Only two parameters, t4 and t5, showed statistically significant differences between Qi deficiency syndrome and Yin deficiency syndrome.

Correlation Analysis of Tongue Data and Pulse Data.
Tongue data and pulse data were statistically significantly correlated among people with Qi deficiency syndrome and Yin deficiency syndrome (Figure 5 and Table 4). The heat map result of Qi deficiency syndrome is shown in Figure 5.
The correlation analysis result of tongue data and pulse data in Qi deficiency syndrome is shown in Table 4.
The result showed that (1) there was a strong correlation among the tongue coating texture parameters, and the color space parameters of the tongue coating and the tongue body were also correlated. The correlation between the tongue coating texture parameters and the color space parameters was weaker than that among the pulse parameters. (2) There was a definite correlation between pulse parameter t4 and tongue parameters TC-ASM, TC-ENT, and TC-MEAN, with correlation coefficients of -0.18, 0.18, and 0.18, respectively. (3) There was a weak correlation between t5 and TB-Cr, with a correlation coefficient of -0.16 (P < 0.05).
The heat map result of Yin deficiency syndrome is shown in Figure 6.
The correlation analysis result of tongue data and pulse data in Yin deficiency syndrome is shown in Table 5.
The result showed that (1) similar to Qi deficiency syndrome, the tongue coating texture parameters of Yin deficiency syndrome were strongly correlated with one another, and the color space parameters of the tongue coating and tongue body were also strongly correlated. The correlation between tongue coating texture parameters and color space parameters was weaker than that among the pulse parameters. (2) There was a certain correlation between pulse parameter t4 and tongue parameters TC-ASM and TC-a. Both correlation coefficients were -0.14, but neither was statistically significant (P > 0.05). (3) t5 was strongly correlated with TB-a, TC-S, TC-Cr, and TB-a, and the correlation coefficients were -0.33, -0.27, -0.23, and -0.23, respectively (P < 0.01). The correlation coefficients of t5 with TB-Cr, TB-S, and TC-ASM were -0.21, -0.20, and -0.20, respectively (P < 0.01).
The correlation analysis result showed that the correlation between tongue and pulse in Yin deficiency syndrome was significantly stronger than that in Qi deficiency syndrome. Compared with Qi deficiency syndrome, the correlation between t4 and tongue indices in Yin deficiency syndrome was significantly reduced, while the correlation between t5 and tongue indices was significantly increased.
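The correlation coefficients reported in this section relate one pulse index to one tongue index across patients. The text does not state whether Pearson or Spearman coefficients were used, so the sketch below assumes the Pearson coefficient purely for illustration. The function name and the paired values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical paired measurements (not study data): a pulse timing
    # index and a tongue color index for five patients.
    t5 = [0.52, 0.48, 0.55, 0.44, 0.50]
    tb_a = [18.0, 21.5, 17.2, 23.0, 19.4]
    print(round(pearson_r(t5, tb_a), 3))
```

A heat map such as those in Figures 5 and 6 is then simply this coefficient evaluated for every pair of indices.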

Machine Learning Results.
The modeling results of the four machine learning methods (neural network, random forest, SVM, and logistic regression) for Qi deficiency syndrome and Yin deficiency syndrome based on symptom, tongue and pulse, and symptom and tongue and pulse are shown in Table 6.
Table 3: Statistical analysis of pulse diagnosis data (mean (SD), median (P25, P75)).

The ROC curves of the models based on symptom, tongue and pulse, and symptom and tongue and pulse are shown in Figures 7-9, respectively. According to the above modeling results, the classification efficiency of each model based on different datasets had the following order: tongue and pulse < symptom < symptom and tongue and pulse. Among them, the SVM model had a better classification performance for symptom datasets, and the area under the ROC curve was 0.9321. The logistic regression model had a better classification performance for tongue and pulse datasets, with an area under the ROC curve of 0.9401. The neural network model had a better classification performance for the symptom and tongue and pulse datasets, with an area under the ROC curve of 0.9401.
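The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC through that pairwise interpretation, counting ties as one half. It is a generic illustration, not the Orange implementation, and the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Each pair scores 1 if the positive is ranked higher, 0.5 for a tie.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical predicted probabilities for the positive class.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))
```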

Discussion
Treatment based on syndrome differentiation is the basic principle of TCM to recognize and treat diseases. It runs through the whole process of prevention and rehabilitation of medical care practices. Syndrome differentiation is used to recognize the disease and determine the syndrome, and treatment is to establish treatment methods and prescription drugs based on the results of syndrome differentiation. Syndrome differentiation is the prerequisite and basis for treatment. Accurate syndrome differentiation results in a good therapeutic effect. Qi deficiency syndrome and Yin deficiency syndrome are two common syndromes in TCM. According to the basic theory of TCM syndrome differentiation, Qi deficiency syndrome refers to the lack of vitality of the body and

lung cancer. Compared with chemotherapy and radiotherapy, it has the advantages of availability, effectiveness, and low toxicity [35], although its various mechanisms deserve further study [36,37]. In this study, the tongue parameters TB-a, TC-a, TB-Cr, and TC-Cr of Qi deficiency syndrome and Yin deficiency syndrome represent the red value of the tongue body and tongue coating. Larger values of these parameters reflect a redder or more magenta tongue. In Yin deficiency syndrome, TB-a, TC-a, TB-Cr, and TC-Cr were all higher than those in Qi deficiency syndrome, indicating that the tongue in Yin deficiency syndrome was redder or more magenta. S stands for saturation, and the higher the value of S, the brighter the tongue color. TC-S in Yin deficiency syndrome was higher than that in Qi deficiency syndrome, indicating that the tongue color in Yin deficiency syndrome was brighter. perAll is the ratio of tongue coating area to total tongue area. perAll has a higher diagnostic value for thick coating, and the higher the value, the thicker the tongue coating. perAll in Yin deficiency syndrome was lower than that in Qi deficiency syndrome, indicating that the tongue coating was thinner in Yin deficiency syndrome. Among the four texture parameters CON, ASM, ENT, and MEAN, smaller values of CON, ENT, and MEAN and a larger value of ASM reflect a more delicate tongue texture or a greasier tongue coating. In this study, TC-CON and TC-ENT of Yin deficiency syndrome were significantly lower than those of Qi deficiency syndrome, while TC-ASM was higher than that of Qi deficiency syndrome, indicating that the tongue coating in Yin deficiency syndrome was greasier.
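The direction of the texture parameters described above (smaller CON and ENT with larger ASM for a more delicate texture) matches the behavior of features computed from a gray-level co-occurrence matrix (GLCM). The sketch below assumes that CON, ASM, and ENT follow the standard GLCM definitions; the source does not give the formulas used by TDAS v2.0, and MEAN is left out because its definition is not stated. The function name, the pixel offset, and the tiny test images are hypothetical.

```python
import math


def glcm_features(image, levels, dx=1, dy=0):
    """Contrast, angular second moment, and entropy of a symmetric GLCM.

    image is a 2D list of integer gray levels in range(levels);
    (dx, dy) is the pixel offset defining neighboring pairs.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1          # count both directions
    total = sum(map(sum, counts))
    cells = [(i, j, counts[i][j] / total)
             for i in range(levels) for j in range(levels)]
    return {
        "CON": sum((i - j) ** 2 * p for i, j, p in cells),
        "ASM": sum(p * p for _, _, p in cells),
        "ENT": -sum(p * math.log2(p) for _, _, p in cells if p > 0),
    }


if __name__ == "__main__":
    smooth = [[0, 0, 0], [0, 0, 0]]        # uniform patch
    rough = [[0, 1, 0], [1, 0, 1]]         # alternating patch
    print(glcm_features(smooth, levels=2))
    print(glcm_features(rough, levels=2))
```

On these two patches the uniform one gives the lower contrast and entropy and the higher ASM, which is the pattern the text associates with a more delicate or greasier coating.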
In the pulse parameters, t4 is the time from the starting point to the dicrotic notch of the sphygmogram, corresponding to the systolic period of the left ventricle, and t5 is the time from the dicrotic notch to the end point of the sphygmogram, corresponding to the diastolic period of the left ventricle. t4 and t5 of Yin deficiency syndrome were smaller than those of Qi deficiency syndrome, indicating that systole and diastole were shorter in Yin deficiency syndrome than in Qi deficiency syndrome. The pulsation cycle t of Yin deficiency syndrome also showed a decreasing trend, indicating that the pulse rate in Yin deficiency syndrome was slightly higher. In addition, the dicrotic notch amplitude h4 was elevated in Yin deficiency syndrome. In Qi deficiency syndrome, h3/h1, h1/t1, and t1 were prolonged, reflecting that the pulse force of Qi deficiency syndrome was soft and weak; the amplitude of the main wave h1 was reduced and the area under the sphygmogram was smaller, indicating that the pulse shape was thin and small. All in all, in Qi deficiency syndrome the tongue was pale and the pulse was weak, while in Yin deficiency syndrome the tongue body was redder or crimson, the tongue color was brighter, the tongue coating was thinner and greasier, and the pulse was finer.
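The timing indices discussed here follow directly from four landmarks of one pulse cycle: start point, main wave crest, dicrotic notch, and end point. The sketch below derives t1, t4, t5, t, and two amplitude ratios from such landmarks, using the definitions given in the methods. It assumes the landmarks have already been located and that amplitudes are measured from the start point; the function name and the numbers are hypothetical and do not reflect the PulseCol software.

```python
def pulse_indices(start, crest, notch, end):
    """Derive time and amplitude indices from landmarks of one pulse cycle.

    Each argument is a (time_in_seconds, amplitude) tuple.
    """
    h1 = crest[1] - start[1]              # main wave amplitude
    h4 = notch[1] - start[1]              # dicrotic notch amplitude
    t1 = crest[0] - start[0]              # start point to main wave crest
    t4 = notch[0] - start[0]              # start point to dicrotic notch (systole)
    t5 = end[0] - notch[0]                # dicrotic notch to end point (diastole)
    return {
        "t": end[0] - start[0],           # complete cycle, equals t4 + t5
        "t1": t1, "t4": t4, "t5": t5,
        "h1": h1, "h4": h4,
        "h4/h1": h4 / h1,
        "h1/t1": h1 / t1,
        "t4/t5": t4 / t5,
    }


if __name__ == "__main__":
    # Hypothetical landmarks of a single cycle (not study data).
    indices = pulse_indices(start=(0.00, 0.0), crest=(0.12, 20.0),
                            notch=(0.34, 8.0), end=(0.80, 0.0))
    for name, value in indices.items():
        print(f"{name}: {value:.3f}")
```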

Modeling Analysis of Qi Deficiency Syndrome and Yin Deficiency Syndrome Based on Data of Tongue and Pulse.
In recent years, with the rapid development of computer technology, different recognition algorithms and machine learning methods, such as logistic regression [38], SVM [22,39], random forest [40], and neural network [15,41], and other data mining technologies have been widely used in medical research. The quantitative analysis of diagnostic information through various mathematical models has promoted the development of TCM informatization. In this study, symptom data and tongue and pulse data were used to classify syndromes. The results showed that the classification efficiency of models based on different datasets was ordered as follows: tongue and pulse < symptom < symptom and tongue and pulse, indicating that tongue and pulse data contributed to the classification of syndromes to some extent. TCM data are massive and complex, mixing quantitative and qualitative, subjective and objective, deterministic and fuzzy, and linear and nonlinear information, and TCM syndromes have complex multidimensional characteristics and are associated with multiple microindices. Therefore, exploring the relationship between different syndromes and physical and chemical indices can effectively assist syndrome differentiation, especially when symptoms are not evident. Research has also shown that it is reasonable to combine microindices with macrosymptoms. Using machine learning or data mining methods to build TCM syndrome or disease diagnosis models can make the process of syndrome differentiation and treatment more objective, standardized, and intelligent [42][43][44].
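The comparison across datasets amounts to fitting the same kind of model on different column subsets of one feature table. The sketch below shows that pattern with a small logistic regression trained by batch gradient descent. It is a toy illustration under loud assumptions: the study built its models in Orange with an 8 : 2 train/test split, whereas this sketch reports training accuracy on six synthetic rows, and the function names, column meanings, and values are all hypothetical.

```python
import math


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi    # predicted prob minus label
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of rows whose predicted class matches the label."""
    w, b = model
    preds = [int(b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def select(X, cols):
    """Keep only the listed feature columns."""
    return [[row[c] for c in cols] for row in X]


if __name__ == "__main__":
    # Synthetic rows: column 0 = a symptom score, column 1 = a tongue/pulse index.
    X = [[0.2, 1.0], [0.4, 0.8], [0.3, 0.9], [0.8, 0.2], [0.9, 0.3], [0.7, 0.1]]
    y = [0, 0, 0, 1, 1, 1]
    feature_sets = {"symptom": [0], "tongue and pulse": [1],
                    "symptom and tongue and pulse": [0, 1]}
    for name, cols in feature_sets.items():
        Xs = select(X, cols)
        print(name, accuracy(train_logistic(Xs, y), Xs, y))
```

With real data the accuracy would be measured on the held-out 20% rather than on the training rows.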

Limitations and Future Work.
This research is based on a real-world investigation, and the results basically conform to the syndrome distribution of NSCLC in the clinic. However, the study also has some limitations. First, due to limitations of time and place, the sample size of this study is not large enough. Second, the basic data of the subjects are not comprehensive enough: statistics on height, weight, body mass index (BMI), history of present illness, past medical history, etc., are lacking, which may affect the results. Last but not least, this study mainly focused on the common NSCLC syndromes of Qi deficiency and Yin deficiency and did not explore other syndromes. In the future, large-scale, multicenter epidemiological investigation should be carried out, the collection of four-diagnosis information and basic characteristics needs to be more standardized and complete, and further research based on more comprehensive syndrome differentiation results is needed.

Conclusions
In conclusion, objective tongue and pulse data of NSCLC are useful for the classification of TCM syndrome, which can improve the accuracy of TCM syndrome classification to a certain extent. Tongue and pulse diagnosis parameters can provide new ideas and methods for TCM syndrome differentiation of Qi deficiency syndrome and Yin deficiency syndrome of NSCLC.