A Real-World Evidence Study for Distribution of Traditional Chinese Medicine Syndrome and Its Elements on Respiratory Disease

Background This study aimed to investigate the distribution and characteristics of traditional Chinese medicine (TCM) syndrome and its elements on respiratory diseases (RDs) based on real-world data (RWD). Methods A real-world study was performed to explore the relationships among TCM syndrome and RDs based on electronic medical information. A total of 26,074 medical records with complete data were available for data analysis. Factor analyses were used to reduce dimensions of TCM syndrome elements and detect common factors. Additionally, cluster analyses were employed to assess combinations of TCM syndrome elements. Finally, association rule analyses were performed to investigate the structures of TCM syndrome elements to estimate the patterns of TCM syndrome. Results A total of 27 TCM syndromes were extracted from RWD in this work. There were four TCM syndromes with >5.0% frequency based on the distribution frequency. The top five pathogenesis TCM syndrome elements were Tan, Huo, Feng, Qi_Xu, and Han. Factor analysis, cluster analysis, and association rule analysis demonstrated that Tan, Huo, Feng, Qi_Xu, Shen, and Fei were the core TCM syndrome elements. Conclusion Four common Shi TCM syndromes on RDs were identified: Tan_Re_Yong_Fei, Tan_Zhuo_Zu_Fei, Feng_Re_Fan_Fei, and Feng_Han_Xi_Fei; two core common Xu TCM syndromes (Fei_Shen_Qi_Xu and Fei_Yin_Xu) and two core common Mix TCM syndromes (Fei_Pi_Qi_Xu-Tan_Shi_Yun_Fei and Fei_Shen_Qi_Xu-Tan_Yu_Zu_Fei) were also determined. The core TCM syndrome elements of Tan, Huo, Feng, Qi_Xu, Shen, and Fei were identified in this work.


Introduction
Traditional Chinese medicine (TCM) has progressively gained wider attention worldwide due to its specific theory and long history of clinical practice [1]. Worldwide, the use of TCM is reported to be increasing by 10-20% annually [2]. TCM is considered to have the advantages of strengthening the body through invigoration and clearing the root of disease, with fewer side effects and multi-target effects [3]. TCM has been widely used to manage many diseases, such as inflammatory diseases, cancer, and respiratory disease [4][5][6][7][8].
In clinical practice, TCM practitioners form diagnoses and prepare prescriptions mainly on the basis of the pattern of symptom manifestations, which varies between individuals and is known as TCM syndrome (also called "ZHENG") [9]. A TCM syndrome is a specific set or pattern of symptoms reflecting the body's internal and external condition at a certain stage [9]. Generally, it describes patterns of bodily disharmony according to the eight principles. Syndromes are also differentiated according to another system: qi, blood, body-fluid differentiation, and zangfu (organ). TCM syndrome has successfully guided disease research and the prescribing of herbal formulas. Moreover, TCM syndrome, in conjunction with modern medicine diagnosis, is fundamental to diagnosis and treatment in China. Syndrome elements, which are obtained through a four-step validation, contribute to syndrome patterns. The syndrome elements are used to explain TCM syndrome patterns and to reflect the innate pathologic factors [10].
Respiratory diseases (RDs) are a major cause of mortality and morbidity worldwide, and they represent an enormous and increasing healthcare and economic burden, especially asthma, chronic obstructive pulmonary disease (COPD), and interstitial pulmonary disease (IPD) [11]. Modern medicine therapies cannot reverse all the symptoms, and the currently available drugs may induce prominent side effects such as osteoporosis [12]. In China, TCM treatment for RDs has a long history, and numerous basic and clinical studies have reported that TCM has curative effects [13,14].
Evidence from real-world studies, outside of randomized clinical trials (RCTs), is considered a way to tailor medical decision-making more closely to the characteristics of individuals, making clinical practice more personalized and effective [15]. Real-world evidence (RWE) is derived from data associated with outcomes from the care of heterogeneous patients as experienced in real-world practice settings. Fortunately, big data methods can effectively manage and process massive-scale, multiple-source, and heterogeneous real-world data (RWD) [15]. Moreover, data mining or machine learning algorithms, such as factor analysis, cluster analysis, and association rule analysis, can analyze and model the data for RWE. RWE studies will never replace the more robust RCT; however, the emerging trend is to incorporate such evidence to benefit medical decision-making. It is therefore crucial to perform real-world studies on TCM syndrome to accumulate RWE based on RWD by using big data methods with data mining or machine learning algorithms.
Currently, studies on TCM syndrome differentiation have been widely developed. However, syndromes pertaining to RDs have not been explored based on RWD through the use of big data methods and data mining algorithms. This study aimed to investigate the distribution and characteristics of TCM syndrome and its elements across RDs as a whole based on real-world datasets.

Study Design and Participants.
A real-world study was performed to explore the relationships between TCM syndrome and internal diseases based on electronic medical information, including electronic medical records (EMR), the hospital information system (HIS), the laboratory information system (LIS), and the picture archiving and communication system (PACS). Inclusion criteria were as follows: diagnoses satisfying the diagnostic criteria of modern medicine for respiratory diseases with detailed medical records, a first final diagnosis of a respiratory disease, and subjects being over the age of 18. Exclusion criteria were as follows: unclear diagnoses or diagnoses not satisfying the diagnostic criteria of modern medicine for respiratory diseases, subjects not satisfying the age criteria, and unclear or incomplete medical records.
Data Collection and Preparation. The content of patients' medical records was based on EMR and standard Chinese guidelines. The content mainly included general information, complaints, medical histories, modern medicine diagnoses, and TCM diagnoses. Medical record data were transferred from the EMR systems of five hospitals and loaded onto the medical big data platforms of the institutes of biomedical informatics and biostatistics and the institutes of integrative medicine of Fudan University.
In this work, RDs included chronic obstructive pulmonary disease (COPD), lung infection, chronic bronchitis, lung cancer, acute bronchitis, bronchial asthma, bronchiectasis, acute upper respiratory tract infection, pulmonary tuberculosis, pleural fluid, interstitial lung disease, pleurisy, pulmonary heart disease, pneumoconiosis, pneumothorax, lung abscess, pulmonary encephalopathy and pulmonary embolism. Diagnosis criteria of these RDs were based on the Chinese Society of Respiratory Diseases, the Chinese Medicine Association and its guidelines, and the book Harrison's Principles of Internal Medicine [16,17].
The general information (age, gender, time of hospital admission, duration of hospitalization, and clinical outcomes), TCM syndrome, TCM diagnosis, the first final modern medicine diagnosis, and additional modern medicine diagnoses were extracted by using Python 3.5 programs. Standard common data models (CDM) for RDs of integrative medicine, including standard TCM syndrome, were created to unify the standard codes for data analysis [9]. An integrative medicine data warehouse for RD research was created based on the standard CDM and the big data platform. A total of 26,074 medical records with complete data consisting of general information, TCM syndrome, and TCM and final modern medicine diagnoses were available for subsequent data analysis (Figure 1).
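The record counts reported for the selection flowchart (Figure 1) can be checked with simple arithmetic. The following is a minimal sketch, not the authors' extraction code; the function name `remaining_records` and the shortened exclusion labels are illustrative assumptions, while the counts are those reported in this study.

```python
# Counts as reported in the Figure 1 flowchart; labels are shortened paraphrases.
COLLECTED = 30254
EXCLUSIONS = [
    ("first final diagnosis not meeting RD diagnostic criteria", 793),
    ("unclear first final diagnosis of RD", 1454),
    ("unclear or incomplete TCM syndrome", 1645),
    ("age below the flowchart threshold", 165),
    ("RD category count < 20 (as worded in the flowchart)", 123),
]


def remaining_records(total, exclusions):
    """Subtract each exclusion count from the total, printing the running count."""
    remaining = total
    for reason, n in exclusions:
        remaining -= n
        print("{}: -{} -> {}".format(reason, n, remaining))
    return remaining


final_n = remaining_records(COLLECTED, EXCLUSIONS)
```

Under these counts, `final_n` matches the 26,074 records reported as available for analysis.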
Data Analysis. In this work, the main data analyses were conducted to assess the distribution of TCM syndrome and its elements, including the following: (1) frequency distribution of TCM syndrome; (2) frequency distribution of TCM syndrome elements; (3) the combination of TCM syndrome elements based on data mining algorithms; and (4) consistency analysis of TCM syndrome and the combinations of its elements. Differences in variables among subjects grouped by gender were determined by one-way analysis of variance. Among the groups, differences in properties were detected by χ² analysis. Tests were two-sided, and a p-value of < 0.05 was considered significant. Frequency analyses were employed to explore the proportion of RDs and the proportion of TCM syndromes for the respiratory system. Elements of TCM syndrome were generated according to standard TCM syndrome and its element guidelines [10]. Moreover, frequency analyses were performed to assess the proportion of TCM syndrome elements. Results were analyzed using the Statistical Package for Social Sciences for Windows, version 16.0 (SPSS, Chicago, IL, USA). Factor analyses were used to reduce the dimensions of TCM syndrome elements and detect the structure among these TCM syndrome elements. The Kaiser-Meyer-Olkin (KMO) test and Bartlett's test of sphericity were used to evaluate the suitability of the collected TCM syndrome elements for factor analysis [18].
Figure 1: Records collected from medical information systems in hospitals (n=30254); excluded records with the first final diagnosis not satisfying the diagnostic criteria of respiratory diseases (n=793); excluded records with an unclear first final diagnosis of respiratory disease (n=1454); excluded records with unclear or incomplete traditional Chinese medicine syndrome (n=1645); excluded records with age < 20 years (n=165) and records with the number of diseases belonging to respiratory disease < 20 (n=123); records with complete data for subsequent data analysis (n=26074).
Principal component analyses were applied to extract common factors [18]. Varimax rotation was applied to the extracted common factors to obtain the rotated factor loadings [18]. In this work, factor loadings with an absolute value larger than or equal to 0.20 were reported. Cluster analyses were employed to classify TCM syndrome elements. Hierarchical cluster analyses were conducted using Ward's method to generate a dendrogram for identifying similar clusters. Cluster boundaries were defined by large distances between successive fusion levels [19].
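The factor-retention and loading-threshold steps can be illustrated with a short sketch. It assumes the eigenvalue-greater-than-1.0 criterion that the Results report for the characteristic root values, together with the 0.20 loading threshold stated above. The function names and all numeric values are hypothetical and are not taken from the study's data; the actual analysis was run in SPSS.

```python
def retain_factors(eigenvalues, min_eigenvalue=1.0):
    """Return (number of retained factors, cumulative variance % they explain).

    Factors are retained when their eigenvalue exceeds min_eigenvalue.
    """
    total = sum(eigenvalues)
    kept = [ev for ev in sorted(eigenvalues, reverse=True) if ev > min_eigenvalue]
    cumulative_pct = 100.0 * sum(kept) / total
    return len(kept), cumulative_pct


def salient_loadings(loadings, threshold=0.20):
    """Keep only elements whose absolute factor loading is >= threshold."""
    return {name: value for name, value in loadings.items()
            if abs(value) >= threshold}


# Hypothetical eigenvalues for a five-variable example (not study data).
example_eigenvalues = [2.4, 1.6, 1.1, 0.5, 0.4]
n_kept, pct = retain_factors(example_eigenvalues)

# Hypothetical rotated loadings of one factor (not study data).
example_loadings = {"Tan": 0.71, "Huo": 0.65, "Feng": -0.12, "Han": -0.34}
reported = salient_loadings(example_loadings)
```

In this toy example three factors are retained, and the loading for Feng falls below the threshold and would be omitted from a loading table.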
Additionally, considering the complex network structure of TCM syndrome elements, association rule analyses were performed to investigate the structures of TCM syndrome elements and to estimate the distribution of TCM syndrome. Association rule mining proceeds in two stages: a set of frequent rules is generated first, and the strength of the rules obtained from the first stage is then evaluated [20]. The Apriori algorithm was used to evaluate the patterns of association among TCM syndrome elements. Three parameters, support, confidence, and lift, were used to assess the strength of rules [20]. Let X and Y be item sets, X=>Y an association rule, and T the set of transactions of a given dataset. Support indicates how frequently the item set appears in the dataset, defined as the proportion of transactions in T containing both X and Y. Confidence indicates how often the rule has been found to be true, defined as the conditional probability of having Y given X. Lift is the ratio of the observed support to that expected if X and Y were independent. Lift values of <1, 1, and >1 signify negative, independent, and positive associations between X and Y, respectively. In this work, rules with a support of >10% and a confidence of >80% were reported. Data mining was performed using SPSS Modeler (version 18.0, Chicago, IL, USA) and packages in Python 3.5.
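The three rule-strength parameters defined above can be computed directly from a list of transactions. The sketch below is illustrative only: the function names and the five toy records are assumptions, not the study's data or its SPSS Modeler workflow, and it evaluates a single given rule rather than running the full Apriori search.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    x, y = set(antecedent), set(consequent)
    n = len(transactions)
    n_x = sum(1 for t in transactions if x <= t)         # records containing X
    n_y = sum(1 for t in transactions if y <= t)         # records containing Y
    n_xy = sum(1 for t in transactions if (x | y) <= t)  # records containing both
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    lift = confidence / (n_y / n) if n_y else 0.0
    return support, confidence, lift


def passes_thresholds(support, confidence, min_support=0.10, min_confidence=0.80):
    """Apply the reporting thresholds stated in the text (support >10%, confidence >80%)."""
    return support > min_support and confidence > min_confidence


# Toy records: each set holds the syndrome elements of one hypothetical case.
records = [
    {"Tan", "Huo", "Fei"},
    {"Tan", "Fei"},
    {"Feng", "Huo", "Fei"},
    {"Qi_Xu", "Shen", "Fei"},
    {"Feng", "Han"},
]

support, confidence, lift = rule_metrics(records, {"Tan"}, {"Fei"})
keep_rule = passes_thresholds(support, confidence)
```

For the toy rule Tan => Fei, two of five records contain both elements, every record containing Tan also contains Fei, and Fei appears in four of five records, so the lift exceeds 1 and indicates a positive association.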

Results
Characteristics of Individuals. The baseline characteristics of the 26,074 individuals are listed in Table 1. In the entire dataset, the proportion of males (n=15350) was 58.87%, and the mean age was 65.70 years. The average duration of hospitalization was 11.86 days. Males had more days of hospitalization than females (12.79 vs. 10.53, P<0.001). The major ethnicity in the dataset was Chinese Han (94.76%). The proportion of patients who improved or were cured was 95.16%.
Frequency Analysis of Respiratory Diseases. The distribution of RDs in the total sample is listed in Table 2. The main 16 diseases were analyzed in this work. COPD and lung infection were the two most frequent diseases in hospitals (32.05% for COPD and 27.81% for lung infection). The proportions of chronic bronchitis and lung cancer were 8.46% and 7.33%, respectively.
Frequency Analysis of TCM Syndrome. The distribution of TCM syndromes for RDs is listed in Table 3.
Table 1 note. * entrance from outpatient clinic, ** outcome concerning improved or cured individuals, and *** difference analyses for variables between male and female.
Table 2 (fragment). Chronic bronchitis: n = 2207, 8.46%; lung cancer (values not recovered).
In the entire sample, principal component analysis showed that the characteristic root values of the first 10 common factors were greater than 1.0, and their cumulative variance contribution rate reached 78.91%. A scree plot displayed the relationship between common factors and characteristic root values (Figure 2(a)), indicating that the curve for the first 10 common factors was steep and that the characteristic root values of the remaining common factors were small. Varimax rotation was used for factor rotation and transformation, and absolute factor loading values larger than or equal to 0.20 are listed in Table 5. Similarly, eight common factors, three common factors, and four common factors were extracted from the Shi, Xu, and Mix TCM syndrome groups, respectively.
Cluster Analysis for TCM Elements. In the entire sample, hierarchical cluster analysis indicated that the TCM syndrome elements were significantly different among three clusters (Figure 3). Cluster 1 comprised Huo, Tan, and Fei, and Cluster 2 comprised Shen, Qi Xu, Han, Yin Xu, and Xue Xu, while Cluster 3 comprised the rest of the TCM syndrome elements. In the Shi TCM syndrome group, three clusters were identified: Cluster 1 included Tan, Fei, and Huo, Cluster 2 included Feng, Han, and Shui Ting, and the rest of the TCM syndrome elements belonged to Cluster 3. In the Xu TCM syndrome group, two clusters were generated: Cluster 1 consisted of Fei, Shen, Qi Xu, and Yin Xu, and Cluster 2 consisted of Pi and Xue Xu. In the Mix TCM syndrome group, Cluster 1 included Huo, Yin Xu, Shen, and Xue Yu, while the rest of the TCM syndrome elements belonged to Cluster 2.

Association Rule Analysis for TCM Elements.
In the entire dataset, four rules satisfying the support and confidence thresholds are listed in Table 6. The strongest support% parameter (60.907) was between Tan and Fei. Three rules concerning Huo => Fei, Huo => Fei, and Huo and Tan => Fei had the strongest confidence% value (100). The association rule analysis indicated combinations among Tan, Huo, Feng, Qi Xu, and Fei (Table 6). In the Shi TCM syndrome group, five rules were identified, suggesting combinations among Tan, Huo, Feng, and Fei. Seven rules were reported in the Xu TCM syndrome group, indicating combinations among Qi Xu, Yin Xu, Shen, and Fei. In the Mix TCM syndrome group, combinations among Qi Xu, Tan, and Fei were reported. In general, association rule analysis demonstrated that Tan, Huo, Feng, Qi Xu, Shen, and Fei were the core TCM syndrome elements.

Discussion
We conducted a real-world study to investigate the distribution and characteristics of TCM syndrome and its elements in RDs based on RWD. RCTs are the gold standard in the generation of medical evidence; however, they are challenged by enrollment criteria, timelines, and atypical comparators [15]. RWE is seen as complementary to the evidence generated from RCTs, and RWE studies are increasingly becoming normal practice for establishing the significance of findings in clinical practice. More importantly, big data methods have the advantage of managing and collecting massive-scale, real-world medical data [15]. Furthermore, these methods can effectively transform, extract, and process multiple-source and heterogeneous RWD into a standard structured dataset. In this work, standard CDM data were used for data analysis. Factor analysis, cluster analysis, and association rule analysis were used to explore common factors and evaluate the combined patterns of TCM syndrome and its elements. To the best of our knowledge, this is the first real-world study in China to investigate the distribution and characteristics of TCM syndrome and its elements based on RWD by using big data methods and data mining or machine learning algorithms. In addition, this study provides valuable experience in real-world study design for clinical research on TCM.
In this study, the two most frequent RDs were COPD (32.05%) and lung infection (27.81%). Our findings indicated that the four common syndromes, in decreasing order of frequency, were Tan Re Yong Fei, Tan Zhuo Zu Fei, Feng Re Fan Fei, and Feng Han Xi Fei. Frequency analyses showed that these four Shi TCM syndromes had a frequency of >5.0%. In the Shi TCM syndrome group, factor analyses also indicated the four common syndromes for RDs. Cluster analyses and association rule analyses presented combinations among Tan, Huo, Feng, Han, and Fei, suggesting that the four TCM syndromes were the common ones for RDs.
Similarly, in the Xu TCM syndrome group, Fei Shen Qi Xu and Fei Yin Xu were the two core common TCM syndromes, and among the Mix TCM syndromes, Fei Pi Qi Xu-Tan Shi Yun Fei was the most common. Additionally, frequency analyses, factor analyses, cluster analyses, and association rule analyses for TCM syndrome elements demonstrated that the predominant elements in the pathogenesis of RDs were Tan, Huo, Feng, and Qi Xu. The main disease locations were Fei and Shen. Currently, a standard TCM syndrome differentiation has not yet been established for some RDs, such as COPD. In general, standard TCM syndrome differentiation has been established primarily from literature analysis and expert consultation, and there is limited evidence for clinical application. In this study, data collected from 26,074 medical case records reflected the common clinical syndromes of RDs, which provides strong evidence for a standard TCM syndrome differentiation. In TCM, the syndrome pattern is the principle for treatment, and more attention should therefore be paid to the accuracy of syndrome classification. The eight syndrome patterns identified in this study may provide a guideline for the differentiation and treatment of RDs.
Syndrome elements are the smallest units of syndrome patterns, explaining the complexity and flexibility of TCM syndrome differentiation and reflecting the innate pathologic factors [10]. The real-world study demonstrated that the lungs were more easily attacked by Tan, Huo, and Feng. Pathogenic Feng (wind) is the leading pathological cause of all diseases. When the defense system is weak and the protective Qi loses its ability to protect the body, foreign pathogens, which are often accompanied by other morbidity factors, invade the human body [21]. Phlegm syndrome appears in almost all RDs. For instance, the frequency of phlegm in COPD is almost 60% [22]. The pathogenic factors may disrupt the function of the lungs, block Qi and blood circulation, influence the function of organs, and cause numerous negative changes in the body. Thus, phlegm is the pathological product of disturbed transportation and stagnation of body fluids [23]. Moreover, phlegm syndrome is almost always accompanied by other syndromes, such as phlegm-damp, phlegm-heat, and phlegm-stasis syndromes [12,24]. Consistent with the previous findings, the syndrome elements Tan (phlegm), Huo (fire), Feng (wind), and Shui Ting appeared frequently in RDs in the present study.
Although Shi syndromes had the highest distribution frequency, Xu syndromes appear throughout the whole course of disease. Deficiency at the origin mainly included lung and Qi deficiency; as the disease continued to advance, Yin and Pi deficiency appeared. Over a long course, Shen was involved and became insufficient. If this brought about Yin deficiency of Shen, Gan wood was not nourished, finally leading to Gan and Shen deficiency and prosperity of fire [25]. When the syndrome was accurately differentiated and distinguished, the primary and secondary symptoms were taken into account. For example, a clinical investigation of 2,500 adult asthma cases identified the common ZHENG with primary and secondary symptoms [24]. Interestingly, these phenomena were in accordance with the results of the frequency distribution analyses of syndromes and elements.
This study has some limitations. Firstly, the present study was based on a real-world study design, so selection bias could not be avoided; however, the multiple-center, massive-scale collection of cases can reduce this bias. Secondly, the multiple-source and heterogeneous medical information could make data analysis difficult. Fortunately, big data methods and data mining or machine learning algorithms could effectively transform, extract, and process this information to generate standard CDM data. Thirdly, this work only explored the distribution of TCM syndrome and its elements at the overall level for RDs. In the near future, such analyses should be performed separately for each respiratory disease. Despite these limitations, the results of this study contribute to the standardization of RD syndromes and treatment.

Conclusion
In summary, the distribution of TCM syndromes and their elements was identified through a real-world study. The overall 20 syndrome elements of RDs as a whole are combined into eight main syndromes, including Shi, Xu, and Mix syndromes. The six core syndrome elements, Tan, Huo, Feng, Qi Xu, Shen, and Fei, are the basis of syndrome differentiation for RDs and create a bridge to the standardization of RD syndromes. Additionally, this study demonstrated the close correlation of some elements.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Additional Points
The terms are interpreted as follows: Fei: the lung; Pi: the spleen; Shen: the kidney; Xin: the heart; Nao: the brain; Shi: the excess; Xu: the deficiency; Huo: the fire; Tan: the phlegm; Feng: the wind.

Ethical Approval
This study was approved by the Committee of Huashan Hospital, Shanghai, China. The methods were carried out in accordance with the approved guidelines.

Disclosure
Fei Xu, Qing Kong, Zihui Tang, and Jingcheng Dong were from the Department of Integrative Medicine, Huashan Hospital, and Institutes of Integrative Medicine, Fudan University, Shanghai, China; Wengqiang Cui was from the Department of Integrative Medicine and Neurobiology, School of