Natural Language Processing Algorithms for Normalizing Expressions of Synonymous Symptoms in Traditional Chinese Medicine

Background: The modernization of traditional Chinese medicine (TCM) demands systematic data mining of medical records. However, this process is hindered by the fact that many TCM symptoms have the same meaning but different literal expressions (i.e., TCM synonymous symptoms). This problem can be solved by using natural language processing algorithms to construct a high-quality TCM symptom normalization model that normalizes TCM synonymous symptoms to unified literal expressions. Methods: Four types of TCM symptom normalization models based on natural language processing were constructed in search of a high-quality one: (1) a text sequence generation model based on a bidirectional long short-term memory (Bi-LSTM) neural network with an encoder-decoder structure; (2) a text classification model based on a Bi-LSTM neural network and a sigmoid function; (3) a text sequence generation model based on bidirectional encoder representation from transformers (BERT) with the sequence-to-sequence training method of the unified language model (BERT-UniLM); and (4) a text classification model based on BERT and a sigmoid function (BERT-Classification). The performance of the models was compared using four metrics: accuracy, recall, precision, and F1-score. Results: The BERT-Classification model outperformed the models based on Bi-LSTM and BERT-UniLM with respect to the four metrics. Conclusions: The BERT-Classification model has superior performance in normalizing expressions of TCM synonymous symptoms.


Introduction
Traditional Chinese medicine (TCM) symptoms are recorded by TCM practitioners, who sometimes use different words when recording the same symptoms as a consequence of their diverse experience and educational backgrounds. These variations in wording lead to the phenomenon known as "one symptom with different literal expressions," which is prevalent in TCM medical records. Wang et al. [1] reported that approximately 80% of TCM symptoms were recorded with multiple expressions. Although the literal expressions of these symptoms differ, they have the same meaning, and their use does not affect understanding.
Thus, the use of these alternative symptoms does not affect the pathogenesis diagnosis. In summary, TCM symptoms that have the same meaning but different literal descriptions are known as TCM synonymous symptoms. For example, the symptom "lack of appetite" (纳减) can also be expressed as "loss of appetite" (纳差) or "decreased appetite" (食欲减低). They all mean a reduced desire to eat and are used in the description of spleen Qi deficiency (脾气虚).
It is essential to explore and analyze TCM medical records for the purpose of TCM modernization [2,3]. However, the abundance of synonymous symptoms in TCM medical records hinders systematic scientific knowledge discovery. Referring to the TCM terminology [4] published by relevant authorities, it is possible to establish a TCM thesaurus and then normalize each symptom in TCM medical records to a symptom that has the same meaning in the thesaurus, so that TCM synonymous symptoms would have uniform literal expressions.
That is, TCM symptom normalization is a feasible method for handling TCM synonymous symptoms. However, manual TCM symptom normalization is time-consuming and labor-intensive because of the large and growing quantity of TCM electronic medical records.
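At its core, thesaurus-based normalization is a lookup from each recorded expression to its canonical form. A minimal sketch follows; the entries are illustrative examples taken from this article, whereas the real TCM Thesaurus is far larger and curated by domain experts:

```python
# Minimal sketch of thesaurus-based lookup normalization (illustrative
# entries only; the real TCM Thesaurus is expert-curated and much larger).
THESAURUS = {
    "loss of appetite": "lack of appetite",
    "decreased appetite": "lack of appetite",
    "thinning and shapeless stool": "loose stool",
}

def normalize(symptom: str) -> str:
    """Map an original symptom to its normalized form; pass through unknowns."""
    return THESAURUS.get(symptom, symptom)
```

The pass-through for unknown symptoms is one possible design choice; in practice, unmatched symptoms are exactly the cases that require manual review or a learned model.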
Natural language processing (NLP), which has undergone extraordinary development in recent years, provides valuable support for the automatic processing of text data, such as language translation [5], question answering [6], and information processing of medical texts [7][8][9][10]. This success suggests that NLP technology will be effective for normalizing the expressions of TCM synonymous symptoms.
In previous work, researchers have proposed NLP-based normalization models for biomedical fields, such as Word2Vec [11], Jaccard similarity [12], DNorm [13], and BERT-based ranking [14], from the perspective of similarity matching. In addition, from the perspective of named entity recognition (NER), there are transition-based models [15] and Bi-LSTM-CNNs-CRF [16]. Although the performance of these models is satisfactory according to the published reports, two problems are worthy of further exploration, concerning their applicability to normalizing TCM symptoms and the modeling concepts of the NLP models: (1) With regard to applicability, the above models are used for normalizing multiple synonymous terms to one term. However, they are not suitable for cases in which synonymous symptoms correspond to multiple normalized symptoms. For example, "less white sputum and difficult to expectorate" (痰少色白难咳) and "less white phlegm and not easy to expectorate" (少量白痰且不易咳出) are synonymous symptoms that should be normalized to "less phlegm" (痰少), "white phlegm" (痰白), and "expectoration difficulties" (痰难咳出). (2) With regard to the modeling concept, approaches based on similarity matching and NER have been reported. However, many models constructed from the perspectives of sequence generation and text classification have also shown excellent performance and applicability in NLP tasks [17,18]. Therefore, it is necessary to explore the applicability of sequence generation and text classification to this normalization task and to investigate whether better performance can be achieved.
Accordingly, the objective of this study is to develop models for normalizing the expressions of TCM synonymous symptoms from the perspectives of sequence generation and text classification and to compare and analyze the applicability and performance of the models, so as to select the best one.

Methods
The workflow of this study is shown in Figure 1. It can be divided into three parts: (1) collecting TCM symptoms from medical records (sample collection), (2) preparing training, development, and test data sets (division of data sets), and (3) constructing models for normalizing expressions of TCM synonymous symptoms (model construction).

Data Sources and Labeling.
In total, 3,252 medical records, recorded by 22 TCM doctors on the platform of the "Heritage Program of Chinese Well-Known Experts" [19], were collected. The symptoms in the medical records were regarded as the original symptoms, each of which was then labeled with the corresponding normalized symptom according to the TCM Thesaurus (from the Beijing University of Chinese Medicine TCM Information Science Research Center). Two researchers, who had obtained the qualification of TCM practicing physician and had been trained by the provider of the TCM Thesaurus, performed the labeling. Two additional experts in the TCM Thesaurus checked the labeling results independently, and inconsistent results were submitted to a third expert for review and discussion to ensure consistency.
There are two forms of original symptoms in medical records: single symptoms and complex symptoms. A single symptom is an original symptom that corresponds to only one clinical manifestation; such a symptom was labeled with one normalized symptom by referring to the TCM Thesaurus. For example, "thinning and shapeless stool" was labeled as "loose stool." A complex symptom is an original symptom that corresponds to multiple clinical manifestations; such a symptom was labeled with multiple normalized symptoms. For example, "dry and itchy throat" was labeled as "dry throat" and "itchy throat" by referring to the TCM Thesaurus.
In total, 16,808 nonrepetitive original symptoms were collected from the 3,252 medical records, corresponding to 1,501 normalized symptoms, of which 339 appeared only once. The collected original symptoms and labeled normalized symptoms served as the input and output data, respectively, of the TCM symptom normalization models.

Partition of Data Sets.
Two strategies were used to divide the collected data into training, development, and test data sets. The first strategy was to divide the medical records randomly by source doctor. The nonrepetitive original symptoms recorded by one randomly selected doctor, and the corresponding normalized symptoms, were used as a development set to set the parameters of the models. The nonrepetitive original symptoms recorded by another randomly selected doctor, and the corresponding normalized symptoms, were used as a test set to observe the ability of the models to normalize the expression of TCM symptoms. The nonrepetitive original symptoms recorded by the 20 other doctors, and the corresponding normalized symptoms, were used as the training set. These data sets were called the total data sets (TDS). This division is suitable for evaluating the performance of the TCM symptom normalization models in practical applications.

The second strategy for dividing the collected data into training, development, and test data sets was based on high-frequency normalized symptoms. These data sets were called the high-frequency data sets (HFDS). According to Zipf's law [20], the threshold between high frequency and low frequency is N = (−1 + √(1 + 8 × I₁))/2, where I₁ is the number of normalized symptoms that appeared only once. Normalized symptoms with a frequency greater than 26 were defined as high-frequency normalized symptoms. The high-frequency normalized symptoms and the corresponding original symptoms were included in the HFDS. The ten most frequent normalized symptoms and their corresponding numbers of original symptoms are shown in Figure 2.
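The Zipf's-law threshold above is a simple closed-form computation. A short sketch, using this study's value I₁ = 339:

```python
import math

def high_frequency_threshold(i1: int) -> float:
    """Zipf's-law threshold N = (-1 + sqrt(1 + 8 * I1)) / 2, where I1 is
    the number of normalized symptoms that appeared exactly once."""
    return (-1 + math.sqrt(1 + 8 * i1)) / 2

# With I1 = 339, N is about 25.5, which is consistent with treating
# normalized symptoms of frequency 26 or more as high-frequency.
n = high_frequency_threshold(339)
```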
In the HFDS, 70% of the data (6,768 original symptoms and the corresponding normalized symptoms) were randomly selected as a training set, 15% (1,471 original symptoms and the corresponding normalized symptoms) were used as a development set, and 15% (1,425 original symptoms and the corresponding normalized symptoms) were used as a test set. The numbers of samples in the HFDS and TDS are shown in Table 1.
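A random 70/15/15 split of this kind can be sketched as follows (a hypothetical helper; the study's actual splitting code is not published):

```python
import random

def split_70_15_15(samples, seed=0):
    """Shuffle and split samples into 70% train, 15% dev, 15% test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.70)
    n_dev = int(len(shuffled) * 0.15)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```

Fixing the seed makes the split reproducible; giving the test set the remainder ensures no sample is dropped by integer rounding.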

Model Construction.
From the perspective of text sequence generation, a bidirectional long short-term memory (Bi-LSTM) recurrent neural network (RNN) with an encoder-decoder structure [21], combined with the Luong attention mechanism [22], was used to establish four models for TCM symptom normalization: (1) the Encoder (Char)-Decoder (Char) model, in which the input (the original symptom) and the output (the normalized symptom) were both in character form (multiple output normalized symptoms were separated by ","); (2) the Encoder (Word)-Decoder (Char) model, in which the input was in word form and the output was in character form; (3) the Encoder (Char)-Decoder (Label) model, in which the input was in character form and the output was in label form; and (4) the Encoder (Word)-Decoder (Label) model, in which the input was in word form and the output was in label form. The structure of the four models was consistent; only the input and output forms differed, as shown in Figure 3(a). This study also applied a Bi-LSTM and a fully connected layer with a sigmoid function to explore the feasibility of TCM symptom normalization from the perspective of text classification. In this case, the model output was in label form, and the input was in character or word form (see Figure 3(b)). In the Encoder (Char)-Classification model, the input was in character form; in the Encoder (Word)-Classification model, the input was in word form. The words input to the models were obtained from the original symptoms by a segmentation tool [23].
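The character form and label form can be illustrated concretely. A minimal sketch (hypothetical helpers; in the study, word-form input additionally required a Chinese word-segmentation tool):

```python
def char_form(symptom: str) -> list:
    """Character-form input: one token per character."""
    return list(symptom)

def label_form(normalized, label_vocab) -> list:
    """Label-form output: a multi-hot vector over the normalized-symptom
    vocabulary, so one original symptom can map to several labels."""
    return [1 if label in normalized else 0 for label in label_vocab]

# Example using the complex symptom discussed in the Introduction.
vocab = ["less phlegm", "white phlegm", "expectoration difficulties"]
y = label_form(["less phlegm", "white phlegm"], vocab)  # [1, 1, 0]
```

The label form is what makes the classification models naturally handle complex symptoms: each position in the multi-hot vector is an independent binary decision.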
Chinese-language pretraining weights, trained on a large number of Chinese corpora, can help achieve better results. Therefore, this study further used the unified language model (UniLM) based on the Chinese pretraining weights of bidirectional encoder representation from transformers (BERT) [18,24] to construct the TCM symptom normalization model. The training process first loaded the Chinese pretraining weights of BERT (https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip) and then trained with the sequence-to-sequence method of UniLM [18]. This training method was based on text sequence generation. Two output forms were used in training: a character-based output form, namely the BERT-UniLM (Char) model, and a label-based output form, namely the BERT-UniLM (Label) model, as shown in Figure 4(a). BERT and a fully connected layer with a sigmoid function were also used to construct a TCM symptom normalization model, namely the BERT-Classification model, as shown in Figure 4(b). Because the input of the pretrained BERT weights was in character form, the input of the BERT-based models was also in character form.

Model Parameters.
The encoder-decoder models had initialization weights sampled from a random uniform distribution in the range −0.05 to 0.05, the dimension of the embedding was 300, and the training batch size was 256. Adam was the optimizer [25]. According to the F1-score of the encoder-decoder models on the development set, the best parameter combinations were selected for the learning rate (from 0.0001, 0.0003, and 0.0005), the dropout rate (from 0.3 and 0.5), and the number of memory cells (from 128, 256, and 512).
For the encoder-classification models, the training batch size was 256. According to the F1-score of the models on the development set, the best parameter combinations were selected for learning rate (selected from 0.005, 0.01, and 0.03), dropout rate (selected from 0.3 and 0.5), and the number of memory cells (selected from 128, 256, and 512).
For the BERT-UniLM and BERT-Classification models, the training batch size was 16, the optimizer was Adam [25], and the learning rate was 0.0003. The other parameters were the default settings of the BERT neural network [24]. The TensorFlow neural network framework (http://www.tensorflow.org/), developed by Google, was used to implement the above models, and an NVIDIA GeForce RTX 2080 (11 GB memory) was used to train them. Training was terminated when the F1-score of a model on the development set had not improved for 20 epochs. Even with a fixed random seed, the results from different computers were still slightly biased. Therefore, after setting the model parameters, the modeling process was repeated 10 times; model performance was evaluated by four metrics and expressed as mean ± standard deviation (SD). The four metrics were accuracy, precision, recall, and F1-score: Accuracy = P/T; Precision = TP/(TP + FP); Recall = TP/(TP + FN); F1-score = 2 × Precision × Recall/(Precision + Recall). Here, P is the number of all correct results output by the model, and T is the total number of correct normalized symptoms corresponding to the test set. TP (true positive) is the number of results produced by the model that were consistent with the actual results, FN (false negative) is the number of correct results that the model failed to output, and FP (false positive) is the number of results produced by the model that were incorrect. The key model parameters and development set results are shown in Tables 2 and 3.
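Under these definitions, the metrics can be computed over multi-label predictions as follows (a hypothetical helper; the study's evaluation scripts are not published):

```python
def evaluate(predicted, gold):
    """Micro-averaged metrics over multi-label predictions, following the
    definitions above: TP = predicted labels that are correct, FP =
    predicted but wrong, FN = correct labels the model missed, and
    Accuracy = P/T with P = TP and T = total gold labels."""
    tp = sum(len(p & g) for p, g in zip(predicted, gold))
    fp = sum(len(p - g) for p, g in zip(predicted, gold))
    fn = sum(len(g - p) for p, g in zip(predicted, gold))
    total = sum(len(g) for g in gold)
    accuracy = tp / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```

Representing each prediction and each gold annotation as a set of normalized symptoms makes the handling of complex symptoms (multiple labels per original symptom) explicit.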

Statistical Analysis.
IBM SPSS 20.0 was used to analyze the results. For each group of indexes, one-way ANOVA was used if the variance between groups was homogeneous and the data were normally distributed; otherwise, the Kruskal-Wallis test was used.

Results

Performance of Models on Test Data Sets.
Generally, the performance of the models was better on the HFDS test set than on the TDS test set. With regard to model structure, the BERT-UniLM models outperformed the Encoder-Decoder models, as shown in Tables 4 and 5. In addition, the BERT-Classification model outperformed the BERT-UniLM models.
That is, the BERT-Classification model was the best model for normalizing expressions of TCM synonymous symptoms in this study, on both the HFDS and TDS test sets. The performance of the three classification models with different threshold values on the HFDS and TDS was also explored. On the HFDS, the performance of both BERT-Classification and Encoder-Classification was generally best when the threshold value was 0.2, as shown in Figure 5. On the TDS, the best threshold value was 0.1, as shown in Figure 6. Compared with the Encoder-Classification models, the BERT-Classification model achieved better results: its accuracy and F1-score were 0.9051 and 0.9073 on the HFDS and 0.8568 and 0.8574 on the TDS, respectively. The classification-based models can adjust the output threshold to change the recall. We believe this capability can be used for the retrieval of normalized symptoms, because retrieval prioritizes recall, that is, ensuring that the outputs contain the correct normalized symptoms. By lowering the output threshold, the models can output the top 5 and top 10 normalized symptoms above the threshold. Therefore, the retrieval ability was evaluated by the top 5 and top 10 recall; the results are shown in Table 6.
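The two output modes of a sigmoid classification head, precise thresholded output and recall-oriented top-k retrieval, can be sketched as follows (hypothetical helpers and scores for illustration):

```python
def predict_labels(scores, threshold):
    """Precise mode: output every label whose sigmoid score clears the threshold."""
    return {label for label, s in scores.items() if s >= threshold}

def top_k(scores, k):
    """Retrieval mode: the k highest-scoring candidate normalized symptoms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Illustrative sigmoid scores for one original symptom.
scores = {"dry throat": 0.91, "itchy throat": 0.34, "sore throat": 0.08}
```

Lowering the threshold (or raising k) trades precision for recall, which is exactly the lever the retrieval evaluation in Table 6 exploits.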

Performance of Models in Normalizing Single and Complex Symptoms.
In evaluating the various models for normalizing single symptoms (the original symptoms corresponding to one normalized symptom) and complex symptoms (the original symptoms corresponding to multiple normalized symptoms), we found that the performance of the BERT-Classification model was comprehensively superior, not only on HFDS but also on TDS, as shown in Figures 7 and 8.

Comparison with Other Normalization Models.
We also compared the BERT-Classification model with several other models that perform well for normalization, including state-of-the-art models reported by other researchers: the Jaccard similarity algorithm [12], Word2Vec with cosine similarity [11], DNorm [13], the transition-based model [15], Bi-LSTM-CNNs-CRF [16], and BERT-based ranking [14]. These models were not designed for the normalization of complex symptoms; therefore, we only compared the performance of the models in handling the 4,555 single symptoms taken from the HFDS. The 4,555 single symptoms, and their corresponding normalized symptoms, were divided into a training set (70%), a development set (15%), and a test set (15%). The development set was used to select the parameters of each model, except for the Jaccard method, which has no parameters to select.
The test results showed that the BERT-Classification model performed better than the other methods, as shown in Table 7.
We note that Jaccard similarity, Word2Vec with cosine similarity, DNorm, and BERT-based ranking can output a score for each normalized symptom. Therefore, these models can output the top 5 and top 10 normalized symptoms by score ranking to achieve retrieval. We used recall to observe the retrieval ability, as shown in Table 8. The results show that the BERT-Classification model has advantages in retrieval.
To further demonstrate the advantages of our model, we summarized the test results on the HFDS. According to the results, we comprehensively compared the performance and applicability of our model with those of existing models, as shown in Table 9.

Discussion
The normalization of expressions of TCM synonymous symptoms plays an important role in the collation of medical records, statistical mining, the construction of TCM knowledge databases, and the construction of TCM medical assistant decision-making systems [9]. The application of NLP technology improves the efficiency of normalization processing. NLP algorithms based on neural networks have been applied to normalizing biomedical texts [13,14] but not to normalizing the expressions of TCM synonymous symptoms. In this study, multiple models were constructed with NLP algorithms based on Bi-LSTM and the BERT neural network to explore the normalization of expressions of TCM synonymous symptoms.
In TCM synonymous symptom normalization, both normalization performance and the ability to handle one symptom corresponding to multiple normalized symptoms are crucial. The test results show that our BERT-Classification model outperforms previous models and has this ability, which previous models lack. In addition, the model supports the retrieval of candidate normalized symptoms: when it does not provide a suitable normalized symptom, it can retrieve other candidates according to the original symptom.
These advantages of the model provide technical support for the efficient normalization of TCM synonymous symptoms and make the model highly adaptable in medical settings.
In this study, the accuracy, recall, precision, and F1-score metrics were used to evaluate the performance of each model. The results show that the BERT-Classification model outperformed the other models with respect to all metrics, including the Encoder-Decoder, Encoder-Classification, and BERT-UniLM models designed in this study. This is because the performance of NLP models based on neural networks is strongly related to the extracted semantic features, and BERT excels at extracting semantic features [24]. Therefore, the BERT-Classification model, which extracts semantic features using BERT, is advantageous for normalization tasks. BERT-Classification, BERT-UniLM, and BERT-based ranking are all based on the BERT neural network; they differ only in their output layers, owing to their different modeling concepts. The results suggest that BERT-Classification performs best; therefore, the classification-based modeling concept may be the most conducive to normalizing TCM symptoms.

With regard to applicability, our proposed BERT-Classification model supports both the processing of original symptoms that correspond to multiple normalized symptoms and the retrieval of normalized symptoms. We use a sigmoid output function to handle the situation in which an original symptom corresponds to multiple normalized symptoms; this method is effective and outperforms sequence generation methods. Moreover, supporting the retrieval of normalized symptoms requires higher recall. Our BERT-Classification model can increase the recall by reducing the output threshold of the sigmoid function and thereby support retrieval.
In contrast to BERT-Classification, the other reported models cannot support both of the above applications simultaneously. Jaccard similarity, DNorm, Word2Vec with cosine similarity, and BERT-based ranking pair an original symptom with each normalized symptom and rank the normalized symptoms by their pairing score. Although these models can output multiple normalized symptoms by ranking them for retrieval, when the multiple normalized symptoms corresponding to an original symptom need to be output precisely, it is difficult to decide whether any result other than the highest-scoring normalized symptom should be output. The Bi-LSTM-CNNs-CRF model is designed to output only a single normalized symptom. In addition, because that model is based on the NER modeling concept, it cannot produce multiple candidate normalized symptoms, as the above models can, and therefore cannot be applied to the retrieval task. Although the Encoder-Decoder and BERT-UniLM models support the output of multiple normalized symptoms, they suffer from the same limitation as Bi-LSTM-CNNs-CRF and are not suitable for the retrieval of normalized symptoms.
The HFDS contained only high-frequency samples for modeling and testing, reflecting the performance of the BERT-Classification model under ideal conditions. Conversely, the TDS included both high-frequency and low-frequency samples, reflecting the performance of the model in practical applications. The performance of the model on the TDS was lower than that on the HFDS, which suggests that performance could be improved by increasing the number of low-frequency samples.

Conclusions
This study constructed models to normalize TCM synonymous symptoms from the NLP perspectives of text classification and sequence generation. The optimal model is the BERT-Classification model, which outperforms existing reported models in dealing with original symptoms that correspond to a single normalized symptom. Moreover, it also supports original symptoms that correspond to multiple normalized symptoms, and it can retrieve normalized symptoms. A limitation of this study is that the normalization models only address symptoms. Whether the models can be used for normalizing other synonymous terms, such as TCM treatment terms and TCM disease terms, remains to be studied. In addition, the pretrained BERT model based on large-scale corpora plays an important role in improving model performance; a BERT model trained on corpora from professional medical fields is likely to achieve better results for the normalization of medical terms. Therefore, the use of a large body of TCM literature to construct the pretrained model, to improve normalization performance, also needs further research.

Abbreviations
BERT: Bidirectional encoder representation from transformers
Bi-LSTM: Bidirectional long short-term memory
DR: Dropout rate
FN: False negative
FP: False positive
HFDS: High-frequency data sets
LR: Learning rate
MC: Memory cell
N/A: Not applicable
NLP: Natural language processing
RNN: Recurrent neural network
SD: Standard deviation
TCM: Traditional Chinese medicine
TDS: Total data sets
TP: True positive
UniLM: Unified language model

Data Availability
All the data and materials used in the current study are available from the corresponding author on reasonable request.

Ethical Approval
Not applicable.

Consent
Not applicable.
Disclosure

The funder had no role in study design, data collection, analysis, the decision to publish, or manuscript preparation.

Conflicts of Interest
The authors declare that they have no competing interests.

Authors' Contributions
YH Li and FQ Xu guided the whole work; L Zhou and SQ Liu developed all models; CY Li, YD Li, FQ Xu, and YM Sun collected medical data; SQ Liu and YZ Zhang performed data labeling; Y Sun and YH Li checked all labels; Y Sun and HM Yuan calculated all metrics. All the authors read and approved the final manuscript. L Zhou and SQ Liu contributed equally to this work.