TCMNER and PubMed: A Novel Chinese Character-Level-Based Model and a Dataset for TCM Named Entity Recognition

Intelligent traditional Chinese medicine (TCM) has become a popular research field thanks to the rapid development of deep learning technology. Important achievements have been made in representative tasks such as the automatic diagnosis of TCM syndromes and diseases and the generation of TCM herbal prescriptions. However, one unavoidable issue that still hinders progress is the lack of labeled samples, i.e., TCM medical records. Named entity recognition (NER) models trained on various TCM resources can effectively alleviate this problem and continuously increase the number of labeled TCM samples. In this work, on the basis of in-depth analysis, we argue that the performance of TCM NER models can be improved by using character-level representation and tagging, and we propose a novel word-character integrated self-attention module. With the help of TCM doctors and experts, we define 5 classes of TCM named entities and construct a comprehensive NER dataset containing the standard content of publications and clinical medical records. The experimental results on this dataset demonstrate the effectiveness of the proposed module.


Introduction
In recent years, with the boom in deep learning models, applications of artificial intelligence technology in traditional medicine have made considerable progress [1][2][3]. As a representative of traditional medicine, intelligent traditional Chinese medicine (TCM) has become an area of focus. Many excellent works have appeared in intelligent TCM, such as TCM syndrome diagnosis based on symptom sequences [4][5][6][7], TCM herbal prescription generation [8,9], and TCM disease diagnosis [10][11][12]. Currently, the models for these intelligent TCM tasks rely mainly on labeled samples. One of the biggest issues that urgently needs to be solved is the lack of publicly available labeled samples, i.e., TCM medical records. The TCM medical records contain the items needed for the TCM tasks mentioned above, including the clinical manifestation of a patient, the syndrome and disease diagnosis and treatment laws provided by the TCM doctor, and the herbal prescription prescribed by the TCM doctor. Automatically identifying and extracting these items from TCM medical records with deep learning models is an efficient way to continuously increase the number of labeled TCM samples. The named entity recognition (NER) model, which aims to identify target entities in text, is a useful method for solving this issue. Several TCM NER studies have been proposed in recent years with various purposes [13][14][15][16][17][18][19][20][21]. However, the previous works each focus on only one specific TCM resource. The target resources are either standard publications [16,17] or clinical electronic medical records [13,15,19,20,21]. Different types of TCM resources pose different challenges for researchers. In general, it is much less difficult to identify the named entities that appear in TCM publications than those in clinical electronic medical records.
To the best of our knowledge, no study has addressed the TCM NER task on publications and clinical electronic medical records simultaneously, mainly because no dataset contains both types of data. In addition, according to our observation, previous TCM NER works mainly focus on only a few aspects (usually 2-3) of TCM. The classification of the TCM named entity types is also not appropriate.
To fill the gaps in the dataset and the classification of TCM named entity, in this work, we collaborate with the doctors and experts from Beijing University of Chinese Medicine to define 5 classes of named entities, i.e., clinical manifestation, disease, herb, syndrome, and treatment law.
These classes include all types of terms that may appear in the process of TCM diagnosis and treatment. We also propose a Chinese character-level traditional Chinese medicine NER model, called TCMNER, and a NER dataset for TCM. The dataset is collected by ourselves and contains both publications and clinical electronic medical records from various types of TCM resources (e.g., articles, electronic medical records, and books). TCMNER makes use of the Chinese character-level representation, aiming to realize a comprehensive TCM NER task. The reason why we use the Chinese character-level representation is that TCM terms are usually of variable length, especially the clinical manifestation terms, which poses great challenges for the TCM NER task. In TCM terminology, especially in clinical terminology, TCM doctors usually use a phrase or a short sentence to record a symptom of a patient in order to describe it in as much detail as possible. This kind of TCM term usually contains more than 6 Chinese characters within a sentence. Since there is no separator between Chinese words, the commonly used method is to segment the sentence first and then extract the TCM terms. Sentence segmentation may divide a term that should be kept as one word into several parts, which drops a necessary part of the term or assigns wrong tags when an auto-tagging process is used, leading to semantic faults. Besides, the performance of such Chinese NER models depends largely on the segmentation results. Therefore, in this work, we argue that the character-level representation should be used for the TCM NER task. To relieve the lack of word/phrase semantics in the character-level representation, we propose a character-level representation strategy that integrates word semantics and the word-character semantic relation.
The contributions of this work are summarized as follows:
(i) We define and classify the TCM named entity into 5 classes according to the classification of TCM and the process of TCM diagnosis and treatment, called CSDTH classification, including Clinical manifestation, Syndrome, Disease, Treatment Law, and Herb.
(ii) We collect and construct a comprehensive NER dataset called PubMed, which consists of both standard contents of the publications and the clinical electronic medical records from various TCM resources.
(iii) We propose a novel Chinese character-level representation strategy for the TCM NER task.
(iv) We conduct a series of comprehensive experiments to verify the performance of the proposed models. The experimental results demonstrate that the proposed Chinese character-level representation improves the models' performance by a prominent margin.

Related Works.
To alleviate the lack of labeled NER samples, Wang et al. replaced the words in the training set with synonyms. A pretrained model was obtained on the augmented training set. Then the prior semantic knowledge learned by the pretrained model was transferred to the downstream NER task [15]. Zhang et al. used distant supervision as a substitute for human annotation and proposed a novel back-labeling approach to deal with the potential challenge of entities that are not included in the vocabulary [14]. Qu et al. focused on the fuzzy entity recognition problem and proposed the Bert-BiLSTM-CRF model. Their proposed model has an advantage in identifying drug names [17]. Knowledge graph information is utilized by Jin et al. to tackle the rare-word recognition problem.
They proposed the TCMKG-LSTM-CRF model, which introduces a knowledge attention model to apply the attention mechanism between the hidden vector of neural networks and knowledge graph candidate vectors. This model also takes the influence of previous words in a sentence into consideration [21]. Song et al. paid attention to the lexicon information of the target sentence. They incorporated the lexicon information into the representation layer of the BiLSTM-CRF. Experiments conducted on the "Shanghan Lun" dataset showed that their method outperformed the baselines [16]. As for NER from Chinese electronic medical records, Gong et al. implemented a deep learning pretraining method, including word embedding and fine-tuning, as well as the BiLSTM and Transformer. This method identified four types of clinical entities including diseases, symptoms, drugs, and operations [19]. Liu et al. combined the BiLSTM-CRF model with semisupervised learning to reduce the cost of manual annotation and leveraged extraction results. The proposed method is of practical utility in improving the extraction of five types of TCM clinical terms, including traditional Chinese medicine, symptoms, patterns, diseases, and formulas [22]. Zhang et al. worked on building a fine-grained entity annotation corpus of TCM clinical records [13]. They used a four-step approach: (1) determine the entity types through sample annotation, (2) draft a fine-grained annotation guideline, (3) update the guidelines until the prospective performance is achieved, and (4) use the guidelines developed in steps 2 and 3 to construct the corpus. Yin et al. pointed out a drawback of the BiLSTM model in NER, namely that it can only capture contextual semantics between characters in sentences.
Thus, they improved the BiLSTM-CRF model with the use of the radical-level feature and self-attention mechanism. Results of the experiments show comparable performance [20]. As we discussed, a commonly used underlying method in the previous studies is the segmentation of the target sentence. It leads to wrong tokens and words with incomplete semantics. Besides, a comprehensive NER dataset and named entity schema in TCM have not yet been presented. In this study, we focus our attention on addressing these issues.

Traditional Chinese Medicine Named Entity Definition.
In this work, we collaborate with the doctors and experts from Beijing University of Chinese Medicine and Dongzhimen Hospital of Beijing University of Chinese Medicine to define and classify the TCM named entity systematically and comprehensively. After a thorough analysis of the previous works, we found that the classes of TCM named entity used in them usually cover only some types of TCM terms [15,16]. Some works subdivide one class of TCM named entity into several subclasses [13]. We analyse the basic theory of TCM, the different branches of TCM, and the terms used in the process of TCM clinical diagnosis and treatment, and summarize the TCM named entities into 5 classes. The summarized classes are Clinical manifestation, Syndrome, Disease, Treatment law, and Herb, CSDTH for short.
Table 1 shows an example of a TCM-related clinical record. These 5 classes cover almost all aspects of TCM. For instance, the clinical manifestation entities contain all symptoms of a patient collected by the TCM doctor through the four ways of diagnosis, namely looking (red tongue, less tongue coating), listening (wheezing due to retention of phlegm in throat), questioning (insomnia, dreaminess, palpitation, dry stool and once every 2 days, amnesia, tidal fever, and night sweating), and feeling the pulse and skin (small and weak pulse). It is worth noting that, in this work, the symptoms of tongue and pulse, e.g., red tongue and small and weak pulse, and systemic symptoms such as insomnia, dreaminess, and palpitation fall into the same category of clinical manifestation. According to the advice of the TCM doctors and experts, the symptoms of tongue and pulse belong to the symptom category, so there is no need to subdivide them into subcategories. Another difference between this work and the previous works in the classes of TCM named entity is the herb. In this work, we enable the model to identify the most valuable treatment unit, the herb, instead of the prescription name, because in downstream TCM AI tasks such as TCM prescription generation, the model needs to capture the interaction between symptoms and herbs and generate a set or sequence of herbs to form a TCM herbal prescription. Therefore, the identification of herbs is needed much more than that of the prescription name. Based on this named entity classification strategy, we use the BIO schema to define the entity tags, which are shown in Table 2.
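The character-level BIO tagging described above can be illustrated with a minimal sketch. The tag abbreviations (`CM`, `SYN`, `DIS`, `TL`, `HERB`), the helper `bio_tag`, and the example sentence are assumptions made for illustration; the actual tag names are those defined in Table 2.

```python
# Hypothetical tag abbreviations for the five CSDTH classes.
# The real tag names are defined in Table 2 of the paper.
TAGS = {
    "clinical_manifestation": "CM",
    "syndrome": "SYN",
    "disease": "DIS",
    "treatment_law": "TL",
    "herb": "HERB",
}


def bio_tag(sentence, entities):
    """Assign one BIO tag to every character of `sentence`.

    `entities` is a list of (start, end_exclusive, class_name) spans
    expressed in character offsets, so no word segmentation is needed.
    """
    tags = ["O"] * len(sentence)
    for start, end, cls in entities:
        abbr = TAGS[cls]
        tags[start] = f"B-{abbr}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{abbr}"          # remaining characters
    return tags


# Illustrative sentence: "patient, insomnia, dreaminess, red tongue,
# less tongue coating" (symptom names taken from the example above).
sentence = "患者失眠多梦，舌红少苔"
entities = [
    (2, 4, "clinical_manifestation"),   # 失眠 (insomnia)
    (4, 6, "clinical_manifestation"),   # 多梦 (dreaminess)
    (7, 9, "clinical_manifestation"),   # 舌红 (red tongue)
    (9, 11, "clinical_manifestation"),  # 少苔 (less tongue coating)
]
tags = bio_tag(sentence, entities)
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Because every character carries its own tag, an entity of any length stays intact regardless of how a word segmenter would have split the sentence.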

Word-Character Integrated Self-Attention.
As we discussed in the Introduction, we argue that TCM NER should be accomplished by utilizing the Chinese character-level representation and tagging to maintain the complete semantics of named entities that are long words, phrases, or short sentences. However, the character-level representation does not capture the word semantics and the phrase semantics. To alleviate this issue, we propose a novel module that integrates the word-character semantic relation and word semantics into the character-level representation, outputting a character-level representation with word semantics and the word-character semantic relation. The overall architecture of the module is shown in Figure 1.
As shown in Figure 1, the character-level and word/phrase-level representations are obtained by the embedding layers. Then, an attention module takes the word representation and character-level representation as input and outputs the attention weights e_j for each character in the word/phrase. Each attention weight e_j^i represents, according to the word/phrase semantics, the importance of the i-th character in the given word/phrase. Once the attention weights are obtained, the module can generate the character-level representation that integrates word semantics and the word-character semantic relation. In the formulation of this process, W_* and b_* denote the trainable parameters and R_character contains l vectors. After the attention weights are obtained, the new character-level representation of each character, R_i, is calculated as the weighted sum of R_character. Notice that the Embedding and attention operations are agnostic to the model; researchers can replace these two operations with any applicable functions. For example, the Embedding operation can be replaced by popular pretrained language models (e.g., BERT and ALBERT), and the same holds for the attention operation. In this work, the multihead self-attention operation is used to capture the word semantics, the word-piece semantics, and the word-character semantic relations. How the multihead self-attention is applied to generate the new character-level representation from the word- and character-level representations is shown in Figure 2. The self-attention module takes the character-level representation as its key and value, and the word/phrase-level representation as its query. In this way, the information of the word/phrase and the characters can interact, and the new character representation is generated by fusing these two types of information. This module is a plug-and-play unit that is readily combined with other models to perform the TCM NER task.
We will verify its effectiveness and efficiency in the next section in detail.
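The query/key/value arrangement described above (word/phrase representation as query, character representation as key and value) can be sketched as a single-head scaled dot-product attention in plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses one head instead of multihead attention, omits the bias terms, initializes the projection matrices `w_q`, `w_k`, `w_v` randomly, and assumes that each character position queries with the embedding of the word it belongs to (`word_ids` is a hypothetical alignment).

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def word_char_attention(r_word, r_char, w_q, w_k, w_v):
    """Fuse word and character information.

    r_word: (l, d) word/phrase representation aligned to each character (query)
    r_char: (l, d) character representation (key and value)
    Returns the new (l, d) character representation and the (l, l) weights.
    """
    q = matmul(r_word, w_q)
    k = matmul(r_char, w_k)
    v = matmul(r_char, w_v)
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]            # transpose of k
    scores = matmul(q, k_t)                          # (l, l) raw scores
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights               # weighted sum of values


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d = 4
r_char = rand_matrix(5, d)             # 5 characters
word_emb = rand_matrix(2, d)           # 2 words/phrases
word_ids = [0, 0, 0, 1, 1]             # hypothetical character-to-word alignment
r_word = [word_emb[i] for i in word_ids]
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)

new_char, weights = word_char_attention(r_word, r_char, w_q, w_k, w_v)
print(len(new_char), len(new_char[0]))  # one d-dimensional vector per character
```

In this simplified form, characters that belong to the same word share a query and therefore receive the same attention weights; in practice the embedding and attention functions would be replaced by whatever the surrounding model provides.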

Datasets and Metrics.
As we discussed before, due to the limited availability of TCM resources, there is no comprehensive TCM NER dataset that contains both standard publications and clinical medical records. To fill this gap, with the help of doctors and experts from Beijing University of Chinese Medicine and Dongzhimen Hospital of Beijing University of Chinese Medicine, we first collect the standard content of books including the Basic Theory of Traditional Chinese Medicine, the Diagnostics of Traditional Chinese Medicine, the Surgery of Traditional Chinese Medicine, and the Traditional Chinese Pharmacology, as well as open-access TCM articles. We omit the unnecessary content of these publications and only retain the text. Then, all retained texts are split into sentences, and each sentence is regarded as a NER sample. The desensitized clinical medical records are provided by Dongzhimen Hospital of Beijing University of Chinese Medicine. We retain the terms of clinical manifestation, syndrome diagnosis, disease diagnosis, treatment law, and herbs (the herbs in the prescription, not the prescription name) from each medical record. We combine these two datasets into the entire dataset. The statistics of the dataset are shown in Tables 3 and 4. This comprehensive dataset contains 94380 samples in total. We split the dataset into training, validation, and test sets with the proportion of 6 : 2 : 2. The numbers of samples of clinical manifestation, syndrome, disease, treatment law, and herb in the training set are 24332, 7613, 2808, 11186, and 11682, respectively. The sum of these per-class counts exceeds the number of training samples, because some samples contain multiple types of entities. We also calculate the number of entities of the 5 classes for each dataset. As shown in Table 4, in both datasets the most common entity type is the clinical manifestation, with 18150 entities in the publication dataset and 75177 in the medical record dataset.
Medical records contain more entities than publications. The reason is that a great deal of the content in publications serves to explain and prove theories and results. We also note that the number of disease entities is the lowest in both datasets. We count the samples for each dataset and notice that, although the number of samples in the medical record dataset is smaller than in the publication dataset, the number of entities of all classes in the medical record dataset is far larger than in the publication dataset. This demonstrates that each sample of the medical records contains more effective entities than a sample of the publications.
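The 6 : 2 : 2 split described above can be sketched as follows. The function name `split_622`, the shuffling step, and the fixed seed are assumptions for illustration; the source does not state how samples were assigned to the three sets.

```python
import random


def split_622(samples, seed=42):
    """Split samples into training, validation, and test sets at 6 : 2 : 2.

    Shuffling with a fixed seed is an assumption made here so that the
    sketch is reproducible; it is not taken from the paper.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10      # integer arithmetic avoids float rounding
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Stand-in sample ids with the total size reported in the text.
train, val, test = split_622(range(94380))
print(len(train), len(val), len(test))
```

With 94380 samples, a 6 : 2 : 2 split yields 56628 training samples, which is slightly below the sum of the per-class training counts (57621), consistent with some samples containing several entity types.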
In this work, we conduct a series of comprehensive experiments to verify the proposed module. The comparison models include (1) BiLSTM-CRF, a bidirectional long short-term memory network (LSTM) with a conditional random field (CRF) layer that is the most popular architecture for ... (9) RoBERTa-BiLSTM, a RoBERTa model combined with a bidirectional LSTM layer. The comparison models are examined for different purposes. For evaluation metrics, we use precision, recall, and F1-score. Precision refers to the ratio of correct entities to predicted entities. Recall is the proportion of the entities in the test set that are correctly predicted. The F1-score is a balanced measure of precision and recall and is calculated by the following formulation:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The traditional way to calculate the values of precision, recall, and F1-score is based on the classification results of the NER model, which reflects the performance of the model in classifying samples into the desired category. In addition to the traditional classification evaluation, we introduce a more rigorous method to calculate the values of precision, recall, and F1-score, called identification. In TCM NER, in addition to the entities of the target classes, there are plenty of entities marked with "O". An entity labeled "O" is not a useful entity in real-world scenarios. Models that can classify the target entities into their correct categories are much more important than models that correctly classify entities as "O". Thus, in this kind of experiment, we filter out the "O" tags in each label and only retain the tags of the 5 target entity classes and their positions in the original sample. The predicted tags are also filtered by these positions. In this way, we can focus our attention on verifying the models' performance in identifying the useful types of entities.
Since the aim of TCM NER is for the trained NER model to obtain a high identification performance on new electronic medical records, we train all comparison models on publications first. Then, all models are evaluated on the medical records to verify their recognition performance. The publication dataset is also divided with the proportion of 6 : 2 : 2.
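The difference between the classification and identification evaluations can be sketched at the tag level as follows. This is one plausible reading of the procedure, under stated assumptions: the metrics are micro-averaged over non-"O" tags, the tag names (`B-CM`, `B-HERB`, and so on) are hypothetical, and the helper names `prf` and `identification_filter` are introduced only for this example.

```python
def prf(gold, pred):
    """Micro precision, recall, and F1 over non-"O" tags."""
    correct = sum(1 for g, p in zip(gold, pred) if g != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def identification_filter(gold, pred):
    """Keep only the positions whose gold tag is a target-class tag.

    The predicted tags are filtered by the same positions, as described
    for the identification evaluation.
    """
    keep = [i for i, g in enumerate(gold) if g != "O"]
    return [gold[i] for i in keep], [pred[i] for i in keep]


# Hypothetical gold and predicted tag sequences for one sample.
gold = ["O", "O", "B-CM", "I-CM", "O", "B-HERB", "I-HERB"]
pred = ["O", "B-CM", "B-CM", "I-CM", "O", "B-HERB", "O"]

classification = prf(gold, pred)
identification = prf(*identification_filter(gold, pred))
print("classification:", classification)
print("identification:", identification)
```

In this example the prediction at the second position is a false positive on a gold "O" tag. It lowers the classification precision but is removed by the position filter, so the identification scores are higher.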

Experimental Results.
We take RoBERTa with the word-character integrated self-attention module, called RoBERTa-c, as our basic model to compare with other models. We trained BiLSTM-CRF, BERT-CRF, BERT-BiLSTM, BERT-BiLSTM-CRF, RoBERTa-BiLSTM, and RoBERTa-c on the publication dataset to verify the performance of the different models. Then, the test set of the publication dataset and all samples in the medical record dataset are used for verification. The precision, recall, and F1-score of all comparison models are shown in Table 5. As shown in Table 5, the BiLSTM-CRF model obtains a higher precision than BERT-CRF on both publications and medical records.
This is inconsistent with researchers' intuition, since the only difference between these two models is the representation extraction layer, and numerous studies have shown that BERT extracts contextual representations better than LSTM. The recall of BERT-CRF, however, is far better than that of BiLSTM-CRF. Comparing BERT-BiLSTM with BERT-BiLSTM-CRF, we notice that, without the CRF layer, BERT-BiLSTM gains a significant improvement on both datasets. Taken together with the results of BERT-CRF and BiLSTM-CRF, this suggests that the performance decrease of BERT-CRF and BiLSTM-CRF may be caused by the CRF layer. RoBERTa-BiLSTM obtains a result similar to BERT-BiLSTM on the publication dataset but performs better on the medical record dataset, with a 1.4% improvement in F1-score. With the assistance of the proposed character representation, RoBERTa alone (RoBERTa-c) achieves the best performance among all comparison models. Its F1-scores on both datasets are higher than 90%. We also notice that all comparison models except BiLSTM-CRF obtain a higher precision, recall, and F1-score on the test set of publications than on medical records.
To verify the generalization ability of the proposed word-character integrated self-attention module across all models compared in this work, we conduct ablation studies on all of them. We combine each model with the word-character integrated self-attention module and rename it as " * -c," where " * " represents the original model's name and "-c" means the word-character integrated self-attention module. In these experiments, we also verify each model's performance on every category of the TCM named entity. For each category, we only retain the samples that contain that single type of entity and filter out the samples that contain any of the other 4 types, forming 5 distinct datasets. The F1-scores for all models on 6 datasets are shown in Table 6.
As shown in Table 6, the performance of each model with the word-character integrated self-attention module is improved to a certain degree. Each model with the proposed module obtains a 1%-2% improvement in F1-score. The biggest improvement is obtained by BERT-BiLSTM-CRF, which means that the word-character integrated self-attention module can effectively relieve the performance decrease caused by the CRF layer. Comparing BERT, BERT-LSTM, and BERT-BiLSTM, we found that BERT obtains the best performance on all 6 datasets. Adding an LSTM layer after BERT causes a performance decrease. The reason might be that the representation obtained by BERT already contains the bidirectional contextual information of the sentence/text. The LSTM layer can only use the semantic information in one direction, which erases the necessary information in the other direction. We came to this assumption because, when the BiLSTM layer is used after BERT, the performance is significantly improved. The same situation occurred in the experiments with RoBERTa-based models. RoBERTa obtains the best results on all 6 datasets when compared with RoBERTa-LSTM and RoBERTa-BiLSTM. The LSTM layer also causes a performance decrease on RoBERTa, and the BiLSTM layer brings the performance back up. The models that use only BERT or only RoBERTa gain almost the highest F1-scores compared with their two corresponding LSTM-based architectures, which shows that the underlying language models can capture good contextual representations from the target sentences. BERT-c and RoBERTa-c demonstrate that the word-character integrated self-attention module can improve them further. In this work, the experiments show that the LSTM layer, the BiLSTM layer, and the CRF layer may cause a performance decrease. Across all datasets, we found that all models achieve poorer performance on the medical record dataset than on the publication dataset. This may be caused by two reasons: (1) all models are trained on the training set of the publication dataset and fit it well, and (2) the medical record datasets are clinical electronic records whose terms are not filled in and written very formally.
Figure 3: All compared models achieve a better performance in identification than in classification. It shows that all models misclassify characters whose gold label is "O" into "B- * /I- * ". We notice that the difference of RoBERTa-c between classification and identification is the smallest.
In real-world scenarios, the aim of TCM NER is to identify the useful named entities in clinical electronic records, e.g., clinical manifestation, syndrome, disease, treatment law, and herb, rather than unnecessary entities such as the patient's name and age. In the NER schema, these unnecessary entities are tagged as "O". Thus, models that can correctly classify the characters of an entity into "B- * /I- * " are more useful and more appreciated than models that accurately classify characters into "O". To verify this ability of the representative models, we carry out the experiments in the identification way. The identification F1-scores of all comparison models are shown in Figure 3. As shown in Figure 3, all models score higher in identification than in classification, which means the ratio of correctly classifying the true entities is larger than the ratio of correctly classifying the true nonentities. On the publication dataset (the left two pictures), the gap of RoBERTa-c between identification and classification is the smallest among the representative models. This demonstrates that the model is stable in identifying useful TCM named entities, which shows the effectiveness of the word-character integrated self-attention module.
The BERT-BiLSTM-CRF model's performance decreases from 96.1 to 75.3 on the disease category, which means the ratios of misclassifying both the "O" entities and the "B- * /I- * " entities are relatively high. The same situation for BERT-BiLSTM-CRF also occurred on the medical record dataset. On the medical record dataset (the right two pictures), RoBERTa-c still achieves consistent performance. We notice that all models in this experiment obtain larger improvements on the medical record dataset than on the publication dataset, which means that all models fit the medical record dataset less well than the publication dataset. However, considering that all models are trained without any samples from the medical record dataset, the language-model-based models actually obtain promising performance, especially the model with the word-character integrated self-attention module. We also notice that the performance of all models on treatment law and herb is higher than on the other types of entities, especially in medical records. This is mainly because of the normalization of these two types of entities. These two types of entities are usually more standardized terms, while the other three types are more irregular, especially the clinical manifestation entities.

Conclusions
In this work, we work with TCM doctors and experts to collect a publication dataset and a medical record dataset to fill the gap left by the lack of comprehensive TCM NER datasets. These datasets contain not only the standard contents extracted from books and articles but also clinical electronic medical records, which make the datasets more challenging for TCM NER. We systematically define 5 types of TCM named entity according to all aspects of TCM diagnosis and treatment, called the CSDTH classification strategy.
The CSDTH includes the Clinical manifestation (the pathological information collection), Syndrome (the TCM diagnosis of the course of the disease), Disease (the diagnosis of TCM disease), Treatment law (the decision of treatment principles), and Herb (the concretely used medicines). To handle the variable length of the potential entities, we argue that the character-level representation and tagging might be more suitable for the TCM NER task and propose a word-character integrated self-attention module to generate a new character-level representation. The exhaustive experiments demonstrate the effectiveness of the proposed module and the pros and cons of different models.

Data Availability
The datasets used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.