Named Entity Recognition in Chinese Medical Literature Using Pretraining Models

The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatments of a particular disease. Named Entity Recognition (NER) is the initial step in extracting this knowledge from unstructured text and presenting it as a Knowledge Graph (KG). However, previous NER approaches have often suffered from small-scale human-labelled training data. Furthermore, extracting knowledge from Chinese medical literature is a more complex task because there is no segmentation between Chinese characters. Recently, pretraining models, which obtain representations with prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results for a wide variety of Natural Language Processing (NLP) tasks. However, the capabilities of pretraining models have not been fully exploited, and applications of pretraining models other than BERT in specific domains, such as NER in Chinese medical literature, are also of interest. In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. First, we propose a method of data augmentation that replaces words in the training set with synonyms through the Masked Language Model (MLM), which is a pretraining task. Then, we consider NER as the downstream task of the pretraining model and transfer the prior semantic knowledge obtained during pretraining to it. Finally, we conduct experiments to compare the performance of six pretraining models (BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa) in recognizing named entities from Chinese medical literature. The effects of feature extraction and fine-tuning, as well as of different downstream model structures, are also explored. Experimental results demonstrate that the proposed data augmentation method yields meaningful improvements in recognition performance. In addition, RoBERTa-CRF achieves the highest F1-score compared with the previous methods and the other pretraining models.


Introduction
In recent decades, the rapid growth of information technology has resulted in huge amounts of information being generated and shared in the field of medicine, where the number of published documents, such as articles, books, and technical reports, is increasing exponentially [1]. For example, a PubMed search for the keyword "Diabetes" alone returns over 380,000 publications (Jan. 2009 to Oct. 2019). The medical literature contains valuable knowledge, such as the clinical symptoms, diagnosis, and treatments of a particular disease. However, it is time-consuming and laborious for medical researchers to obtain knowledge from these documents. Thus, it is critical to extract information and knowledge from unstructured medical literature using novel information extraction techniques and to present the findings in a visually intuitive Knowledge Graph that provides machine-understandable information about medicine [2,3].
Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP). It is also the initial step in extracting valuable knowledge from unstructured text and building a medical Knowledge Graph (KG). As shown in Figure 1, NER aims to recognize entities in unstructured text, and the results of NER may affect subsequent knowledge extraction tasks, such as Relation Extraction (RE). In the early years, researchers used rule-based or dictionary-based methods for NER tasks [4,5]. However, these methods generalize poorly because they are designed for particular types of entities. Traditional machine learning methods and the deep learning methods that have emerged in recent years are also used in NER tasks [6]. Nevertheless, the performance of these methods often suffers from small-scale human-labelled training data, resulting in poor generalization capability, especially for rare words. Moreover, recognizing entities in Chinese documents is a more complex task because there is no segmentation between Chinese characters. Furthermore, in Chinese medical literature, some English symbols, such as the chemical symbols Na and K, may appear in the documents, which makes the NER task more difficult. Therefore, it is of interest to know whether prior semantic knowledge can be learned from large amounts of unlabelled corpora to improve the performance of NER.
Recently, pretraining models (e.g., BERT and ERNIE) have achieved state-of-the-art (SOTA) results on several NLP tasks. The pretraining models obtain prior semantic knowledge from large-scale unlabelled corpora through pretraining tasks and improve the performance of downstream tasks by transferring this knowledge to them. However, the capabilities of pretraining models have not been fully exploited. Most previous work has focused on BERT [7,8], but applications of other pretraining models in specific domains, such as NER in Chinese medical literature, are also of interest.
In this paper, we enhance the performance of NER in Chinese medical literature using pretraining models. The dataset we used is "A Labelled Chinese Dataset for Diabetes (LCDD)," which contains authoritative Chinese medical literature from the past seven years. The main contributions of this paper can be summarized as follows: (1) We propose a method of data augmentation based on the Masked Language Model (MLM). During the MLM procedure, the pretraining model predicts the masked words, and these predictions can be used for synonym replacement to augment the training set [9]. Considering that there is no segmentation between Chinese characters, we choose ERNIE to conduct this task because it has entity-level and phrase-level masking strategies.

Related Work
In this section, we introduce related work on Named Entity Recognition, pretraining models, and data augmentation.

Named Entity Recognition.
Named Entity Recognition aims to identify chunks of text that refer to specific entities of interest, such as drugs, symptoms, treatments, and diseases. Rule-based and dictionary-based approaches played an important role in early work. For example, Gerner et al. [10] used a dictionary-based approach to identify species names in biomedical literature. Fukuda et al. [11] proposed a rule-based method to extract material names such as proteins from biological documents. However, these methods generalize poorly because they rely on hand-crafted rules. Researchers have also tried using machine learning methods to recognize entities in unstructured data. He et al. [12] presented a CRF-based approach to recognize drug names in biomedical texts. Wang et al. [13] compared six biomedical NER tools based on the Hidden Markov Model (HMM) and Conditional Random Field (CRF). Nevertheless, machine learning methods require a set of features to be chosen manually, which is time-consuming and laborious. In recent years, deep learning methods, which can improve the performance of NER without feature engineering, have received increasing attention. For example, Zhu et al. [14] proposed an end-to-end deep learning approach for biomedical NER, and the work in [16] used dictionary features to help identify rare and unseen clinical named entities. However, deep learning methods still suffer from insufficient training data.

Pretraining Models.
Recently, pretraining models, which generate representations of words with prior semantic knowledge from large-scale unlabelled corpora, have achieved state-of-the-art results for a wide variety of NLP tasks [17]. Various pretraining models have emerged since Devlin et al. [18] released BERT in 2018. These models consist of multilayer bidirectional Transformer blocks [19]. The main differences among pretraining models lie in the pretraining tasks and pretraining corpora. Table 1 shows the differences in detail. We denote the number of Transformer layers as L, the hidden size as H, and the number of self-attention heads as A. During Next Sentence Prediction (NSP), which is one kind of pretraining task, the pretraining models are trained to predict whether two sentences have a contextual relationship, and in this way they learn to model the relationship between sentences.
For the NER task, Devlin et al. [18] first treated NER as a downstream task of BERT for extracting named entities from news text (MSRA-NER). Pires et al. [7] realized zero-shot NER through multilingual BERT. Pretraining models are also used for domain-specific NER, such as in biomedicine. For example, Hakala and Pyysalo [8] applied a CRF-based baseline approach and multilingual BERT to the Spanish biomedical NER task. However, the capabilities of pretraining models have not been fully exploited. Furthermore, applications of pretraining models other than BERT in specific domains, such as NER in Chinese medical literature, are also of interest.

Data Augmentation.
A common approach to data augmentation in NLP is synonym replacement [24]. A previous work found synonyms with k-nearest neighbours using Word2Vec [25]. However, the MLM of pretraining models is more suitable for synonym replacement, not only because the word representations obtained by the pretraining models contain richer semantic knowledge than those of previous models but also because Word2Vec cannot handle polysemous words. Wu et al. [9] proposed a method of data augmentation based on BERT. However, BERT masks Chinese characters, not words, during the MLM procedure because there is no segmentation between Chinese characters. Therefore, we perform data augmentation based on ERNIE because it has entity-level and phrase-level masking strategies in the MLM process. The method of data augmentation is presented in Section 3.1.

Data Augmentation Using ERNIE.
As mentioned earlier, the Masked Language Model (MLM) is well suited to data augmentation. During the MLM procedure, a certain portion (e.g., 15%) of words are replaced by a special symbol [MASK], and the pretraining model is trained to predict the masked words. Specifically, for a token sequence $x = (x_1, \ldots, x_T)$, the pretraining model first constructs a corrupted sequence $\hat{x}$ by randomly setting a portion of tokens in $x$ to a special symbol [MASK] [26]. The training objective is to reconstruct the masked tokens $\bar{x}$ from $\hat{x}$:

$$\max_{\theta} \; \log p_{\theta}(\bar{x} \mid \hat{x}) \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{x}),$$

where $m_t = 1$ indicates that $x_t$ is masked. The whole process is like a Cloze task [18]. We repeat the MLM process using an already trained pretraining model. The model is not retrained and is only used to predict masked words. The words predicted by the model can be regarded as synonyms of the masked words. We perform data augmentation based on ERNIE because it has entity-level and phrase-level masking strategies in the MLM process. A visualization of the process can be seen in Figure 2. By default, ERNIE randomly masks a portion of characters or words in the input sequence [21]. It is worth noting that masking the named entities is not appropriate because these entities may be proper nouns or rare words in medical literature, especially disease and drug entities like "糖尿病 (diabetes)" and "胰岛素 (insulin)" in Figure 2. When ERNIE predicts these entities, the result may not be a correct Chinese word because information about these entities may not have been obtained during pretraining. Therefore, we only randomly mask tokens that are not part of named entities. Furthermore, unlike pretraining, where sentence pairs are input [18,21], we input a single sequence that starts with the classification token [CLS] and ends with the ending token [SEP], because the context information of sentence pairs is not necessary.
As shown in Figure 2, one sequence input into ERNIE consists of the following four parts: (1) Token IDs: We use the original vocabulary provided by ERNIE to get the ID number of each token. (2) Sentence IDs: ERNIE uses this mark to determine the sentence to which a token belongs. As mentioned earlier, we input a single sentence, not a sentence pair. Accordingly, all the sentence ID numbers are 0. (3) Position IDs: The Transformer cannot obtain position information through self-attention heads, since it contains no recurrence and no convolution [19]. Therefore, the position ID number is injected to provide information about the relative or absolute position of the tokens. (4) Segmentation IDs: The segmentation IDs represent the word segmentation information. Specifically, "0" marks a token that begins a word, and "1" marks a token that does not. Moreover, we assign "−1" to the positions of [CLS], [SEP], and named entities. ERNIE will not mask a token whose segmentation ID equals "−1." We use THULAC (http://thulac.thunlp.org/) for word segmentation [27].
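The input layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each Chinese character is assumed to be one token, and the vocabulary lookup that produces the token IDs is omitted because it depends on the ERNIE vocabulary file.

```python
from typing import Dict, List, Set


def build_inputs(words: List[str], entity_word_idx: Set[int]) -> Dict[str, list]:
    """Build the sentence, position, and segmentation IDs for one sentence.

    `words` is the output of a word segmenter (the paper uses THULAC), and
    `entity_word_idx` holds the indices of words that are named entities.
    Assumption: every character of a word is a separate token.
    """
    tokens = ["[CLS]"]
    seg_ids = [-1]  # [CLS] must never be masked
    for i, word in enumerate(words):
        for j, ch in enumerate(word):
            tokens.append(ch)
            if i in entity_word_idx:
                seg_ids.append(-1)  # named entities are protected from masking
            else:
                seg_ids.append(0 if j == 0 else 1)  # 0 = word start, 1 = continuation
    tokens.append("[SEP]")
    seg_ids.append(-1)  # [SEP] must never be masked
    return {
        "tokens": tokens,
        "sentence_ids": [0] * len(tokens),        # single sentence, so all zeros
        "position_ids": list(range(len(tokens))),  # absolute positions
        "segmentation_ids": seg_ids,
    }


def maskable_positions(seg_ids: List[int]) -> List[int]:
    """Positions that the MLM step is allowed to mask."""
    return [i for i, s in enumerate(seg_ids) if s != -1]


example = build_inputs(["病人", "口服", "胰岛素"], entity_word_idx={2})
print(example["segmentation_ids"])
print(maskable_positions(example["segmentation_ids"]))
```

In this example the drug entity "胰岛素" and the two special tokens receive −1, so only the characters of "病人" and "口服" remain candidates for masking.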
As can be seen in Figure 2, "病人 (patients)" and "口服 (take orally)" in the raw sentence are replaced by "患者 (patients)" and "注射 (be injected with)," respectively. The words in each pair are treated as synonyms in Chinese. We perform the above operation on all samples in the training set to obtain the dataset $D'$. Finally, we combine the dataset $D'$ generated by ERNIE with the original training data $D$ to get the augmented training data $D_{aug}$.
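The construction of $D_{aug}$ can be illustrated with the following sketch. It is written under several assumptions: `predict_word` is a hypothetical callable standing in for ERNIE's MLM head, words are masked and predicted one at a time, and the toy predictor at the bottom is a lookup table that merely reproduces the replacements shown in Figure 2.

```python
import random
from typing import Callable, List, Set, Tuple

Sample = Tuple[List[str], Set[int]]          # (segmented words, indices of entity words)
Predictor = Callable[[List[str], int], str]  # hypothetical: (masked words, position) -> word


def augment_sample(words: List[str], entity_idx: Set[int], predict_word: Predictor,
                   mask_ratio: float = 0.15, seed: int = 0) -> List[str]:
    """Replace a portion of the non-entity words with the model's predictions."""
    candidates = [i for i in range(len(words)) if i not in entity_idx]
    if not candidates:
        return list(words)
    n_mask = min(len(candidates), max(1, round(len(candidates) * mask_ratio)))
    chosen = random.Random(seed).sample(candidates, n_mask)
    new_words = list(words)
    for i in chosen:
        masked = new_words[:i] + ["[MASK]"] + new_words[i + 1:]
        new_words[i] = predict_word(masked, i)  # the model is only queried, never retrained
    return new_words


def build_augmented_dataset(dataset: List[Sample], predict_word: Predictor,
                            mask_ratio: float = 0.15) -> List[Sample]:
    """Return D_aug, the original data D followed by the generated data D'."""
    d_prime = [(augment_sample(w, e, predict_word, mask_ratio), e) for w, e in dataset]
    return list(dataset) + d_prime


# Toy stand-in for the pretraining model (an assumption, not ERNIE itself).
original = ["病人", "口服", "胰岛素"]
toy_synonyms = {"病人": "患者", "口服": "注射"}


def toy_predictor(masked_words: List[str], position: int) -> str:
    return toy_synonyms.get(original[position], original[position])


d_aug = build_augmented_dataset([(original, {2})], toy_predictor, mask_ratio=1.0)
print(d_aug)
```

Because entity words are never masked, the entity labels of the original sample can be reused for the generated sample.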

Named Entity Recognition Using Pretraining Models.
We consider NER in medical literature as the downstream task of the pretraining model. As the pretraining models are pretrained on large-scale unlabelled corpora, the output of pretraining models can be regarded as the representations of tokens with prior semantic knowledge.
The key to using a pretraining model for NER is how to transfer the prior semantic knowledge obtained from the source domain to the target domain (e.g., Chinese medical literature NER in this paper). There are two main approaches to transferring the prior semantic knowledge to the downstream tasks: feature extraction and fine-tuning [28]. For feature extraction, the parameters of the pretraining model are fixed and only the parameters of the downstream model are trained through the downstream task. The pretraining model is regarded as a feature extractor and outputs the representations of tokens with prior semantic knowledge from the source domain. The representations, which are higher-level and more abstract features, are then input into the downstream model. For fine-tuning, on the other hand, all the parameters of the pretraining model and the downstream model are trained through the downstream task. The pretraining model thus learns the semantic knowledge of the target domain from the training data of the downstream task. These two approaches are illustrated in Figure 3, where areas marked by blue squares indicate that the parameters of the corresponding models are trained through the downstream task.
For the structure of the downstream model, we test the following three common modules: a fully connected (FC) layer, LSTM, and CRF. As shown in Figure 3, the LSTM and CRF are optional. The performance of the different modules is reported in Section 4.

Experiments and Results
In this section, we introduce the dataset for the NER task and present the results. The experiments were performed with PaddlePaddle, a deep learning framework. For hardware, we used an eight-core CPU and a V100 GPU.

Dataset.
The dataset we used is "A Labelled Chinese Dataset for Diabetes," which is provided by Alibaba Cloud [29]. This dataset comes from authoritative Chinese diabetes journals of the past seven years, from which literature related to basic research, clinical research, drug usage, diagnosis, and treatment methods was selected. The dataset covers the latest research hotspots on diabetes and is labelled by professionals with a medical background. We divided this dataset into a training set, a development set, and a test set in the ratio 6 : 2 : 2. The details of the labels are given in Table 2.
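As an illustration of the 6 : 2 : 2 division, the sketch below splits a list of samples into three parts. The shuffling, the fixed seed, and the function name are assumptions made for the example; the paper does not state how samples were assigned to each part.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(samples: Sequence, parts: Tuple[int, int, int] = (6, 2, 2),
                  seed: int = 42) -> Tuple[List, List, List]:
    """Split samples into training, development, and test sets by integer ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # assumption: a random, reproducible split
    total = sum(parts)
    n_train = len(items) * parts[0] // total
    n_dev = len(items) * parts[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


train, dev, test = split_dataset(range(10))
print(len(train), len(dev), len(test))
```

Integer arithmetic is used for the sizes so that rounding of floating-point ratios cannot shift a sample from one part to another.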

Experiment Settings.
We tested the performance of NER from the following three aspects: (1) using the data augmentation method we proposed; (2) using pretraining models and common deep learning models such as the BiLSTM; (3) using downstream models with different structures. Firstly, we tested the performance using the original dataset and the augmented dataset. Then, the performance of pretraining models, including the BERT series, ERNIE, and RoBERTa, was compared with that of common deep learning models, such as BiLSTM. Finally, we compared the performance when the downstream model is the LSTM or CRF. For the pretraining models, the parameters were initialized from the pretrained parameters provided by their authors. For the downstream models, the weights were initialized using Xavier initialization, while the biases were initialized to 0. The hyperparameters were set by trial and error. We evaluated the performance on the development set every 1000 steps, and training was stopped early once the loss no longer dropped. The hyperparameters finally selected were those that performed best on the development set. All the hyperparameters involved are listed in Table 3.
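The initialization and stopping rule described above can be sketched as follows. This is a framework-free illustration rather than the PaddlePaddle code used in the experiments: it assumes the uniform variant of Xavier initialization, and the `patience` parameter is an assumption, with a value of 1 corresponding to stopping as soon as the development loss fails to drop.

```python
import math
import random
from typing import List, Optional, Tuple


def init_fc(fan_in: int, fan_out: int, seed: int = 0) -> Tuple[List[List[float]], List[float]]:
    """Xavier (Glorot) uniform weights and zero biases for one FC layer."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    weights = [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]
    biases = [0.0] * fan_out
    return weights, biases


def early_stop_index(dev_losses: List[float], patience: int = 1) -> Optional[int]:
    """Index of the evaluation at which training stops, or None if it never stops.

    Each entry of `dev_losses` is one evaluation on the development set
    (every 1000 steps in the paper's setting).
    """
    best = math.inf
    bad_evals = 0
    for i, loss in enumerate(dev_losses):
        if loss < best:
            best, bad_evals = loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None


weights, biases = init_fc(fan_in=4, fan_out=3)
print(early_stop_index([0.9, 0.7, 0.6, 0.65, 0.5]))
```

With the loss sequence in the example, training stops at the fourth evaluation (index 3), the first one at which the loss fails to improve on the best value seen so far.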
For the evaluation, we used precision, recall, and F1-score. Precision is the ratio of correctly predicted entities to all predicted entities. Recall is the proportion of the entities in the test set that are correctly predicted. We use P, R, and F to represent precision, recall, and F1-score, respectively. The F1-score is calculated as follows:

$$F = \frac{2 \times P \times R}{P + R}.$$

The F1-score is thus the harmonic mean of precision and recall, which comprehensively reflects the performance of the model on NER tasks.
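A minimal sketch of these metrics is given below. It assumes BIO-style tags and exact-match scoring at the entity level, meaning that a predicted entity counts as correct only when both its span and its type match; the paper does not spell out the matching rule, so this is an assumption, and the function names are illustrative.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end_exclusive, entity type)


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect entity spans from a BIO tag sequence."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing entity
        if tag.startswith("I-") and etype == tag[2:]:
            continue  # still inside the current entity
        if start is not None:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Entity-level P, R, and F for one tagged sequence."""
    gold_entities = extract_entities(gold)
    pred_entities = extract_entities(pred)
    correct = len(gold_entities & pred_entities)
    p = correct / len(pred_entities) if pred_entities else 0.0
    r = correct / len(gold_entities) if gold_entities else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


gold = ["B-Drug", "I-Drug", "O", "B-Disease", "I-Disease", "I-Disease"]
pred = ["B-Drug", "I-Drug", "O", "B-Disease", "I-Disease", "O"]
print(precision_recall_f1(gold, pred))
```

In the example the drug entity is matched exactly, while the disease entity is predicted one token too short and therefore counts as an error for both precision and recall.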

Results.
Firstly, we tested the effect of the data augmentation method we proposed. The augmented dataset is obtained through the MLM of ERNIE as described in Section 3. We used three pretraining models (BERT, ERNIE, and ERNIE-tiny) on the original dataset and the augmented dataset, respectively. The parameters of the pretraining models are updated through fine-tuning. The downstream model is a single-layer FC without the CRF or LSTM. The results are shown in Table 4. The performance of NER in Chinese medical literature can be improved when using the augmented dataset.
Then, we compared the performance of pretraining models with that of common deep learning models. The results are shown in Table 5. The parameters of the pretraining models are again updated during fine-tuning, and the downstream model is again a single-layer FC without the CRF or LSTM. As Table 5 shows, using pretraining models yields meaningful improvements in the performance of NER. Among the pretraining models, the F1-score of ERNIE-tiny is the lowest, at only 89.466%. In contrast, RoBERTa obtained the highest F1-score, 91.209%. Moreover, the performance of the BERT series models (BERT, BERT-WWM, and BERT-WWM-EXT) is higher than that of ERNIE.
Furthermore, we compared the two main approaches to transferring prior semantic knowledge to the NER task: feature extraction and fine-tuning. For feature extraction, we fixed the parameters of the pretraining models. For fine-tuning, in contrast, the parameters of the pretraining models were trainable and were updated on the training set. The downstream model is again a single-layer FC without the CRF or LSTM. The results shown in Table 6 indicate that the F1-score can be slightly increased through fine-tuning.
Finally, we tested the performance of different downstream model structures. RoBERTa was used as the pretraining model in this test. For the downstream model, we tested the FC, CRF, LSTM-CRF, and BiLSTM-CRF structures. For LSTM-CRF and BiLSTM-CRF, the dimension of the hidden layer was 128. Table 7 shows that recognition performance decreased when a fairly complex model was used as the downstream model.

Discussion
In this section, we will discuss the experimental results in detail.

Data Augmentation.
The results show that the augmentation method we proposed can increase the F1-score by approximately 0.14% on average. Although the improvement is not significant, the result is meaningful because it demonstrates that data augmentation using ERNIE is feasible. As mentioned in Section 2.3, BERT masks Chinese characters, not words, during the MLM procedure because there is no segmentation between Chinese characters, and the results may not be grammatically correct Chinese sentences. In contrast, the MLM of ERNIE can replace a portion of Chinese phrases or words with synonyms. The semantics of the new Chinese sentences generated by ERNIE are similar to those of the original sentences, and the new and original sentences are combined into the augmented dataset. We do not mask the named entities because these entities may be proper nouns or rare words in medical literature. The results also demonstrate that the augmentation method we proposed is meaningful and feasible.

Comparison of Pretraining Models with Common Deep Learning Methods.
Using pretraining models yields meaningful improvements in the performance of NER. The pretraining models have learned abundant prior semantic knowledge from the pretraining corpora (e.g., Chinese Wikipedia and Baidu News) [20,21]. The pretraining corpora can be regarded as the "source domain." When conducting the NER task, the prior semantic knowledge is transferred to the downstream task, which can be regarded as the "target domain." The whole process can be regarded as transfer learning. Task-specific semantic knowledge contained in the target domain is obtained during fine-tuning. In contrast, the common deep learning models, whether the baseline model (BiLSTM-CRF) or other deep learning models, are trained from scratch on the target domain. Therefore, these models can only learn the knowledge contained in the training set. The experimental results also indicate that using pretraining models increases the F1-score by at least 3%.

Comparison between Pretraining Models.
We also compared the performance of the six most common pretraining models for NER in Chinese medical literature: BERT, BERT-WWM, BERT-WWM-EXT, ERNIE, ERNIE-tiny, and RoBERTa. First of all, the results show that, for pretraining models with similar pretraining tasks and the same pretraining corpus, such as ERNIE and ERNIE-tiny, a deeper model performs better. ERNIE has twelve Transformer layers, but ERNIE-tiny has only three. Although ERNIE-tiny increases the number of hidden units and optimizes the pretraining task with continual pretraining [30], three Transformer layers cannot extract semantic knowledge well. The F1-score of ERNIE-tiny is the lowest among all the pretraining models.
Secondly, among pretraining models with the same model structure, RoBERTa obtains the highest F1-score. From the perspective of the pretraining task, RoBERTa removes the sentence-level pretraining task because Liu et al. [23] found that removing the NSP loss in BERT can slightly improve the performance of downstream tasks. For NER in Chinese medical literature, the pretraining models do not need to learn sentence-level semantic knowledge during pretraining, because the inputs are all individual sentences, not sentence pairs. The NSP and the Dialogue Language Model (DLM) of BERT and ERNIE are designed to improve the performance of specific downstream tasks, such as SQuAD 1.1, which requires reasoning about the relationship between sentence pairs. Moreover, as mentioned before, RoBERTa can acquire richer semantic representations with a dynamic masking strategy [23]. In contrast, BERT and ERNIE use a static masking strategy in every pretraining epoch. Therefore, their performance is slightly lower than that of RoBERTa.
Finally, for pretraining models with the same pretraining tasks and the same model structure, such as BERT-WWM and BERT-WWM-EXT, different pretraining corpora affect the performance of NER in Chinese medical literature. The pretraining corpus of BERT-WWM is the Chinese Wikipedia, while the pretraining corpus of BERT-WWM-EXT includes not only the Chinese Wikipedia but also News and Q&A [20]. The training dataset we used contains formal scientific literature, and the pretraining corpus of BERT-WWM is closer to it. The results in Table 5 show that the F1-score of BERT-WWM is slightly higher than that of BERT-WWM-EXT.

Comparison of Feature Extraction and Fine-Tuning Approaches.
As shown in Table 6, the F1-score can be slightly increased through fine-tuning. This phenomenon may indicate that the pretraining models obtain semantic knowledge from the target domain during fine-tuning. In other words, when the pretraining models are used only as feature extractors, the representations they output are general-purpose and are not well adapted to the specific NER task, because task-specific representations cannot be obtained in this case. Task-specific representations can instead be obtained through fine-tuning. However, considering that the improvement is not significant and that feature extraction is computationally cheaper than fine-tuning, the transfer method should be selected according to the specific conditions in practice.

Comparison of Different Downstream Model Structures.
According to the results in Table 7, RoBERTa-CRF obtained the best results. In the NER task, there are strong dependencies between labels. For example, an I-Drug label can only follow a B-Drug label or another I-Drug label. As a probabilistic model, the CRF can output a predicted sequence that respects such rules. Therefore, the performance of RoBERTa-CRF is better than that of RoBERTa-FC with only one FC layer. The experimental results in Table 7 also demonstrate that adding the LSTM after RoBERTa does not improve recognition performance. On the one hand, the multiheaded self-attention network in the pretraining model has already extracted abstract representations of the input tokens well, so it is not necessary to add the LSTM to extract more abstract representations. On the other hand, a more complex network structure may cause overfitting, which reduces recognition performance.
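The effect of label dependencies on decoding can be illustrated with a small Viterbi sketch. It is a deliberate simplification and not the CRF layer used in the experiments: a trained CRF learns real-valued transition scores, whereas this example only forbids invalid BIO transitions and otherwise sums per-token scores. All names and scores are illustrative.

```python
import math
from typing import Dict, List


def allowed(prev: str, curr: str) -> bool:
    """An I-X label may only follow B-X or I-X; every other transition is allowed."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def constrained_viterbi(emissions: List[Dict[str, float]], labels: List[str]) -> List[str]:
    """Best-scoring label sequence that never violates the BIO constraint."""
    neg_inf = -math.inf
    # A sequence cannot start inside an entity.
    score = {l: (neg_inf if l.startswith("I-") else emissions[0][l]) for l in labels}
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointer = {}, {}
        for curr in labels:
            best_prev, best = None, neg_inf
            for prev in labels:
                if allowed(prev, curr) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[curr] = best + emission[curr] if best_prev is not None else neg_inf
            pointer[curr] = best_prev
        score = new_score
        backpointers.append(pointer)
    path = [max(labels, key=lambda l: score[l])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


labels = ["O", "B-Drug", "I-Drug"]
emissions = [
    {"O": 0.1, "B-Drug": 0.2, "I-Drug": 0.9},
    {"O": 0.1, "B-Drug": 0.1, "I-Drug": 0.8},
    {"O": 0.9, "B-Drug": 0.05, "I-Drug": 0.05},
]
greedy = [max(labels, key=lambda l: e[l]) for e in emissions]
print(greedy)
print(constrained_viterbi(emissions, labels))
```

Choosing the highest-scoring label independently at each token, as an FC layer alone would, starts the sequence with I-Drug, which is invalid. The constrained decoder instead begins the entity with B-Drug.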

Conclusion
In this paper, we utilize pretraining models to recognize named entities in Chinese medical literature, which is the key step in building a medical Knowledge Graph. First of all, we propose a method of data augmentation based on the MLM of ERNIE. A portion of characters and phrases, excluding the named entities, are replaced by synonyms, because the named entities may be proper nouns or rare words in the field of medicine. Moreover, we consider NER as a downstream task of the pretraining models and transfer the prior semantic knowledge obtained during pretraining to it. The experimental results demonstrate that the proposed data augmentation method improves recognition performance and that using pretraining models yields a meaningful improvement over the common deep learning models. Furthermore, for NER in Chinese medical literature, the F1-score can be slightly increased through fine-tuning, and using a more complex downstream model reduces recognition performance. In future work, we will attempt to carry out experiments with a dataset labelled by ourselves and conduct Relation Extraction based on the entities recognized in Chinese medical literature.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.