Research on Named Entity Recognition of Electronic Medical Records Based on RoBERTa and Radical-Level Feature

Clinical named entity recognition (CNER) identifies entities in unstructured medical records and classifies them into predefined categories, which is of great significance for follow-up clinical studies. Most existing CNER methods give insufficient consideration to Chinese radical-level features and to the specialized nature of the Chinese medical field. To address this problem, this paper proposes the Ra-RC model, which combines radical features with a deep learning structure. RoBERTa, a pretrained model built on bidirectional encoder representations from transformers, is used to learn medical features thoroughly. Simultaneously, we use a bidirectional long short-term memory (BiLSTM) network to extract radical-level information, capturing the internal relevance of characters, and concatenate its output with the feature vectors generated by RoBERTa. In addition, a conditional random field (CRF) is applied to model the relationships between labels and obtain the optimal tag sequence. The experimental results demonstrate that the proposed Ra-RC model achieves F1 scores of 93.26% and 82.87% on the CCKS2017 and CCKS2019 datasets, respectively.


Introduction
Named entity recognition (NER) refers to the extraction of specific entities from unstructured texts and plays a vital role in subsequent tasks, such as constructing knowledge graphs and personalized recommendation systems [1][2][3]. In recent years, with the rapid development of medical information technology, the volume of textual data in electronic medical records (EMRs) keeps increasing. As a fundamental Chinese medical information extraction task, named entity recognition of Chinese clinical EMRs has attracted extensive attention [4].
NER of clinical EMRs is the automatic discovery of named entities closely associated with patients' health, such as diseases, drugs, and symptoms. Early research on CNER mainly used lexicon-based and rule-based approaches [5,6], followed by a range of statistical models [7,8]. With the substantial increase in hardware computing power, deep learning methods have been successfully applied to CNER. At present, many approaches focus on transferring generic-domain models: traditional bidirectional long short-term memory networks [9,10] and unsupervised pretrained language models [11][12][13][14] are widely transferred to the CNER field. Both kinds of neural network have achieved state-of-the-art results on general named entity recognition. However, these models leave room for improvement. First, because the LSTM network is generic, the model lacks adequate capacity to extract features, and the features it extracts are limited by the correctness of the dataset annotation and by the context information. Second, the released versions of the pretrained models are better suited to general Chinese entity extraction. Neither is adapted to the characteristics of EMR datasets, so both underperform on medical entity extraction.
Moreover, Chinese clinical named entity recognition poses its own difficulties. First, many clinical named entities are multiword, and some are very long, so it is not easy to determine the word boundaries of multiword medical terms in Chinese. What is more, an identical word or phrase can belong to different kinds of named entities; for example, stroke can act as a modifier, and it can also be classified as a particular disease or as a disease class [15]. In addition, some specific types of medical entities have characteristics that differ from those of general entities, especially at the radical level. For instance, many characters in disease entities have the radical "疒," such as "病" and "痛." In ancient Chinese characters, "月" is related to human organs and flesh, and many entities that denote body parts have the radical "月," such as "脏," "脑," and "骨." These radical-level characteristics are a significant reference for determining labels, especially for complex medical entities made up of multiple categories, such as disease entities in the "body part and symptom" format. However, this information has not been fully utilized by general named entity recognition models.
To address these issues, we propose the Ra-RC model, which combines radical information with a deep learning structure. First, we adopt BiLSTM to encode radical features. Simultaneously, RoBERTa is used to capture the characteristics of medical texts and generate character representations. We then concatenate the radical representations and character representations and use CRF to obtain the predicted label sequences. The feasibility and utility of the proposed method are evaluated extensively on the CCKS2017 and CCKS2019 datasets.
The main contribution can be summarized as follows: (1) Considering the particularity of medical entities and the underutilization of radical-level information, we use BiLSTM to extract the radical features  [16,17]. For example, to address the lack of sufficient annotated data, Yang et al. [18] and Peters et al. [19] used transfer learning and semisupervised learning to extract entities, which significantly improved performance. There are many high-quality annotated datasets in the field of English CNER, such as JNLPBA, BC2GM, and NCBI. Because high-quality EMRs are scarce and nonstandard abbreviations are common, NER in the Chinese clinical domain is difficult [20].
An end-to-end deep learning method can be used to explore deeper features. The main network structure of this method is BiLSTM combined with CRF [21]. Li et al. [22] proposed a conditional random field algorithm that integrated character, speech, and dictionary features on the basis of a purpose-built medical dictionary; their experiments showed that these features improved CNER performance. Wang et al. [23] integrated dictionary features into BiLSTM-CRF, and the results showed that prior knowledge helped improve its performance. Liu et al. [24] compared a CRF model requiring manual features with an LSTM-CRF without manual features and found that the F1 score of the LSTM-CRF on the i2b2 2010, 2012, and 2014 corpora was better than that of CRF. However, the above CNER methods based on deep neural networks could not model the polysemy of words. Devlin et al. proposed the bidirectional encoder representations from transformers (BERT) pretrained language model, which uses bidirectional transformer encoders to capture potential semantic relations. Based on BERT, Liu et al. put forward the RoBERTa model to enhance the performance of BERT. Lan et al. then proposed ALBERT, a lightweight BERT model that uses two strategies to reduce the size of BERT. Dai et al. [25] compared model performance when Word2vec and BERT were each combined with BiLSTM-CRF, and the experiment showed that combining BERT with the traditional BiLSTM-CRF model performed better. However, these models did not fully consider the characteristics of medical datasets, and their performance on medical entity extraction was limited.

Radical-Level Information.
The specialization of the medical field leads to the particular linguistic structure of medical texts, and many researchers have investigated this characteristic. Peng et al. [26] put forward two types of Chinese radical-level hierarchical embeddings, and experimental results showed that radical-level semantics and sentiments performed better than char embeddings and word embeddings on sentence-level emotion classification. A new deep learning technology referred to as "Radical Embedding" was proposed, and Shi et al. [27] conducted three experiments to verify its effectiveness. The results showed that radical embeddings performed as well as competing methods and sometimes better. Yin et al. [28] proposed BiLSTM-CRF based on radical features and used self-attention to capture character dependence. A new strategy was proposed to integrate dictionary information with character representations from BERT, and the F1 value of this method reached 91.60% and 89.56% on CCKS2017 and CCKS2018, respectively. However, in existing research, radical information has not been fully utilized.

Radical Characteristics.
The Chinese electronic medical record datasets differ from other datasets. The frequency of radical-level features in the CCKS2017 and CCKS2019 datasets is shown in Figure 1.
As mentioned in the introduction, the radical "月" is often associated with human organs, the radical "疒" is often related to disease, and the radical "口" frequently appears in symptom entities. As shown in Figure 1, the Chinese five elements "metal, wood, water, fire, and earth" are often included in medical entities. For example, "钅" correlates with microelement and drug names such as "钙" and "铁." "木" is related to "查体" and "脑血栓" and to the names of Chinese patent medicines. "氵" is associated with body fluids (plasma, tissue fluid, and lymphatic fluid) and symptoms such as "渗" and "溶." "火" relates to inflammation-related entities such as "病灶" and "骨髓炎." "土" relates to modifiers of body parts such as "壁" and "型." These radical features play an essential role in identifying medical entities [29]. The radicals come from two sources: a local dictionary and the Baidu Chinese dictionary (https://hanyu.baidu.com). The local dictionary is created by crawling the common words of the Xinhua Dictionary (http://xh.5156edu.com/), which yields a dictionary of key-value pairs in the form "chars-radicals."
3.2. Design of Architecture.
The proposed Ra-RC framework for the clinical named entity recognition task is shown in Figure 2. The framework mainly includes BiLSTM for radical-level representation, sequence modeling, and a label inference layer. We train RoBERTa on both datasets, while radical representations are extracted by BiLSTM. We then concatenate the character representations and radical-level representations and feed them into CRF for decoding.

BiLSTM for Radical-Level Representation.
To make the most of the radical information, it needs to be extracted by a deep learning framework. In both theory and practice, BiLSTM and RoBERTa are well suited to feature extraction tasks, and RoBERTa improves on BERT to achieve better expressive ability. Therefore, this paper chooses these technologies to obtain contextual semantic information. Figure 3 shows an overview of the radical-BiLSTM model. Formally, the input contains two parts: word embedding and radical embedding. First, each word is mapped to its corresponding radicals using a purpose-built mapping dictionary. Second, both words and radicals pass through the same trainable matrix of the lookup layer. Then, to form the preliminary representation of the radical information, we concatenate both embeddings, denoted $X_i$, and feed $X_i$ into the BiLSTM network to extract features.
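The dictionary lookup step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tiny `CHAR_TO_RADICAL` table, the `<UNK>` and `<PAD>` symbols, and the helper names are all assumptions made for the example, and the real dictionary is built from the Xinhua and Baidu sources.

```python
# Hypothetical excerpt of a "chars-radicals" dictionary; the entries follow
# the examples given in the text, the real dictionary is far larger.
CHAR_TO_RADICAL = {
    "病": "疒",
    "痛": "疒",
    "脑": "月",
    "脏": "月",
    "钙": "钅",
    "铁": "钅",
}

UNKNOWN_RADICAL = "<UNK>"  # assumed placeholder for characters not in the dictionary


def chars_to_radicals(sentence, mapping=CHAR_TO_RADICAL):
    """Return one radical per character, falling back to UNKNOWN_RADICAL."""
    return [mapping.get(ch, UNKNOWN_RADICAL) for ch in sentence]


def build_vocab(symbols):
    """Assign an integer id to every distinct symbol; id 0 is reserved for padding."""
    vocab = {"<PAD>": 0}
    for symbol in symbols:
        if symbol not in vocab:
            vocab[symbol] = len(vocab)
    return vocab


sentence = "脑病痛"
radicals = chars_to_radicals(sentence)               # ['月', '疒', '疒']
radical_vocab = build_vocab(radicals)
radical_ids = [radical_vocab[r] for r in radicals]   # [1, 2, 2]
```

The integer ids are what a lookup (embedding) layer would consume; how unknown characters are handled is a design choice that the text does not specify.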
As shown in Figure 3, the radical-level representation $X_i = (x_1, x_2, \cdots, x_n)$ is taken as the input to the BiLSTM network. The BiLSTM network has two kinds of LSTM cells [30], which extract features in the forward ($\overrightarrow{h}$) and backward ($\overleftarrow{h}$) directions. In the LSTM cell, $\sigma(\cdot)$ denotes the element-wise sigmoid function and $\tanh(\cdot)$ denotes the hyperbolic tangent function; $w$ is a weight matrix, and $b$ is a bias; $i_t$, $O_t$, and $f_t$ are the input gate, output gate, and forget gate, respectively.
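For reference, the gates named above are commonly defined as in the standard LSTM formulation below. This is an assumed, textbook form rather than a transcription of the paper's own equations; the candidate state $\tilde{c}_t$, the cell state $c_t$, the subscripted weights and biases, and the concatenation $[h_{t-1}, x_t]$ are notation introduced here.

```latex
\begin{aligned}
i_t &= \sigma\left(w_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(w_f\,[h_{t-1}, x_t] + b_f\right) \\
O_t &= \sigma\left(w_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(w_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= O_t \odot \tanh(c_t)
\end{aligned}
```

In a bidirectional network, the same recurrence is run once left to right and once right to left, and the two hidden states at each position are typically concatenated, $[\overrightarrow{h_t}; \overleftarrow{h_t}]$, to form the output for that position.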
The output of the BiLSTM network is denoted $C_i$, and the character representations extracted from RoBERTa are denoted $P_i$. The final representation $O_i$ is the concatenation of $C_i$ and $P_i$.

Sequence Modeling.
We use the well-known RoBERTa architecture, which consists of bidirectional transformer encoders, for feature extraction and sentence modeling. As an autoencoding language model, it introduces noise into the data and learns to reconstruct the original data: through the masked language model mechanism, it randomly selects some words to be predicted and replaces them with the [MASK] symbol. The training process is shown in Figure 4. First, input sentences are segmented and annotated at the character level.
Second, the sentence is processed as a distributed representation $Y = (Y_1, Y_2, \cdots, Y_t, \cdots, Y_n)$, consisting of token embedding, segment embedding, and position embedding, where $Y_t$ indicates the input state of each character. The transformer encoder is the core component of the RoBERTa pretrained model, and multiheaded attention is the most critical module of the transformer unit.

Wireless Communications and Mobile Computing
The multiheaded attention mechanism is utilized to capture character dependencies. The calculation of the single-head attention mechanism is shown in equation (8).
where $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight parameters of the $i$th calculation.
Then, the results of the $i$th calculations are concatenated and linearly transformed once more to obtain the result of the multiheaded attention calculation. The specific formula is shown in equation (9), where $W^{o}$ is the weight parameter.
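The single-head and multiheaded calculations described above are usually written as in the standard transformer formulation below. This is an assumed form given for orientation, not a transcription of the paper's equations (8) and (9); the query, key, and value matrices $Q$, $K$, $V$, the key dimension $d_k$, and the number of heads $h$ are symbols introduced here.

```latex
\begin{aligned}
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^{\top}}{\sqrt{d_k}}\right) V W_i^V \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{o}
\end{aligned}
```

Each head attends over the whole sentence with its own projections, so different heads can capture different character dependencies before $W^{o}$ mixes them back into a single representation.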
For the CRF layer that infers the labels, $E$ is the emission matrix output by the RoBERTa layer, and $E_{i,j}$ represents the probability that the $i$th word is classified into the $j$th label; $T$ is the transition matrix, and $T_{i-1,i}$ refers to the score of transferring from label $i-1$ to label $i$; and $s(S, y)$ refers to the score of the label prediction sequence $y$ generated for the input sequence $S$.
For a given input sequence $S$, the CRF model is trained by maximizing the log-likelihood function. The higher the score $s(S, y)$, the greater the probability. Here, $Y_x$ is the set of all possible tag sequences for a given sentence $S$, and $\log(p(y \mid S))$ defines the loss function.
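The score and likelihood discussed above are commonly written as in the standard linear-chain CRF formulation below. This is an assumed form, not a transcription of the paper's equations: the sentence length $n$, the label $y_i$ at position $i$, and the dummy sequence $\tilde{y}$ are notation introduced here, and start or end transition terms, which some implementations add, are omitted.

```latex
\begin{aligned}
s(S, y) &= \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=2}^{n} T_{y_{i-1}, y_i} \\
p(y \mid S) &= \frac{\exp\left(s(S, y)\right)}{\sum_{\tilde{y} \in Y_x} \exp\left(s(S, \tilde{y})\right)} \\
\log p(y \mid S) &= s(S, y) - \log \sum_{\tilde{y} \in Y_x} \exp\left(s(S, \tilde{y})\right)
\end{aligned}
```

Training maximizes the last expression for the gold label sequence, which is why a higher score $s(S, y)$ corresponds to a higher probability.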
In the decoding process, the Viterbi algorithm is used to find the globally optimal label sequence of the CRF, $y^* = \arg\max_{\tilde{y} \in Y_x} s(S, \tilde{y})$, where $y^*$ is the sequence for which the score function achieves its maximum value.
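The decoding step can be illustrated with a small pure-Python Viterbi sketch. It assumes a score made of per-position emission scores plus label-to-label transition scores, with no start or end transitions; the function names and the toy matrices are hypothetical and are not taken from the paper's code.

```python
def sequence_score(emissions, transitions, tags):
    """Score of one tag path: emission scores plus transition scores along the path."""
    score = emissions[0][tags[0]]
    for i in range(1, len(tags)):
        score += transitions[tags[i - 1]][tags[i]] + emissions[i][tags[i]]
    return score


def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) maximizing sequence_score over all tag paths."""
    n_tags = len(emissions[0])
    # best[j] is the best score of any path that ends in tag j at the current position
    best = list(emissions[0])
    backpointers = []
    for i in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_tags):
            # choose the previous tag that gives the highest score for reaching tag j
            prev = max(range(n_tags), key=lambda k: best[k] + transitions[k][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[i][j])
        best = new_best
        backpointers.append(pointers)
    # start from the best final tag and follow the back-pointers to the first position
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Toy example: 3 characters, 2 labels (values chosen only for illustration).
toy_emissions = [[2.0, 0.5], [0.25, 1.0], [1.5, 0.25]]
toy_transitions = [[0.5, -1.0], [-0.5, 0.75]]
toy_score, toy_path = viterbi_decode(toy_emissions, toy_transitions)  # (4.75, [0, 0, 0])
```

Dynamic programming keeps the cost linear in sentence length, whereas enumerating every tag sequence would grow exponentially.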
4. Experiments
4.1. Datasets.
In this study, the CCKS2017 and CCKS2019 datasets are utilized to conduct experiments. The datasets contain actual EMR data, and a professional medical team manually annotated all EMR corpora. As we did not participate in the competition, the CCKS2017 dataset is incomplete. The numbers of the various types of medical entities are given in Figure 5. The Beginning, Inside, Outside (BIO) sequence labeling system, a standard labeling strategy in the NER field, is adopted in this study. Note that "B" marks the starting position of a medical entity, "I" marks a position inside a medical entity, and "O" indicates that the character is not part of a medical entity; the labels take the form "B-X," "I-X," and "O," where X represents the type of medical entity.
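To make the BIO scheme concrete, the sketch below converts a tag sequence into entity spans. It is an illustration only: the function name is hypothetical, and the choice to treat an "I-X" tag that does not continue an open entity of type X as "O" is an assumption, since the text does not say how such cases are handled.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (entity_type, start, end) spans, with end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            # a new entity begins; close the one that was open, if any
            if current is not None:
                spans.append((current, start, i))
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current == tag[2:]:
            continue  # still inside the current entity
        else:
            # "O", or an "I-" tag that does not continue the open entity
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


example_tags = ["B-Symptom", "I-Symptom", "O", "B-Disease", "I-Disease", "I-Disease"]
example_spans = bio_to_spans(example_tags)
# [('Symptom', 0, 2), ('Disease', 3, 6)]
```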

Evaluation.
In this experiment, precision (P), recall (R), and F1 score are used as the comprehensive evaluation indexes of NER. The specific formulas are as follows: $P = \frac{TP}{TP + FP}$, $R = \frac{TP}{TP + FN}$, and $F1 = \frac{2 \times P \times R}{P + R}$, where TP is the number of correctly identified medical entities, FP is the number of entities identified incorrectly, and FN is the number of medical entities that were not identified.
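The three indexes can be computed from sets of gold and predicted entities as in the sketch below. It assumes strict matching, where a prediction counts as correct only if its type and both boundaries equal a gold entity; that matching rule and the function name are assumptions for illustration, not details stated in the text.

```python
def precision_recall_f1(gold_spans, predicted_spans):
    """Entity-level P, R, and F1 under exact type-and-boundary matching."""
    gold, pred = set(gold_spans), set(predicted_spans)
    tp = len(gold & pred)   # predicted entities that match a gold entity
    fp = len(pred - gold)   # predicted entities with no gold match
    fn = len(gold - pred)   # gold entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold_entities = [("Symptom", 0, 2), ("Disease", 3, 6), ("Check", 8, 10)]
predicted_entities = [("Symptom", 0, 2), ("Disease", 3, 5)]
p, r, f1 = precision_recall_f1(gold_entities, predicted_entities)
# one exact match, one wrong boundary, two missed gold entities: P = 0.5, R = 1/3, F1 = 0.4
```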

Environment.
In this experiment, the NER model is built on the TensorFlow framework. The hardware and software environments are listed in Table 1.

Compare Three Pretraining Models.
To better integrate with the radical-level information, three pretraining models are trained and tested on the extraction of medical entities, and the best one is selected as our baseline model.
As observed in Tables 2 and 3, RoBERTa is the most effective at extracting entities. The reason is that RoBERTa is trained with more data, more steps, and larger batches than BERT. Moreover, the RoBERTa-wwm-ext-large model has a 24-layer transformer, which gives it a more robust feature extraction capability.

Ablation Experiments.
We take RoBERTa-CRF as the baseline model, and the comparison after adding the radical information is shown in Tables 4 and 5. RC stands for RoBERTa-CRF, and Ra-RC means adding radical information.
The extraction results for medical entities on CCKS2017 are shown in Table 4. The F1 values of the "Symptom" and "Check" categories are the highest, at 96.53 and 96.36, respectively. By contrast, recognition of "Treatment" is weaker, indicating that this type of entity is difficult to recognize. According to Figure 1, the sample size of this type of entity is small, accounting for only 3.60%. Hence, the neural network does not have enough samples to learn features, and the structure of these entities resembles that of "Disease" entities, which makes classification errors likely. For example, "输卵管结扎术" and "输卵管结扎术后" belong to different entity classes: the former belongs to the "disease and diagnosis" class and the latter to the "Treatment" class. On the whole, our Ra-RC model, which uses radical-level information extracted by BiLSTM, achieves the best performance, with an F1 value of 93.26%, a precision of 94.14%, and a recall of 92.39% on the CCKS2017 dataset. The F1 value of Ra-RC is 1.2% higher than that of RC. By entity category, the F1 values of all categories are higher than those of RC except for "Disease" and "Treatment."
The extraction results for medical entities on CCKS2019 are shown in Table 5. On the CCKS2019 dataset, the Ra-RC model, which incorporates radical-level information, improves the F1 value by 1.9% compared with the RC model, which lacks radical-level information. All entity types improve except for "Disease." "Medicine" has the best recognition performance of all entity types, with an F1 score of 92.77%. However, recognition of the "Lab-Check" class is weaker, because some "Lab-Check" entities have a complex composition that often causes errors in boundary judgment. For instance, these entities are often composed of letters and other characters, such as "CEA," "F/T," "T-PSA," and "CA125." Moreover, some "Image-Check" entities are made up of letters, such as "OR" and "CT." Because "Image-Check" and "Lab-Check" entities have similar structures, the model cannot reliably determine the boundary between the two entity classes.
To further compare the performance of RC and Ra-RC, we also calculate the F1 value, recall, and precision of the different methods, as shown in Figure 6. RC (17) and RA-RC (17) denote experiments conducted on the CCKS2017 dataset, and RC (19) and RA-RC (19) denote experiments conducted on the CCKS2019 dataset. In Figure 6, the F1 score of the Ra-RC model is higher than that of RC on both datasets.

Comparative Experiment with Existing Research Work.
In addition to the basic models described above, several researchers have conducted CNER studies on both datasets. For example, Li et al. [33] use a BiLSTM-CRF model combined with specialized word embeddings for CNER tasks. They use health domain datasets to create richer and more robust word embeddings, and they also use external health domain vocabulary to improve entity recognition results. Ouyang et al. [34] apply a BiLSTM-CRF model combined with the n-gram algorithm to CNER tasks and introduce three types of external information as inputs to the model. Qiu et al. [35] use Chinese characters and dictionary features as input and then feed them into the     [36] propose a method that combines a language model and multihead attention. First, the sentence vectors are fed into BiGRU and the pretrained model, and the outputs of the two are concatenated. The result is then passed to a block of BiGRU and multihead attention. Wang et al. [23] construct a medical domain dictionary using relevant medical resources and then integrate the dictionary features and word vectors into the BiLSTM-CRF model to identify entities. Luo et al. [4] propose a CNER method based on ELMo and multitask learning. The ELMo is trained with stroke features added as input, and multitask learning is used to make full use of existing data to improve the model's performance. Yin et al. [28] propose the AR-CCNER model, in which the radical feature is extracted by a convolutional neural network (CNN) and BiLSTM-attention is used to capture contextual features and the dependencies between characters.
The experimental results are shown in Table 6. Although all experiments are based on the CCKS2017 dataset, our dataset may not be the same as those of the above researchers because we did not participate in the competition.
The results show that Ra-RC achieves better precision and F1 score on the CCKS2017 dataset. The model of Li et al. [33] performs the worst because their approach is based on word segmentation, which prevents the model from identifying word boundaries well. The word set is much larger than the character set, which means that the corpus is not sufficient for the model to learn word embedding information effectively. What is more, the results show that the model of Yin et al. [28], which incorporates radical information, performs well on CCKS2017, with an F1 score of 92.79%. This also shows that the radical feature is helpful for entity extraction. Moreover, they use self-attention to capture intercharacter dependencies, which enhances entity extraction, and the F1 score of the model reaches 93.00%. However, this method is not compared with pretrained models, which are the mainstream in the CNER field. We compare the entity extraction performance of three pretrained models (BERT, ALBERT, and RoBERTa) on the CCKS2017 dataset and combine them with another mainstream CNER technique (BiLSTM) to improve performance. The results show that pretraining helps to improve the performance of the model.
The comparison on the CCKS2019 dataset is shown in Table 7. The experimental results show that the proposed model is superior to the baseline model in P, R, and F1 values. The F1 value of our model is slightly higher than that of the model proposed by Liang et al. [37], which indicates that the recognition ability of the two is similar.

Conclusions
To address the insufficient medical entity extraction performance caused by transferring generic algorithms, we propose the Ra-RC model, which combines radical information extracted by BiLSTM with character features captured by the pretrained model. To achieve better entity extraction, we train three pretrained models for comparison. In addition, we introduce the radical feature, which can be seen as morphological information that enhances semantic information. We then concatenate both vectors and feed them into CRF to obtain the corresponding label sequences. The experimental results on both datasets show that the Ra-RC method proposed in this paper is superior to the baseline model.
A follow-up study will focus on how to distinguish entities with similar text structures more accurately. In addition, we will apply this method to downstream tasks, such as medical relation extraction and medical knowledge graph construction.

Data Availability
We used the CCKS2017 and CCKS2019 datasets for our experiments. The datasets can be downloaded from the following link: https://github.com/baiyewww/Data.

Conflicts of Interest
The authors declare no conflicts of interest.