A BERT-BiGRU-CRF Model for Entity Recognition of Chinese Electronic Medical Records

Because of the difficulty of processing electronic medical record data from patients with cerebrovascular disease, there is little mature technology capable of recognizing named entities of cerebrovascular disease. Excellent results have been achieved in the field of named entity recognition (NER), but several problems remain in the preprocessing of Chinese named entities with multiple meanings, among them the neglect of contextual information. Therefore, to extract five categories of key entity information (diseases, symptoms, body parts, medical examinations, and treatments) from electronic medical records, this paper proposes a BERT-BiGRU-CRF named entity recognition method applied to the field of cerebrovascular diseases. The BERT layer first converts the electronic medical record text into low-dimensional vectors, which are then fed to the BiGRU layer to capture contextual features; finally, a conditional random field (CRF) captures the dependencies between adjacent tags. The experimental results show that the F1 score of the model reaches 90.38%.


Introduction
Named entity recognition aims to extract entities with actual meaning from massive unstructured text data [1,2]. In the medical field, medical entities mainly include symptoms, examinations, diseases, drugs, treatments, operations, body parts, etc., and are an important part of building a medical knowledge base. Chinese electronic medical records (CMRs) [3] combine structured and unstructured text; they generally include not only patient information but also a large amount of medical knowledge, yet they are difficult to process. With the development of deep learning technology, entity recognition algorithms have been applied in many fields, but applications in the field of cerebrovascular diseases (CVD) are lacking [4].
Cerebrovascular diseases have become one of the diseases most threatening to human health in the world because of their four characteristics [5,6]. The treatment of cerebrovascular diseases depends highly on the doctor's experience. With the increase in the number of patients with CVD, there is greater demand for cerebrovascular disease physicians. Since the training cycle of professional doctors is relatively long [7,8], an imbalance of "more patients and fewer doctors" arises between supply and demand. With the introduction of the concept of "AI + Medical," machine learning technology is used to assist diagnosis and treatment: a complex model is constructed, a feedback mechanism continuously optimizes its parameters, and the existing clinical data and neuroimaging data in the hospital are then used to diagnose and treat cerebrovascular diseases or predict recurrence. On the one hand, assisting diagnosis and treatment decision-making helps improve the professional level of doctors and the quality of CVD medical services; on the other hand, it can mitigate the uneven distribution of medical resources [9]. At present, machine learning research in the field of CVD mainly focuses on two aspects, the diagnosis and the prognosis prediction of cerebrovascular disease: (1) From the perspective of CVD diagnosis, most scholars nest machine learning models over structured data to complete disease diagnosis. The literature [10][11][12] established joint diagnosis models based on logistic regression and the XGBoost machine learning method by collecting clinical data on demographic characteristics.
(2) From the perspective of prognosis prediction, machine learning methods for risk prediction have gradually become the trend in disease prediction, and methods such as random forest, decision tree, and SVM have achieved certain research results in the prediction of cerebrovascular diseases. The literature [13][14][15] constructed logistic regression, k-NN, random forest, decision tree, and SVM models based on follow-up data and verified the advantages of machine learning models in cerebrovascular disease risk prediction, showing that neural network models score better. In short, in terms of clinical data sources, CVD medical data include cerebrovascular disease imaging data, follow-up data, electronic medical records, and other data. Most scholars still focus on structured data such as follow-up data and neuroimaging data, while attention to electronic medical record data in the field of CVD is comparatively low. At present, the increase in the number of CVD patients is accompanied by an ever-increasing number of electronic medical records for CVD patients, and electronic medical records can provide scholars with more data resources. For processing the unstructured text in electronic medical records, named entity recognition (NER) is a key step, and there is relatively little research dedicated to named entity recognition in the cerebrovascular field. Current research on named entity recognition focuses on three aspects: (1) From the perspective of traditional entity recognition methods, traditional methods include those based on dictionaries and rules [16][17][18][19][20][21][22][23].
Such methods rely heavily on domain dictionaries and domain experts; features are selected manually, so subjectivity and labor costs are relatively high. With the development of machine learning technology [19,20], more and more scholars have turned to models such as the conditional random field (CRF), the Hidden Markov Model (HMM), and the Support Vector Machine (SVM). However, NER based on traditional machine learning has higher requirements for feature selection [21][22][23], and the quality of feature selection directly affects the effect of entity recognition. (2) From the perspective of deep learning methods: with the development of deep learning technology, the literature [24][25][26][27] confirmed the advantages of deep neural networks by comparison with traditional CRF models; that is, deep neural networks require less artificial feature intervention than traditional methods and can achieve higher precision and recall. Deep learning can automatically extract word features, reduce the subjectivity of feature selection, and help further improve the accuracy of recognition results, so it outperforms traditional statistical algorithms such as CRF and HMM. A common single-model entity recognition neural network generally considers only the sample input and lacks modeling of the relationships among outputs. Based on the idea of model fusion, most scholars use LSTM-CRF as the main framework to remedy the deficiencies of plain neural network models. The available literature [28][29][30][31][32] uses traditional word2vec, GloVe, and other word vector methods, takes a BiLSTM-CRF model as the core, and adds a CNN model, an attention mechanism, an RNN model, etc., to the core framework. The word vectors then undergo continuous fine-tuning of parameters, so the final recognition becomes more accurate; parameter tuning is used to set the hyperparameters of the model.
However, the BiLSTM-CRF model has many parameters to set, and model training takes longer. The literature [26-31, 33, 34] proposed the BiGRU neural model, which has a simple structure and high computational efficiency, can make full use of context information to eliminate entity ambiguity, and achieves good results in entity recognition. (3) From the perspective of pre-training models, the preprocessing models described above all use traditional word vector methods such as word2vec and GloVe. These methods focus on feature extraction between words and often ignore contextual information. To address this problem, after Google's BERT pre-training model was proposed, the literature [35][36][37][38] combined the BERT word embedding model with the traditional BiLSTM-CRF model, modeling the polysemy of a word in combination with its context; the P value, R value, and F1 score all improved. It can be seen that BERT has strong semantic analysis capabilities.
To solve the problems of ignored context information, low model efficiency, and susceptibility to word segmentation errors in entity recognition for electronic CVD medical records, we propose a BERT-BiGRU-CRF neural network model to identify named entities in electronic medical records of cerebrovascular diseases. Specifically, the BERT layer first converts the electronic medical record text into low-dimensional vectors, which are then input to the BiGRU layer to capture contextual features; finally, a CRF captures the dependencies between adjacent tags. The entity extraction model proposed in this paper achieves good recognition results.

BERT-BiGRU-CRF Model Construction
In the NER field, using deep neural network models for entity recognition has become mainstream. This article uses BiGRU-CRF as a benchmark to extract named entities in the field of cerebrovascular diseases. The BERT pre-training language model is chosen because the text vector serves as the model input, and Chinese text can be divided at the character level or the word level. Existing research shows that character-level pretraining schemes perform better [37,39], and BERT is a character-level pretraining scheme.
That is, each word in the text is converted into a vector by querying the word vector table and used as the model input; the model output is the vector representation combined with the context.
The overall structure of the BERT-BiGRU-CRF model is shown in Figure 1. The model is mainly divided into three layers. The first layer is the BERT layer: through the BERT pretraining language model, each word in the sentence is converted into a low-dimensional vector. The second layer is the BiGRU layer, which automatically extracts semantic and temporal features from the context. The third layer is the CRF layer, which resolves the dependencies between output tags to obtain the globally optimal annotation sequence for the text.
In this study, the named entity recognition model was used to identify medical named entities in electronic medical records of cerebrovascular disease. The specific steps are as follows: (1) EMR data preprocessing, that is, processing the original electronic medical record text data set: the electronic medical record text set is expressed as D = {d_1, d_2, ..., d_N}, and each document is represented as a character sequence c_1, c_2, ..., c_m, segmented and annotated at the character level, with characters and their predefined categories separated by spaces.
(2) Construct the electronic medical record text training data set. (3) Model training, that is, training the BERT-BiGRU-CRF named entity recognition model. Take the electronic medical record test text collection D_test = {d_1, d_2, ..., d_N} as input, and take each entity and its corresponding category pair (m_i, c_{m_i}) as output, where m_i is an entity that appears in the document; b_i and e_i, respectively, denote the start and end positions of m_i in the document; entities are required not to overlap, i.e., e_i < b_{i+1}; and c_{m_i} denotes the predefined category of the entity m_i. The F1 score is then calculated from the precision and recall rates and used as the comprehensive evaluation index of the model.

BERT Pre-training Language Model.
Bidirectional Encoder Representation from Transformers (BERT) [40] is an unsupervised, deep, bidirectional language representation model for pre-training. To accurately represent the context-dependent semantic information in the EMR, the model's interface is called to obtain the embedded representation of each word in the electronic medical record. BERT uses a deep bidirectional transformer encoder as its main structure. The transformer introduces the self-attention mechanism and also borrows the residual mechanism of convolutional neural networks, so the model trains quickly and has strong expressive power; it also abandons the RNN loop structure. The overall structure of the BERT model is shown in Figure 2.

E_n is the encoded representation of a word, Trm is the transformer structure, and T_n is the word vector of the target word after training. The operating principle of the model is to use the transformer structure to construct a multi-layer bidirectional encoder network that can read the entire text sequence at once, so that every layer integrates contextual information. The input of the BERT model is formed by embedding addition: three vectors, Token Embeddings, Segment Embeddings, and Position Embeddings, are summed, serving both the pre-training objective and next-sentence prediction. In Chinese electronic medical record text, characters or words in different positions have different semantics. The transformer embeds the relative or absolute position information of each token in the sequence, as shown in the following formulae:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model)),
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model)),

where pos is the position of the word in the text, i is the dimension index, and d_model is the dimension of the encoded vector. Odd dimensions are encoded with the cosine function; even dimensions are encoded with the sine function.
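As a concrete illustration, the sinusoidal positional encoding above can be computed as in the following sketch (NumPy). Note that released BERT checkpoints actually learn their position embeddings; this reproduces the original Transformer formulae described here, not BERT's exact implementation.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding: even dimensions use the sine
    function, odd dimensions the cosine, with wavelengths that grow
    geometrically with the dimension index."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)                   # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

Because the encoding depends only on position and dimension, it can be precomputed once and added to the token embeddings.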
To better capture word-level and sentence-level information, the BERT pre-training language model is jointly trained on two tasks: the Masked Language Model (Masked LM) and Next Sentence Prediction (NSP). The Masked LM task [36] is similar to cloze filling: 15% of the words in the corpus are randomly selected for masking, and the BERT model must correctly predict the masked words. The training strategy is that, of the selected 15% of words, only 80% are actually replaced with [MASK], 10% are randomly replaced with other words, and the remaining 10% are left unchanged. The NSP task trains the model to understand the relationship between sentences, that is, to judge whether one sentence follows another. Specifically, 50% of sentence pairs are correct consecutive pairs drawn from the text corpus and 50% are randomly formed pairs, and the model judges whether each pair is correct. Joint training on Masked LM word processing and NSP sentence processing ensures that the vector representation of each word is comprehensive and semantically accurate. It fully captures character-level, word-level, and sentence-level features, and even inter-sentence relationships, increasing the generalization ability of the BERT model.
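The 80%/10%/10% masking strategy described above can be sketched as follows. The function `mask_tokens`, its signature, and the toy vocabulary are illustrative choices for this sketch, not the authors' preprocessing code.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=None):
    """BERT-style masking: select ~15% of positions; of those, 80%
    become [MASK], 10% become a random vocabulary token, and 10% are
    left unchanged. Returns (masked_tokens, labels), where labels hold
    the original token at selected positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = list(tokens), [None] * len(tokens)
    for idx in range(len(tokens)):
        if rng.random() < mask_rate:
            labels[idx] = tokens[idx]
            roll = rng.random()
            if roll < 0.8:
                masked[idx] = "[MASK]"
            elif roll < 0.9:
                masked[idx] = rng.choice(vocab)
            # else: keep the original token unchanged
    return masked, labels
```

Only the positions with a non-None label contribute to the Masked LM loss.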

BiGRU Layer.
The Gated Recurrent Unit (GRU) [34] is a variant of the long short-term memory (LSTM) neural network. The LSTM structure includes a forget gate, an input gate, and an output gate. In training traditional recurrent neural networks (RNNs), gradient vanishing or explosion problems often occur. LSTM solves the gradient vanishing problem only to a certain extent, and its computation is time-consuming. The GRU structure includes an update gate and a reset gate; the GRU merges the forget gate and the input gate of the LSTM into a single update gate. Therefore, the GRU retains the advantages of the LSTM while simplifying its network structure. In the entity recognition task for cerebrovascular disease electronic medical records, the GRU can extract features effectively. Its network structure is shown in Figure 3.
In the GRU structure, the update gate is z and the reset gate is r. The update gate z_t determines how much electronic medical record information from the previous hidden state h_{t−1} needs to be transmitted to the current hidden state h_t. z_t takes a value in [0, 1]: the information is transmitted when the value is close to 1 and ignored when it is close to 0. The reset gate r_t is calculated similarly to the update gate, but with a different weight matrix. The calculation of z_t and r_t is shown in formulae (3) and (4): the electronic medical record input x_t at time t and the hidden state h_{t−1} at the previous time are multiplied by their corresponding weight matrices, summed, and passed through the σ function:

z_t = σ(w_z · [h_{t−1}, x_t]),        (3)
r_t = σ(w_r · [h_{t−1}, x_t]).        (4)
After z_t and r_t are computed, the content to be memorized at time t can be calculated. Second, the reset gate determines which information from the hidden state at t − 1 should be ignored at time t. Then r_t, h_{t−1}, and x_t are input, and the tanh function is used to compute the candidate hidden state h̃_t. Finally, h_t transfers the cerebrovascular disease electronic medical record information retained in the current unit to the next unit; that is, at time t, the product of z_t and h̃_t represents the information that the hidden unit h_t needs to retain, while the product of (1 − z_t) and h_{t−1} indicates how much information is forgotten. The calculation is shown in formulae (5) and (6):

h̃_t = tanh(w · [r_t ⊙ h_{t−1}, x_t]),        (5)
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,        (6)

where x_t is the input of the electronic medical record of cerebrovascular disease at time t; h_{t−1} is the hidden state at the previous time; h_t is the hidden state at time t; w is the weight matrix of the candidate state, w_z is the update gate weight matrix, and w_r is the reset gate weight matrix; σ is the sigmoid nonlinear transformation and tanh is the activation function; and h̃_t is the candidate hidden state. The GRU unit can discard useless information, and its simple structure reduces computational complexity. However, a unidirectional GRU cannot fully utilize the context of the electronic medical record. Therefore, this paper adds a backward GRU to learn backward semantics; the forward and backward GRU networks jointly extract the key features of named entities in the electronic medical record of cerebrovascular disease, forming the BiGRU model. The specific structure is shown in Figure 4.
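A minimal NumPy sketch of one GRU step following formulae (3)-(6). The weight shapes and the convention of acting on the concatenation [h_{t−1}; x_t] are assumptions for illustration, and biases are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One GRU step: update gate z_t, reset gate r_t, candidate state
    h~_t, and the convex combination producing h_t. Each weight matrix
    has shape (d_h, d_h + d_x) and acts on [h_prev; x_t]."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                            # update gate, formula (3)
    r_t = sigmoid(Wr @ hx)                            # reset gate, formula (4)
    h_cand = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # formula (5)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand         # formula (6)
    return h_t
```

With all-zero weights, both gates evaluate to 0.5 and the candidate state to 0, so the step simply halves the previous hidden state, which is a convenient sanity check.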
Based on the GRU principle, the forward GRU obtains the preceding semantic features h→_t, the backward GRU obtains the following semantic features h←_t, and finally the two are combined to form h_t, as shown in formulae (7) and (8):

h→_t = GRU→(x_t, h→_{t−1}),        (7)
h←_t = GRU←(x_t, h←_{t−1}),        (8)

where h→_t is the forward hidden state, whose purpose is to capture the preceding information; h←_t is the backward hidden state, whose purpose is to capture the following information; GRU→(·) denotes front-to-back feature extraction and GRU←(·) denotes back-to-front feature extraction; and the final hidden state h_t = [h→_t; h←_t] is the feature representation of the electronic medical record report.
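The bidirectional pass of formulae (7)-(8) can be sketched generically: one recurrent cell reads the sequence left-to-right, another right-to-left, and each position's feature is the concatenation of the two hidden states. The helper `bigru_features` and the pluggable `step` function are illustrative, not the authors' implementation.

```python
import numpy as np

def bigru_features(xs, step, d_h):
    """Run a recurrent cell `step(h_prev, x) -> h` forward and backward
    over the sequence `xs` and concatenate [h_fwd; h_bwd] per position,
    as in formulae (7)-(8)."""
    T = len(xs)
    h_fwd = [None] * T
    h = np.zeros(d_h)
    for t in range(T):                       # front-to-back pass
        h = step(h, xs[t])
        h_fwd[t] = h
    h_bwd = [None] * T
    h = np.zeros(d_h)
    for t in reversed(range(T)):             # back-to-front pass
        h = step(h, xs[t])
        h_bwd[t] = h
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```

Any cell with the `(h_prev, x) -> h` signature, such as the GRU step of formulae (3)-(6), can be plugged in as `step`.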

CRF Layer.
The NER problem can be regarded as a sequence labeling problem. The BiGRU layer outputs the hidden-state context feature vector h = (h_1, h_2, ..., h_t). This vector considers only the context information in the electronic medical record and ignores inter-label dependencies. Therefore, this paper adds a CRF layer to produce the globally optimal label sequence, converting the hidden state sequence h = (h_1, h_2, ..., h_t) into the optimal label sequence y = (y_1, y_2, ..., y_t). The CRF calculation principle [34] is as follows: first, for the specified electronic medical record input sequence x = (x_1, x_2, ..., x_t), the score of each candidate sequence is calculated as in formula (10); second, the normalized probability of sequence y is calculated through the Softmax function as in formula (11); finally, the label sequence with the highest score is found using the Viterbi algorithm as in formula (12):

score(h, y) = Σ_t (A_{y_{t−1}, y_t} + W^T_{y_t} h_t),        (10)
p(y | h) = exp(score(h, y)) / Σ_{y′ ∈ Y(h)} exp(score(h, y′)),        (11)
y* = argmax_{y′ ∈ Y(h)} score(h, y′),        (12)

where A is the transition score matrix between tags; score(h, y) is the sequence score; W^T_{y_t} is the parameter vector; p(y | h) is the normalized probability; Y(h) denotes all possible tag sequences; and formula (10) computes score(h, y) for each position of the input sequence from the output feature matrix of the BiGRU layer and the CRF transition matrix.
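As an illustration of formula (12), Viterbi decoding over per-position emission scores and a tag-transition matrix can be sketched as follows. This is a generic linear-chain CRF decoder, not the authors' code; the start/end transition terms that some implementations add are omitted.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence under a linear-chain CRF.
    emissions: (T, K) per-position tag scores (the projected BiGRU
    outputs); transitions: (K, K) tag-to-tag scores, the matrix A."""
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each tag
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # total[i, j]: best path ending in tag i at t-1, then moving to j
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best_last = int(score.argmax())
    path = [best_last]
    for t in range(T - 1, 0, -1):        # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(score.max())
```

Dynamic programming makes the search over all K^T tag sequences run in O(T·K²) time.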

Training Process.
The process of training a deep network model is one of repeatedly adjusting parameters so that the loss reaches a minimum. However, because deep network models have strong learning ability, generalization problems are prone to occur; for example, under-fitting and over-fitting lead to poor adaptability of the model to new sample data. Regularization methods yield models with small parameter values; such models have strong anti-interference ability and can adapt to different datasets and different "extreme conditions," which increases the generalization capability of the model during network training. This paper uses the L2 regularization method to avoid the over-fitting problem, that is, a regularization term is added to the cost function, as shown in the following formula:

E = E_in + (λ / N) Σ_i w_i²,

where E_in is the training sample error excluding the regularization term; λ is the adjustable regularization parameter; N is the number of training samples; and w_i represents the weight parameters.
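The regularized cost can be computed directly, as in the sketch below. The λ/N scaling follows the formula as written here; some references use λ/(2N) instead, and the function and argument names are illustrative.

```python
import numpy as np

def l2_regularized_loss(train_loss, weights, lam, n):
    """Add an L2 penalty to the training error:
    E = E_in + (lam / n) * sum over all weight entries of w_i**2."""
    penalty = (lam / n) * sum(float(np.sum(w ** 2)) for w in weights)
    return train_loss + penalty
```

The penalty grows with the squared magnitude of the weights, which pushes the optimizer toward small-parameter models.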

Data Preparation.
The experimental data in this article were obtained from the Ai'ai medical electronic medical records website: a total of 1,300 electronic medical records related to cerebrovascular diseases, consisting of general patient information, chief complaint, medical history, physical examination, and diagnosis. In addition, this article surveys the entity types used in published papers on named entities in electronic medical records, as shown in Table 1. According to the frequency of entity occurrence in the published literature, electronic medical record entities for cerebrovascular diseases are divided into five types: disease, symptom, body part, examination, and treatment, which are also the types proposed by CCKS.

Data Preprocessing.
The electronic medical record text is preprocessed, that is, line breaks, invalid characters, etc., are removed, finally yielding 36,400 sentences. The data are split into training and testing sets at a ratio of 8 : 2. The labeling scheme used in this article is BIO labeling. The five entity types are disease, symptom, body part, examination, and treatment; therefore, there are 11 labels: O, Disease-B, Disease-I, Body-B, Body-I, Symptom-B, Symptom-I, Examination-B, Examination-I, Treatment-B, and Treatment-I. Named entity labeling was conducted together with doctors. For 300 of the medical records, two designated annotators annotated the same records, and Cohen's kappa was used to measure inter-annotator agreement, yielding a kappa value of 0.8. The labels to be predicted are shown in Table 2.
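Character-level BIO annotation with the 11 labels above can be illustrated as follows; `bio_label` and its span format are hypothetical helpers, not the annotation tooling used in the study.

```python
def bio_label(sentence, entities):
    """Character-level BIO labeling. `entities` maps (start, end) spans
    (end exclusive) to one of the five entity types used here, e.g.
    "Symptom"; every other character is tagged O."""
    labels = ["O"] * len(sentence)
    for (start, end), etype in entities.items():
        labels[start] = f"{etype}-B"              # entity-initial character
        for i in range(start + 1, end):
            labels[i] = f"{etype}-I"              # entity-internal characters
    return list(zip(sentence, labels))
```

Each (character, label) pair can then be written out with a space separator, matching the annotation format described in the preprocessing step.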

Experimental Settings.
The experimental model in this article is built using the TensorFlow deep learning framework and the Python programming language. During training, only the parameters of the BiGRU-CRF layers are updated, while the BERT parameters are fixed. Table 3 lists the hyperparameter values of the experimental model. These values follow the relevant literature [34][35][36][37][38][41] and were not tuned specifically for the cerebrovascular electronic medical record data set in this article. Model parameters are optimized with stochastic gradient descent (SGD); the initial learning rate is 0.015, the learning rate is updated with a step decay schedule, and the decay rate is 0.05. The model achieved good experimental results on both the training and test sets.
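The paper does not give the exact step-decay rule, so the following is one common interpretation of "initial learning rate 0.015, decay rate 0.05": the rate is multiplied by (1 − decay) at fixed epoch intervals. The function name and the `step_size` parameter are assumptions for this sketch.

```python
def step_decay_lr(initial_lr, decay_rate, epoch, step_size=1):
    """Step-decay schedule: lr = lr0 * (1 - decay_rate) ** k, where k
    is the number of completed decay intervals of `step_size` epochs.
    One possible reading of the schedule described in the text."""
    k = epoch // step_size
    return initial_lr * (1.0 - decay_rate) ** k
```

Unlike smooth exponential decay, the rate here stays constant within each interval and drops in discrete steps.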

Evaluation.
This article uses the evaluation indices most commonly used in the field of named entity recognition: precision (P), recall (R), and the F1 score. P is the proportion of recognized named entities that are correct, R is the proportion of named entities in the test set that are correctly recognized, and F1 is the harmonic mean of P and R, serving as the comprehensive evaluation index of the model. Higher P and R mean higher precision and recall, but in practice the two are sometimes in tension; therefore, the F1 score is often used to evaluate the overall performance of the model. The calculation formulae are:

P = T_num / S_num,
R = T_num / C_num,
F1 = 2 × P × R / (P + R),

where T_num is the number of correctly identified entities; S_num is the total number of entities identified; and C_num is the number of entities in the test set.
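The three formulae translate directly to code; the function and argument names below are illustrative, and the zero-denominator guards are a practical addition not discussed in the text.

```python
def precision_recall_f1(n_correct, n_predicted, n_gold):
    """P = correct/predicted, R = correct/gold, F1 = harmonic mean of
    P and R, matching T_num, S_num, and C_num in the formulae above."""
    p = n_correct / n_predicted if n_predicted else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

For entity-level evaluation, an entity counts as correct only when both its span and its type match the gold annotation.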

Model Performance Comparison.
The models compared in this paper are first trained on the electronic medical record data and then evaluated on the test set. The specific comparison results are shown in Table 4.
The comparison results in Table 4 and Figure 5 show the following in terms of comprehensive evaluation indices. In precision, recall, and F1 score, the BERT-BiGRU-CRF model proposed in this article improves by 2.9%, 5.0%, and 3.95%, respectively, over the BiGRU-CRF model; the only difference between the two models is the BERT embedding, which shows that BERT embedding can improve entity recognition. Compared with the BiLSTM-CRF model, the improvements are 3.14%, 4.40%, and 4.34%, respectively; compared with the BERT-BiLSTM-CRF model, 1.25%, 0.77%, and 1.01%, respectively. Therefore, P, R, and F1 all improve over the baseline models, indicating that the BERT-BiGRU-CRF model is better suited to electronic medical record recognition in the CVD field. This is mainly due to the stronger feature extraction ability of the BERT embedding, which enables word vectors to fuse context information; moreover, the BiGRU-CRF model can use bidirectional information before and after each position in the sequence, which effectively avoids entity ambiguity. Figure 6 compares, by entity type, the recognition effects of the various entities of the proposed model against BiLSTM-CRF, BERT-BiLSTM-CRF, and BiGRU-CRF. On disease entities, the improvements are 9.87%, 2.73%, and 9.63%, respectively; on symptom entities, 1.62%, −0.33%, and 3.29%; on body part entities, 2.85%, 0.45%, and 3.10%; on examination entities, 0.76%, −0.25%, and 0.49%; and on treatment entities, 3.8%, 2.46%, and 3.29%, respectively. Comparing the overall recognition of the different entity types, recognition of examination entities exceeds the comparison models, with an F1 score reaching 90%.
However, recognition of treatment entities is relatively poor, because these entities are relatively long and their boundaries cannot be clearly identified. In short, the recognition effect of the BERT-BiGRU-CRF model proposed in this paper is higher than that of the control group.

Model Training Time.
Model training is the process of updating parameters.
This article analyzes the relationship between epoch and F1 score for the four models over the first 10 epochs. As can be seen from Figure 7, the F1 score of the neural network models without BERT rises continuously from a lower level, while the F1 score of the models with BERT remains at a higher level and requires fewer iterations to reach the optimal F1 score. In addition, the F1 score of the BERT-BiGRU-CRF model is the highest overall. Regarding training time, Table 5 lists the time required per iteration for each model. One training round of the BERT-BiGRU-CRF model is 37 seconds shorter than that of the BERT-BiLSTM-CRF model, owing to the simpler structure and higher computational efficiency of the BiGRU-CRF model. In addition, comparing the BiGRU-CRF and BERT-BiGRU-CRF models, it is worth noting that adding BERT with whole-word masking to the neural network model improves the overall training efficiency.
In summary, the BERT-BiGRU-CRF entity recognition model proposed in this paper has a better recognition effect than the control group. The model makes full use of context information, further avoids ambiguity, and effectively avoids overlap between entities; moreover, the small word segmentation granularity used in this article improves the accuracy of entity recognition.

Entity Recognition Result.
This paper uses the BERT-BiGRU-CRF named entity recognition model to identify 9,393 entities (without deduplication). Among them, the electronic medical records contain the most descriptions of body parts, followed by symptom and examination entities, while treatment and disease entities are fewer. The specific results are shown in Figure 8.

Conclusions
For the text data of electronic medical records of cerebrovascular diseases, this paper proposes a BERT-BiGRU-CRF entity recognition model to identify five key entity types in the field of cerebrovascular diseases: disease, symptom, body part, examination, and treatment. The model obtains word vectors combined with context information through the BERT layer and then obtains the optimal annotation sequence through the BiGRU-CRF neural network. It not only keeps the network structure simple and training fast but also resolves ambiguity by combining context information. In future work, on the one hand, we will study the construction of high-quality dictionaries; on the other hand, we will extract the relationships between different entities based on NER to construct a knowledge graph in the field of cerebrovascular diseases, which will help mine further latent information from electronic medical records in the CVD field.

Data Availability
The Chinese electronic medical record data used to support the findings of this study have been deposited in the Ai'ai medical repository.

Conflicts of Interest
The authors declare that they have no conflicts of interest.