Chinese Clinical Named Entity Recognition with ALBERT and MHA Mechanism

Traditional clinical named entity recognition methods fail to balance the effectiveness of feature extraction of unstructured text and the complexity of neural network models. We propose a model based on ALBERT and a multihead attention (MHA) mechanism to solve this problem. Structurally, the model first obtains character-level word embeddings through the ALBERT pretraining language model, then inputs the word embeddings into the iterated dilated convolutional neural network model to quickly extract global semantic information, and decodes the predicted labels through conditional random fields to obtain the optimal label sequence. Also, we apply the MHA mechanism to capture intercharacter dependencies from multiple aspects. Furthermore, we use the RAdam optimizer to boost the convergence speed and improve the generalization ability of our model. Experimental results show that our model achieves an F1 score of 85.63% on the CCKS-2019 dataset—an increase of 4.36% compared to the baseline model.


Introduction
Clinical named entity recognition (CNER) is a fundamental task in medical natural language processing. It aims to identify the clinical entities mentioned in electronic medical records (EMRs) and classify them into predefined categories (e.g., disease, symptom, and treatment). Extracting named entities from semistructured or unstructured EMRs also supports further research, such as building clinical decision support systems and medical knowledge graphs.
Recent advances in deep learning (DL) have led to strong performance in natural language processing, and researchers have adopted DL methods for biomedical tasks [1][2][3][4]. Compared with traditional rule-based and dictionary-based methods or machine learning (ML) methods [5][6][7], DL methods generalize better and rely less on rule design or feature engineering. In particular, the bidirectional long short-term memory with conditional random field (BiLSTM-CRF) method [8,9] has achieved significant results in CNER [10][11][12]. However, the word-level BiLSTM model cannot solve the problem of error propagation caused by incorrect entity boundary recognition, nor can it make full use of the parallelism of the graphics processing unit (GPU). In addition, the entities in Chinese EMRs have a unique and rigorous language structure [13], which makes Chinese CNER more challenging.
To solve the above problems, Strubell et al. [14] proposed an iterated dilated convolutional neural network (IDCNN) model for named entity recognition, which improved both training speed and accuracy. Gao et al. [15] used an attention-based IDCNN-CRF model for the CNER task and demonstrated the effectiveness of combining word order features and local context. However, this approach does not effectively integrate the contextual semantic information of a sentence, nor does it accurately represent polysemous words. Li et al. [16] proposed the BERT-BiLSTM-CRF model, which incorporated dictionary features and radical features of Chinese characters to improve model performance. However, the model's stringent requirements for dictionary quality and storage space limit its performance in practical scenarios. Fang et al. [17] developed an end-to-end neural network based on a multi-head attention (MHA) mechanism and two hint mechanisms for the joint extraction of entities and relations. The model outperformed the state-of-the-art methods of joint entity and relation extraction.
For the Chinese CNER task, we propose the ALBERT-IDCNN-MHA-CRF model. This paper's main contributions are as follows: (1) We fine-tune the ALBERT pretraining model to enhance the semantic representation. (2) We use the IDCNN model to encode the global information of the entity and speed up the training process. (3) We use a multi-head attention mechanism to capture the context information. (4) We use the RAdam optimizer to boost the convergence speed and improve the model's generalization ability. (5) The evaluation results show that our model achieves good performance on the CCKS-2019 dataset.

Related Work
At present, the methods for the CNER task are divided into three categories: rule-based and dictionary-based methods, ML-based methods, and DL-based methods [18]. Rule-based and dictionary-based methods were mainly used in early CNER systems and related applications. They rely only on existing dictionaries and manually constructed rules, which leads to long system development cycles and poor portability for the complex and diverse entities in EMRs. In contrast, ML-based methods are more versatile: they regard the CNER task as a sequence labeling problem and use a large-scale corpus to label each position of the sentence. Classical ML methods such as the hidden Markov model, maximum entropy Markov model, support vector machine, and CRF are widely used in the CNER task. Nevertheless, constructing a large-scale labeled corpus in the early stage is costly, and the high dependence on manual feature engineering is time-consuming.
Recently, methods based on DL have been successfully applied to the CNER task. The BiLSTM-CRF method achieved the most advanced performance on many CNER datasets. However, the sequential computation in the LSTM model cannot be parallelized efficiently, and it is difficult to capture long-term dependencies between characters in long sentences. For large-scale electronic medical record corpora, this results in high model complexity and slow training. Therefore, researchers have attempted to use CNN methods to capture contextual semantic information effectively while taking full advantage of GPU parallelism to improve model efficiency.
Unfortunately, the above DL-based methods fail to distinguish ambiguous characters or words. For example, the character "清" (clean) has completely different meanings in the two sentences "患者神志清、精神可" (the patient is conscious and in good spirits) and "于我院行淋巴结清扫术" (lymph node dissection in our hospital), but it is mapped to the same vector by static word embedding methods (such as Word2Vec). Such methods therefore cannot take the contextual semantic information of the sentence into account.
In recent years, many pretrained contextual word embedding models have been proposed, such as ELMo and OpenAI-GPT. However, these two pretraining models cannot simultaneously obtain the semantic information of the EMRs in both the forward and backward directions. Bidirectional encoder representations from transformers (BERT) solves this problem well. For the CNER task, we only need to set the downstream task interface and fine-tune the model on the relevant data to obtain a more accurate embedded representation of each word in the EMRs. Cai [19] first enhanced the semantic representation of characters through BERT, then fed the word embeddings into BiGRU-CRF for training, and achieved better performance. Zhang et al. [20] pretrained BERT on a corpus of Chinese clinical text and used the embeddings as input features of BiLSTM-CRF to solve the breast cancer CNER problem, achieving an F1 score of 93.53%.
BERT performs well in CNER, which mainly benefits from its "overparameterized" nature. However, with millions or even billions of parameters, its computational efficiency is low, which greatly hinders its application in practical CNER systems. Therefore, researchers have begun to study how to compress BERT with an acceptable tradeoff in performance to speed up training. Sun et al. [21] outlined a "patient knowledge distillation" method that compresses the model into a lightweight shallow network. Fan et al. [22] proposed LayerDrop, a structured dropout method, to train the transformer model. Without fine-tuning, they sampled subnetworks from the original model through a pruning strategy to generate a high-quality small BERT model. Shen et al. [23] proposed a new group-by-group quantization scheme and compressed the model with Hessian-based mixed-precision quantization. The ALBERT model proposed by Lan et al. [24] applied two parameter-reduction techniques to reduce memory consumption and improve the training speed of BERT while using a self-supervised loss to improve the training effect.

Evidence-Based Complementary and Alternative Medicine

Materials and Methods
Figure 1 illustrates the overall ALBERT-IDCNN-MHA-CRF network architecture of our model. First, for each Chinese character of the EMRs, the character, sentence, and position features are computed by ALBERT. Second, we concatenate the three embeddings and feed them into the IDCNN network to extract the global features, and then input the result to the MHA layer, which captures the long-distance dependencies between characters by calculating the attention probability of sentences from multiple aspects. Finally, we feed the output vector of the MHA layer into a CRF layer, which constrains the dependency relationship between the predicted labels and obtains the best label sequence. To improve the generalization ability of the model, we add a dropout layer between the embedding layer and the IDCNN layer.

Embedding.
Language modeling is a key concept in natural language processing tasks. While BERT performs outstandingly in CNER, its overparameterization leads to a large memory footprint and long training time. Compared with BERT, ALBERT makes improvements in three aspects: factorized embedding parameterization, cross-layer parameter sharing, and intersentence coherence loss. These changes remarkably reduce the total number of parameters and the model's complexity.
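To make the effect of factorized embedding parameterization concrete, the sketch below counts embedding parameters with and without the factorization. The vocabulary size (21,128) and hidden size (768) are illustrative assumptions and are not reported in this paper; the embedding size of 128 is the one given in the experimental settings. Bias terms are ignored.

```python
def full_embedding_params(vocab_size, hidden_size):
    """BERT-style embedding: a single V x H lookup table."""
    return vocab_size * hidden_size


def factorized_embedding_params(vocab_size, embed_size, hidden_size):
    """ALBERT-style embedding: a V x E lookup table plus an E x H projection."""
    return vocab_size * embed_size + embed_size * hidden_size


# Assumed sizes for illustration only (V and H are not taken from the paper).
V, H, E = 21128, 768, 128

full = full_embedding_params(V, H)
factorized = factorized_embedding_params(V, E, H)
print(f"full: {full:,}  factorized: {factorized:,}  ratio: {full / factorized:.2f}x")
```

Under these assumed sizes the factorized table is several times smaller, which is the kind of saving the factorization is designed to give when E is much smaller than H.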
For each word in the EMRs, the input representation of ALBERT consists of three parts: token embedding, segment embedding, and position embedding. Token embedding is the vector of a token, which in Chinese can be either a word or a character. Owing to the unique sublanguage characteristics and complex language structure of EMRs, we use character embedding for representation. Segment embedding is used to distinguish pairs of sentences. Position embedding encodes the position of each character in the sequence and is computed as

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \tag{1}$$

$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \tag{2}$$

where $pos$ is the position in the EMR, $i$ is the dimension, and $d_{\text{model}}$ is the vector dimension after encoding. Figure 2 shows an example of this input.
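As a rough illustration of the position embedding, the sketch below assumes the sinusoidal form that the symbols pos, i, and d_model suggest, with sine on even dimensions and cosine on odd dimensions. The function name and the table sizes are hypothetical choices for the example.

```python
import math


def position_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for k in range(d_model):
            i = k // 2  # index of the sine/cosine pair that dimension k belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# 128 positions (the maximum sequence length) and 128 dimensions, for illustration.
pe = position_encoding(max_len=128, d_model=128)
print(len(pe), len(pe[0]), pe[0][:2])
```

Each position receives a distinct vector whose values stay within [-1, 1], so it can be combined with the token and segment embeddings without dominating them.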

IDCNN.
To extract the text features of EMRs effectively while speeding up training and prediction, this study uses the IDCNN model for feature extraction. Dilated convolution was originally applied in the field of image processing. Unlike traditional CNNs, it inserts a dilation width between the elements of the convolution kernel and omits the pooling operation, which reduces information loss and enlarges the receptive field. The receptive field of dilated convolution is calculated as

Here, we use four identical blocks of dilated convolution. Each block has three dilated convolution layers with dilation widths of 1, 1, and 2. Thus, there are four iterations, where each iteration takes the previous result as the input. This parameter-sharing mechanism effectively prevents overfitting. As the number of layers increases, the receptive field increases exponentially, while the parameters increase linearly, so that the receptive field quickly covers the whole input sequence. In the IDCNN model, the parameters of each layer are independent and of the same scale, which effectively reduces the parameters during training and thereby speeds up the training.

Although the IDCNN can extract long-distance semantic features, these features share the same weight and cannot reflect the different degrees of correlation between characters. Hence, further feature extraction is required through the multi-head attention layer.
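To give a feel for how stacking dilated layers widens the context each character sees, the sketch below uses the generic rule for stride-1 convolutions: a layer with dilation d and kernel width k extends the receptive field by (k - 1) * d. This is not necessarily the paper's own receptive-field equation, and the kernel width of 3 is an assumption because the text does not state it.

```python
def receptive_field(dilations, kernel_width=3, iterations=1):
    """Receptive field (in characters) of stacked stride-1 dilated 1-D convolutions.

    Each layer with dilation d widens the field by (kernel_width - 1) * d, and
    the whole block is applied `iterations` times in sequence.
    """
    growth_per_block = sum((kernel_width - 1) * d for d in dilations)
    return 1 + iterations * growth_per_block


block = [1, 1, 2]  # dilation widths of one block, as described in the text

one_block = receptive_field(block)                  # a single pass through the block
four_blocks = receptive_field(block, iterations=4)  # the block iterated four times
doubling = receptive_field([1, 2, 4, 8])            # dilation doubling per layer

print(one_block, four_blocks, doubling)
```

With these assumptions, iterating a fixed block widens the field by a constant amount per iteration while reusing the same weights, and a schedule whose dilation doubles at every layer is what yields exponential growth with depth.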

MHA.
The entities in EMRs do not exist in isolation: there are specific dependencies among them, and the dependent characters are often separated by long intervals. For example, in the sentence "患者因胃癌于2015-5-19于我院行胃癌根治术, 术后恢复良好" (the patient underwent radical gastrectomy for gastric cancer in our hospital on May 19, 2015, and recovered well after the operation.), "胃癌" (gastric cancer) is a disease entity, and "胃癌根治术" (radical gastrectomy for gastric cancer) is an operation entity. These two entities often appear in the same EMR, suggesting a certain dependence between them.
To capture this dependency, the model has to pay more attention to the characters on which the current character depends, assigning higher weights to these characters and smaller weights to irrelevant ones, so as to better recognize the entity type of the character.
Here, we use the MHA model to perform multiple self-attention calculations in order to learn relevant information in different representation subspaces. The MHA model also offers better parallel computing performance than recurrent neural networks. Scaled dot-product attention in the model is defined as

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{4}$$

where the query $Q$, key $K$, and value $V$ are all in vector form, $1/\sqrt{d_k}$ is the scaling term determined by the key dimension $d_k$, and the softmax function normalizes the attention scores. We set $Q = K = V$ when calculating self-attention, which represents the characters in the sentence.
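In the CNER task, for an input sentence $X = (x_1, x_2, \ldots, x_n)$, the output after the IDCNN layer is $Y = (Y_1, Y_2, \ldots, Y_n)$. For the output state $Y_t$ of the $t$-th character in the sentence, the single-head self-attention calculation is performed using formula (4). A total of $h$ calculations are performed, and the result of the $i$-th calculation is $\text{head}_i$:

$$\text{head}_i = \text{Attention}\left(QW_i^{Q}, KW_i^{K}, VW_i^{V}\right) \tag{5}$$

where $W_i^{Q} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^{K} \in \mathbb{R}^{d_{\text{model}} \times d_k}$, and $W_i^{V} \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are the projection matrices of the $i$-th head. After concatenating the results of these $h$ calculations and performing a linear transformation, the result for the $t$-th character in the sentence is obtained, which is given by

$$\text{MultiHead}(Q, K, V) = \text{concat}(\text{head}_1, \ldots, \text{head}_h)\,W^{O} \tag{6}$$

where concat is the splicing function and $W^{O} \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ is the weight parameter.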
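The multi-head self-attention computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the weights are random, the sentence length of 6 is arbitrary, and the choice d_k = d_v = d_model / h is an assumption. The number of heads (4) and the feature size (128) follow the experimental settings.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the (n, n) attention weights."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


def multi_head_self_attention(Y, W_q, W_k, W_v, W_o):
    """Y: (n, d_model). W_q, W_k, W_v: lists of h projections. W_o: (h * d_v, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_q, W_k, W_v):
        # Self-attention: queries, keys, and values are all projections of Y.
        head, _ = scaled_dot_product_attention(Y @ Wq, Y @ Wk, Y @ Wv)
        heads.append(head)
    return np.concatenate(heads, axis=-1) @ W_o


rng = np.random.default_rng(0)
n, d_model, h = 6, 128, 4
d_k = d_v = d_model // h  # assumed per-head size

Y = rng.normal(size=(n, d_model))  # stand-in for the IDCNN output
W_q = [rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(scale=0.1, size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(scale=0.1, size=(h * d_v, d_model))

out = multi_head_self_attention(Y, W_q, W_k, W_v, W_o)
_, attn = scaled_dot_product_attention(Y @ W_q[0], Y @ W_k[0], Y @ W_v[0])
print(out.shape, attn.shape)
```

Each row of `attn` is a probability distribution over the characters of the sentence, which is how a head assigns larger weights to the characters that the current character depends on.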

CRF.
The output of the MHA layer is the probability or score of each label for each character in the sentence. Denote the scoring matrix by P. If the labels are modeled and output independently, the dependency between labels is ignored (for example, the "I-CHE" label cannot be immediately followed by the "B-DIS" label), even though this dependency is essential information for the decoding module. Therefore, we introduce the CRF layer for label decoding, which constrains the dependency relationship between predicted labels to decode the globally optimal label sequence.
For a given input sequence $X = (x_1, x_2, \ldots, x_n)$ and the corresponding label sequence $y = (y_1, y_2, \ldots, y_n)$, let $W$ be the transition matrix. The evaluation score is defined as

$$s(X, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} W_{y_i, y_{i+1}} \tag{7}$$

where $P_{i,j}$ is the score of the $i$-th character labeled as label $j$, and $W_{i,j}$ is the state transition score from label $i$ to label $j$. Given $X$, the conditional probability of the label sequence $y$ is calculated through the softmax function:

$$p(y \mid X) = \frac{\exp\left(s(X, y)\right)}{\sum_{\tilde{y} \in Y_X} \exp\left(s(X, \tilde{y})\right)} \tag{8}$$

where $Y_X$ is the set of all possible label sequences of sentence $X$. During training, we maximize the log likelihood of the correct label sequence:

$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} \exp\left(s(X, \tilde{y})\right) \tag{9}$$

While decoding, we predict the sequence of labels with the highest conditional probability and use the Viterbi algorithm to decode the optimal label sequence.

Here, we represent the entities with "BIO" (B-begin, I-inside, O-outside) tags in the following formats: B-X, I-X, and O. B represents the starting position of the medical entity, I represents the remaining part of the medical entity, and O represents the nonmedical entity. X is the type of medical entity, which can be DIS, EXA, TES, OPE, DRU, or ANA.
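The scoring and decoding steps of the CRF layer can be sketched in plain Python as below. This is a toy illustration rather than the authors' code: the three-label tag set, the emission scores P, and the transition scores W are made-up values, and start and end transitions are omitted for brevity.

```python
def sequence_score(P, W, y):
    """Score of label path y: emission scores plus transition scores along the path."""
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(W[a][b] for a, b in zip(y, y[1:]))
    return emit + trans


def viterbi(P, W):
    """Return the highest-scoring label path and its score."""
    n, k = len(P), len(P[0])
    best = list(P[0])  # best[j]: best score of any path ending in label j so far
    back = []          # back[t][j]: best previous label when position t + 1 has label j
    for i in range(1, n):
        prev, best, ptr = best, [], []
        for j in range(k):
            a = max(range(k), key=lambda p: prev[p] + W[p][j])
            ptr.append(a)
            best.append(prev[a] + W[a][j] + P[i][j])
        back.append(ptr)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for ptr in reversed(back):  # follow the back-pointers from the last position
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]


LABELS = ["O", "B-DIS", "I-DIS"]
FORBIDDEN = -1e4  # large penalty standing in for an invalid transition
W = [
    [0.0, 0.0, FORBIDDEN],  # from O: an I-DIS tag may not follow O
    [0.0, 0.0, 0.5],        # from B-DIS
    [0.0, 0.0, 0.5],        # from I-DIS
]
P = [  # hypothetical emission scores for a three-character sentence
    [0.2, 1.0, 0.1],
    [0.3, 0.1, 0.9],
    [1.2, 0.1, 0.3],
]

path, score = viterbi(P, W)
print([LABELS[j] for j in path], round(score, 2))
```

The transition matrix is what lets the decoder rule out label sequences that violate the BIO scheme, which independent per-character predictions cannot do.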

Experimental Settings.
Each clinical record may contain several sentences, so treating a record as a whole would produce overly long samples. Hence, we split each record at periods to restrict the sentence length. After splitting the records, we set the maximum sequence length to 128. The IDCNN consists of 128 filters, and the number of heads in MHA is 4. During training, we use the back-propagation algorithm and the Adam optimizer with an initial learning rate of $3 \times 10^{-5}$. The word embedding size is 128, and the activation function is ReLU. The batch size is 20, and the dropout rate is 0.5.

Table 2 lists the experimental results of the different models. The results show that our model's precision, recall, and F1 score are the highest among the compared models, with increases of 3.67%, 3.15%, and 3.42%, respectively, over the baseline model, verifying the effectiveness of our model. The F1 scores of the BiLSTM-CRF and IDCNN-CRF models are 81.27% and 81.49%, respectively, indicating that the recognition performance of the two models is comparable. However, the per-epoch running time of IDCNN is 21 seconds shorter, which demonstrates that IDCNN parallelizes better than BiLSTM. After adding the MHA layer, the F1 score increases by 0.99% (compared to 81.49%) and 1.33% (compared to 83.36%), respectively, which demonstrates the MHA's ability to extract contextual features. Also, replacing the traditional word vector model with fine-tuned ALBERT improves the F1 score by 1.87% (compared to 81.49%) and 2.21% (compared to 82.48%), respectively. This result further strengthens our confidence that ALBERT has better semantic representation ability and a more significant impact on the performance of the CNER task.
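The record-splitting step from the experimental settings can be sketched as follows. The sketch assumes that the period is the Chinese full stop "。" and that a sentence still longer than the limit is cut into chunks; neither detail is specified in the text, and the function name is hypothetical.

```python
import re


def split_record(record, max_len=128):
    """Split one clinical record into sentences of at most max_len characters."""
    sentences = []
    # The lookbehind splits after each full stop, so the stop stays with its sentence.
    for sentence in re.split(r"(?<=。)", record):
        sentence = sentence.strip()
        if not sentence:
            continue
        # Assumption: overly long sentences are chunked rather than truncated.
        for start in range(0, len(sentence), max_len):
            sentences.append(sentence[start:start + max_len])
    return sentences


record = "患者因胃癌于我院行胃癌根治术。术后恢复良好。"
for s in split_record(record):
    print(len(s), s)
```

Whether to chunk or truncate long sentences is a design choice; chunking keeps every character available for labeling but can cut an entity at a chunk boundary.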

Results and
In addition to observing the evaluation metrics on the test dataset, we take a closer look at the predicted results. Figure 3 reports the performance of the proposed model on different types of clinical entities. The plot reveals that the model performs well on drug and exam, reaching F1 scores of 92.62% and 89.66%, respectively, but fails to identify disease and anatomy effectively. Table 3 lists representative errors. First, these two types of entities are generally long, and supplementary information appears in parentheses, for example, "(左肝)肝细胞性肝癌(中度分化)" ((left liver) hepatocellular carcinoma (moderately differentiated)) and "腹主动脉旁淋巴结" (abdominal para-aortic lymph nodes). Therefore, when predicting this type of entity, boundary prediction errors occur, which lead to entity recognition errors. Second, some disease entities and operation entities are similar in text structure or are nested, resulting in misclassification. As an illustration, of "胃癌" (gastric cancer) and "胃癌根治术" (radical gastrectomy for gastric cancer), the former is a disease entity, while the latter is an operation entity. Third, the complex features of the two types of entities complicate the recognition.
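To see why a boundary error is costly, the sketch below computes entity-level precision, recall, and F1 from BIO tags. It assumes strict matching, where a predicted entity counts only if its type and both boundaries agree with the gold annotation; the paper does not state its matching criterion, and the tag sequences are invented for the example.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence; end is exclusive."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # the sentinel closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = extract_entities(gold_tags), extract_entities(pred_tags)
    tp = len(gold & pred)  # exact matches on type and both boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold = ["B-DIS", "I-DIS", "O", "B-ANA", "I-ANA", "I-ANA"]
pred = ["B-DIS", "I-DIS", "O", "B-ANA", "I-ANA", "O"]  # anatomy entity cut one character short

p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, f1)
```

Under strict matching, the truncated anatomy entity counts as both a false positive and a false negative, so a single boundary mistake lowers precision and recall at the same time.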

Comparison of Different Optimizers.
We run the above experiments with the Adam optimizer. We further explore the influence of the Adagrad [25], RMSprop [26], Lookahead [27]+Adam, and RAdam [28] optimizers on entity recognition. Table 4 presents how each optimizer improves the learning rate and gradient.
We apply the above optimizers to our model. Table 5 shows the experimental results, and Figure 4 shows how the accuracy changes during training.
The results indicate that optimizers that also dynamically adjust the gradient component outperform those that only dynamically adjust the learning rate. Compared with the Adam baseline, the performance is slightly improved after adding Lookahead, and convergence is faster, which verifies the effectiveness of its exploration and integration strategy. We obtain the best model with the RAdam optimizer, whose F1 score reaches 85.63%, an increase of 0.94% compared to Adam. The dynamic rectifier in RAdam adjusts Adam's adaptive momentum according to the variance and provides an automatic warm-up mechanism with regard to the dataset.

Table 6 lists the test results of other methods on the CCKS-2019 dataset [29]. The DUTIR team used the ELMo model to learn the contextual embedding representation of characters, identified medical entities through the BiLSTM-CRF network, and further improved model performance through transfer learning. The THU_MSIIP team used several different types of deep neural network models to introduce complementary multiaspect information and added a postprocessing model based on dictionaries and context models. The Alihealth team proposed a method based on BERT and model fusion and constructed a series of rules through frequent pattern mining. However, the weak generality of those rules limited the scope of application. With the RAdam optimizer, we achieve the best performance, with an F1 score of 85.63%, and outperform the other teams.
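The automatic warm-up that RAdam provides, mentioned in the optimizer comparison above, comes from a rectification term applied to the adaptive step. The sketch below computes that term as it is commonly formulated for RAdam; the value beta2 = 0.999 is an assumed default rather than a setting reported in this paper, and library implementations may use a slightly different threshold.

```python
import math


def radam_rectification(step, beta2=0.999):
    """Return the rectification term r_t, or None while the variance is intractable."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Too few steps to estimate the variance: the adaptive term is skipped
        # and the update falls back to a momentum-only step.
        return None
    return math.sqrt(
        (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
        / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
    )


for step in (1, 5, 100, 10000):
    print(step, radam_rectification(step))
```

The term starts small and approaches 1 as training proceeds, so early updates are damped without a hand-tuned warm-up schedule.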

Influence of the Number of Heads in MHA.
The MHA layer can extract features from multiple aspects because different heads extract different features. To explore the influence of the number of heads h, the most important hyperparameter of this layer, we set h to 1, 2, 4, 8, and 16. Figure 5 illustrates the results. Performance improves as h increases from 1, since the text features are not fully extracted when h is small. On the other hand, the model learns too much redundant information when h is too large, which harms entity recognition. Therefore, after tuning h, we obtain the optimal performance when h is equal to 4.

Conclusions
This paper proposes a named entity recognition method, ALBERT-IDCNN-MHA-CRF, for the Chinese CNER task. The ALBERT pretraining language model more accurately represents contextual semantics in EMRs. Encoding entities through IDCNN achieves better recognition results and improves the training speed. MHA captures rich semantic information in sentences. Furthermore, the RAdam optimizer benefits the performance. The proposed model achieves an F1 score of 85.63% on the CCKS-2019 dataset, superior to the state-of-the-art models. In future work, we will enrich the semantic representation of the embedding layer and introduce other features into the model. We will also consider the impact of nested entities to predict the boundaries of entities more accurately, thereby improving the overall entity recognition effect.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.