Chinese Medical Entity Recognition Model Based on Character and Word Vector Fusion

The medical information carried in electronic medical records has high clinical research value, and medical named entity recognition is the key to extracting valuable information from large-scale medical texts. At present, most studies on Chinese medical named entity recognition are based on either a character vector model or a word vector model. Owing to the complexity and specificity of Chinese text, the existing methods may fail to achieve good performance. In this study, we propose a Chinese medical named entity recognition method that fuses character and word vectors. The method expresses Chinese texts as character vectors and word vectors separately and fuses their features in the model. The proposed model can effectively avoid the problems of missing character vector information and inaccurate word vector partitioning. On the CCKS 2019 dataset for the named entity recognition task of Chinese electronic medical records, the proposed model achieves good performance and can effectively improve the accuracy of Chinese medical named entity recognition compared with other baseline models.


Introduction
In recent years, with the significant improvement of informatization, the development of medical and health knowledge Q&A platforms [1], and the widespread application of medical information systems [2], large amounts of biomedical information are stored as unstructured electronic text and cannot be fully utilized. For these massive medical data, it is important to efficiently identify and integrate the knowledge entities they contain in order to build medical knowledge graphs, provide accurate medical knowledge Q&A, and perform medical knowledge reasoning [3].
Named entity recognition (NER) [4] is a vital part of natural language processing (NLP) [5]. Its purpose is to recognize various named entities, for example, names, places, and organizations, from raw text. Extracted entities can pave the way for other NLP tasks, such as relationship extraction [6] and knowledge graph construction [7].
Recently, with the rise of deep learning technology, deep neural networks have been utilized to achieve medical NER and have attracted much research attention. Using named entity recognition technology to recognize entities in the medical text is a fundamental step in transforming medical text into structured data. It also lays the foundation for mining and exploiting the rich knowledge contained in medical texts [8].
At present, Chinese medical entity recognition [9] still faces significant challenges for the following main reasons. First, considering the specificity of medical texts, there is no uniform set of nomenclature for clinical medical language [10]. Different medical personnel record and describe conditions in different ways, the medical field contains more new and uncommon words, and some drugs and symptoms have long and rare names. Second, a public Chinese medical text dataset is lacking due to the difficulty of obtaining and labeling medical texts [11]. Finally, the Chinese language is complex [12]: on the one hand, the Chinese language does not have spaces as separators as in the English language; on the other hand, the Chinese language structure is complex, and there are numerous nested and omitted statements.
For Chinese medical entity recognition, most of the existing studies are based on character vector-based models [13] or word vector-based models [14]. In previous NER research, the BiLSTM-CRF model [15], which combines a bidirectional long short-term memory (BiLSTM) network [16] with a conditional random field (CRF) layer [17], shows advanced performance and has become a prevalent architecture for various NER tasks. This architecture outperforms traditional methods in that it eliminates the inefficient and complex practice of manually designing feature templates and instead utilizes a recurrent neural network (RNN) [18] to automatically capture text features. However, the entities in Chinese medical texts are relatively long and complex. Character vector-based Chinese medical entity recognition models use characters as the input granularity, but characters do not contain as much semantic information as words.
The word vector-based method segments the Chinese medical text first, but the existing word segmentation tools cannot achieve complete accuracy, which causes error propagation and thus affects the effectiveness of the model. Therefore, a model based on a single character vector or word vector will perform poorly. This paper proposes a method based on character and word vector fusion for Chinese medical entity recognition. Our contributions are summarized as follows: (1) We propose a Chinese named entity recognition method that fuses Chinese character vectors and word vectors in the model, and we apply it to Chinese medical entity recognition, achieving good performance. (2) We adapt the ELMO model for the character vector and word vector input branches, respectively, and obtain the corresponding input vectors. (3) We propose a fusion method for the character and word vectors obtained after the computation of the BiLSTM layer. The fused feature vectors can better represent the text information. The remainder of this paper is organized as follows. Section 2 discusses related works; Section 3 presents our framework in detail; Section 4 describes our experiments, results, and discussion; Section 5 concludes this paper.

Related Work
The research on Chinese medical entity recognition began later than its English counterpart. Influenced by foreign medical entity recognition evaluation conferences, related Chinese organizations have also organized evaluation tasks for Chinese medical entity recognition. The most influential is the China Conference on Knowledge Graph and Semantic Computing (CCKS) [19]. The CCKS evaluation tasks include subtasks related to Chinese medical entity recognition, and the China Health Information Processing Conference (CHIP) also hosts evaluation tasks related to medical entity identification. These conferences have greatly promoted research on Chinese medical entity recognition [20]. The existing research approaches for Chinese medical entity recognition can be classified into three main categories, namely, rule- and dictionary-based methods [21], machine learning-based methods [22], and deep learning-based methods [23].
Rule- and dictionary-based methods were applied in early medical entity recognition systems [24]. The rule-based method mainly relies on the manual formulation of heuristic rules. However, due to the complexity and diversity of entities in medical texts, it is often impossible to enumerate all the rules, which ultimately leads to poor recognition results [25]. The dictionary-based method mainly uses existing medical dictionaries for recognition, matching all the entities in the medical text against those dictionaries [26]. However, new words constantly appear in medical texts, and no dictionary can contain all medical entities, so named entities outside the dictionary cannot be recognized, again leading to poor recognition results [27].
Machine learning-based named entity recognition is ultimately a classification method [28], but there are two ways to perform it. One is to identify all the named entity boundaries in the medical text first and then classify these entities [29]. The second is the sequence-labeling method, which aims to find a suitable label for each element of a sentence sequence. The conditional random field (CRF) is the most commonly used model; it solves the label bias problem by using global normalization [30]. Lee and Lu [31] developed a medical named entity recognition model based on CRF and rule-based text attention rules, and the model achieved good performance in recognizing medical named entities in the admission records of stroke patients. Lei et al. [4] used 400 admission records and 400 discharge summaries from Beijing Union Medical Hospital to conduct medical entity recognition research, in which several machine learning methods, that is, support vector machines [32], maximum entropy [33], and conditional random fields, were applied. The recognition of the discharge summaries achieved an F-value of 90.01%. Machine learning methods exhibit a good performance improvement over rule- and dictionary-based methods, but due to the characteristics of language, migrating models between domains can cause significant performance problems. In addition, machine learning-based methods still need a large amount of labeled data to train models, which not only requires significant human involvement but also leaves much room for improvement in recognition accuracy.
Whether rule- and dictionary-based or machine learning-based, these methods require not only a great deal of manual labor and time but also a great deal of expertise in the medical field. In recent years, deep neural networks have been applied to natural language processing and have achieved quite good results in named entity recognition tasks [34]. Deep learning methods automatically extract representative features through neural networks, and various neural network architectures have achieved advanced performance in the field of medical entity recognition [35]. Yang et al. [36] combined the BiLSTM network and CRF for medical entity identification in admission records and discharge summaries. Xue et al. [37] introduced BERT [38], the language pretraining model released by Google, into joint learning, which greatly improved the feature representation of the shared parameter layer. Yin et al. [39] used a convolutional neural network to extract radical-level information of Chinese characters and fused an attention mechanism to capture the dependency relationships between characters, finally achieving excellent recognition results.
In summary, medical entity recognition methods based on deep learning are increasingly used in practical tasks due to their excellent performance. In this paper, we propose a medical entity recognition method that fuses character and word vectors within a deep learning framework. The method considers information at both the Chinese character and word dimensions, enabling the model to better represent Chinese text and thus improving the performance of Chinese medical entity recognition models.

Model Structure
The structure of the Chinese medical entity recognition model based on character and word vector fusion proposed in this paper is shown in Figure 1. The input layer consists of two main parts: one feeds Chinese character vectors into the model, and the other feeds Chinese word vectors into the model. The input layer is followed by the BiLSTM layer [40,41], where the Chinese character and word vectors are fed in together; the character-based and word-based output vectors of this layer are then spliced and processed in the next step. Finally, they are sent to the CRF layer [42] for final entity recognition. The components of each part are described separately as follows.

Input Layer.
The input layer of the Chinese medical entity recognition model based on character and word fusion proposed in this paper consists of two main parts. One converts the input text into character vectors; the other converts the input text into word vectors. In order to improve the performance of medical entity recognition, we use the pretrained ELMO model [43] to generate the required character and word vectors.
For the input of the character vector branch, we use a corpus of existing character-level Chinese medical text datasets to generate the character vector corresponding to each character via the ELMO [44] pretraining model. When a sequence c_1, c_2, ..., c_n containing n characters is input, the corresponding character vectors e_c1, e_c2, ..., e_cn are sent as input to the BiLSTM for subsequent calculation.
For the input of the word vector branch, we first use the word segmentation tool jieba [45], and then we use a corpus of existing word-level Chinese medical text datasets to generate the word vector corresponding to each word via the ELMO pretraining model [46]. In order to facilitate subsequent calculation and fusion in the model, the character vectors generated by this method have the same dimensionality as the word vectors.
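To make the segmentation step concrete, the sketch below implements forward maximum matching over a tiny hypothetical vocabulary; the actual pipeline uses jieba, which applies a much richer dictionary and statistical model (and can be extended with a medical dictionary via `jieba.load_userdict`).

```python
def fmm_segment(text, vocab, max_len=4):
    """Forward-maximum-matching word segmentation sketch.

    Greedily takes the longest dictionary word starting at each
    position; unmatched characters become single-character tokens.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in vocab:
                words.append(cand)
                i += size
                break
    return words

# Hypothetical medical vocabulary for illustration only.
vocab = {"左下肺", "感染"}
print(fmm_segment("左下肺感染", vocab))  # → ['左下肺', '感染']
```

In practice the segmenter's output determines which word vectors are looked up, which is why segmentation errors propagate into the word vector branch.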

BiLSTM Layer.
The basic idea of BiLSTM is to obtain contextual information from the input sequence through two LSTM networks. BiLSTM applies a forward LSTM network and a backward LSTM network to each training sequence, and the two LSTM networks are connected to the same output layer [47]. The output at time t concatenates the hidden state of the forward LSTM with that of the backward LSTM. The BiLSTM network structure is shown in Figure 2.
In this paper, we use two independent BiLSTM structures to compute and process the character vector-based input and the word vector-based input, respectively. After the BiLSTM computation, we obtain two sets of feature vectors, that is, character-based feature vectors and word-based feature vectors, which are sent to the next layer for vector splicing and fusion.
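The bidirectional pairing of states can be illustrated with a toy recurrence standing in for a real LSTM cell; only the forward/backward concatenation pattern matters here, and the running-mean "cell" is an assumption for readability, not an LSTM.

```python
def run_rnn(seq, reverse=False):
    """Toy recurrence standing in for one LSTM direction: each state
    is the running mean of the inputs seen so far (illustrative only)."""
    xs = list(reversed(seq)) if reverse else seq
    hs, acc = [], 0.0
    for t, x in enumerate(xs, 1):
        acc += x
        hs.append(acc / t)
    return list(reversed(hs)) if reverse else hs

def bilstm_features(seq):
    """Pair forward and backward states per position: h_t = [h_fwd_t ; h_bwd_t]."""
    fwd = run_rnn(seq)
    bwd = run_rnn(seq, reverse=True)
    return [(f, b) for f, b in zip(fwd, bwd)]

print(bilstm_features([1.0, 2.0, 3.0]))
# → [(1.0, 2.0), (1.5, 2.5), (2.0, 3.0)]
```

Note that the forward state at position t summarizes the left context and the backward state the right context, which is exactly what a real BiLSTM provides to the downstream layers.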

Fusion Layer.
The main purpose of the fusion layer is to connect the features extracted from the character-level BiLSTM and the word-level BiLSTM and perform feature fusion, so that the final CRF layer can use both Chinese character vector information and word vector information to improve recognition accuracy [48,49]. The model in this paper obtains two different feature vectors after the embedding of the input layer and the feature extraction in the BiLSTM layer. The two feature vectors are recorded as the character-based feature vectors h_cn and the word-based feature vectors h_wn. In order to explore the best way to fuse character and word vectors, this paper compares the following three fusion methods and finally determines the most effective one for subsequent experiments: (a) Direct addition of character and word vectors. Since the input layer of this model generates character and word vectors of the same dimension, the feature vectors obtained after the BiLSTM layer also have the same dimension, and the new fused feature vectors can be obtained by adding the character and word feature vectors directly. The specific calculation process is shown in (1) as follows [50]: h_i = h_ci + h_wi. (1)

Scientific Programming
In (1), h_ci represents the output feature vector of the character at position i in a sentence after the BiLSTM layer, and h_wi denotes the output feature vector of the word covering that position after the BiLSTM layer. (b) Direct splicing of character and word vectors. The character and word feature vectors are concatenated into a new feature vector, as shown in (2): h_i = con(h_ci, h_wi). (2) In (2), h_ci and h_wi are defined as above, and con represents the splicing (concatenation) operation of the vectors. (c) Use of a feed-forward neural network to weight the character and word vectors. The feature vectors h_ci and h_wi obtained from the BiLSTM layer are first added together, and then the result is weighted by a feed-forward neural network. The specific operation process is shown in (3) as follows [51]: h_i = w(h_ci + h_wi) + b. (3) In (3), h_ci and h_wi are defined as above, and w and b stand for the parameters that must be learned in the neural network.
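The three fusion strategies can be sketched on toy feature vectors as follows; `weight` and `bias` stand in for the learned feed-forward parameters w and b and are scalars here purely for illustration.

```python
def fuse_add(h_c, h_w):
    # (a) element-wise addition: requires equal dimensions
    return [c + w for c, w in zip(h_c, h_w)]

def fuse_concat(h_c, h_w):
    # (b) direct splicing (concatenation): dimension doubles
    return h_c + h_w

def fuse_weighted(h_c, h_w, weight, bias):
    # (c) feed-forward weighting of the summed vector:
    #     h = w * (h_c + h_w) + b, with w and b learned in practice
    return [weight * (c + w) + bias for c, w in zip(h_c, h_w)]

h_c, h_w = [0.5, 1.0], [1.5, 2.0]
print(fuse_add(h_c, h_w))       # → [2.0, 3.0]
print(fuse_concat(h_c, h_w))    # → [0.5, 1.0, 1.5, 2.0]
print(fuse_weighted(h_c, h_w, 0.5, 0.1))
```

Note the trade-off visible even in this sketch: addition and weighting preserve the dimension (so downstream layers stay the same size), while concatenation doubles it but keeps the character and word information separated.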
As the experimental results discussed in Section 4.4 show, the best results are obtained by directly splicing the feature vectors, so all subsequent experiments in this paper use this fusion strategy.
Next, we use an example to describe the specific process of character and word vector fusion. For the input sentence "左下肺感染" (infection of the lower left lung), in the branch based on the character vector, after the ELMO pretraining model, we obtain five character vectors expressed as e_c1, e_c2, e_c3, e_c4, and e_c5. These five character vectors are input to the BiLSTM layer for subsequent training to obtain five feature vectors h_c1, h_c2, h_c3, h_c4, and h_c5. In the branch based on the word vector, the input text is segmented first; the segmentation results are "左下肺" and "感染," and the two word vectors are expressed as e_w1 and e_w2, respectively. After the BiLSTM layer, we also obtain the two corresponding feature vectors h_w1 and h_w2.
Next, the two sets of feature vectors obtained are spliced to obtain the final feature vectors. Based on the experiments, we use direct splicing to combine the feature vectors obtained from the character and word branches. The calculation process is shown in (4) as follows: h_i = con(h_ci, h_wi). (4) In (4), h_ci represents the output feature vector of the character at position i in a sentence after the BiLSTM layer, h_wi denotes the output feature vector of the word covering that position after the BiLSTM layer, and con represents the splicing operation of the vectors.
After this calculation, we obtain the five corresponding final feature vectors h_1, h_2, h_3, h_4, and h_5, which fuse the Chinese character and word information, and the fused feature vectors are sent to the CRF layer for processing to obtain the final medical entity recognition results.
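One way to realize this character/word alignment, assuming each word's feature vector is repeated across the characters it covers (as the five fused vectors in the example suggest), is sketched below with one-dimensional toy features; the feature values are illustrative.

```python
def align_and_fuse(char_feats, word_feats, words):
    """Splice character-level features with word-level features.

    Each word's feature vector is repeated for every character it
    covers, so position i gets h_i = con(h_ci, h_wi).
    """
    fused, w_idx, consumed = [], 0, 0
    for h_c in char_feats:
        if consumed == len(words[w_idx]):  # move to the next word
            w_idx += 1
            consumed = 0
        fused.append(h_c + word_feats[w_idx])  # list concatenation
        consumed += 1
    return fused

# "左下肺感染" → 5 characters, segmented into "左下肺" and "感染"
char_feats = [[1.0], [2.0], [3.0], [4.0], [5.0]]
word_feats = [[10.0], [20.0]]
print(align_and_fuse(char_feats, word_feats, ["左下肺", "感染"]))
# → [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 20.0], [5.0, 20.0]]
```

The fused sequence has one vector per character, so the CRF layer can still emit one label per character as in a standard character-level tagger.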

CRF Layer.
LSTM only considers the long-term dependency information of the sentence but not the dependency relationship between tags [52]. For example, in medical entity labeling, an I-disease tag cannot directly follow an O tag. CRF can ensure the validity of tags by learning the adjacency relationships between tags. The CRF layer introduces the transition matrix W as a parameter, and maximum likelihood estimation is used to train the CRF layer. Thus, for the sentence X, the score of a labeling sequence Y = (y_1, y_2, ..., y_n) and its probability are score(X, Y) = Σ_{i=1}^{n} P_{i, y_i} + Σ_{i=2}^{n} W_{y_{i-1}, y_i} (5) and p(Y|X) = exp(score(X, Y)) / Σ_{Y'} exp(score(X, Y')). (6) In (5) and (6), P_{i,j} represents the probability value of the i-th character being classified to the j-th label, and W_{i,j} expresses the state transition score from the i-th label to the j-th label. X stands for the input sentence (x_1, x_2, ..., x_n), Y stands for the sentence tag sequence (y_1, y_2, ..., y_n), and exp stands for the exponential function of the natural constant e.

Dataset.
The dataset used for Chinese medical entity recognition in this study is from the subtask of CCKS 2019 evaluation task 1, "Medical entity recognition and attribute extraction for Chinese electronic medical records," and the EMRs were manually annotated by a professional medical team. After the team members agreed on a unified standard, they annotated each entity in detail with its predefined category, start index, and end index. The sequence numbers of the characters at the start and end indices need to be counted by the annotator. Each annotated text contains two parts: the raw text and the annotation information. The annotation information consists of several triples, each formed of an entity start index, an entity end index, and an entity category. Through the entity start and end indices, we can extract the entity from the annotated text. Annotation tools can also be used to generate character sequence numbers directly, saving part of the manual effort. Common annotation tools include YEDDA [53] and the text annotation tool of the Jingdong Zhongzhi Wise open annotation platform, which can be used directly online without downloading and has a very user-friendly interface.
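Given the triple format described above, entity mentions can be recovered from the raw text as below; the span indices and category labels are illustrative, and whether the end index is inclusive or exclusive depends on the annotation guideline (exclusive is assumed here, matching Python slicing).

```python
def extract_entities(raw_text, triples):
    """Recover entity mentions from (start, end, category) triples.

    Assumes the end index is exclusive; annotation schemes with
    inclusive ends would need end + 1 in the slice.
    """
    return [(raw_text[start:end], category)
            for start, end, category in triples]

text = "左下肺感染"
triples = [(0, 3, "anatomy"), (3, 5, "disease and diagnosis")]
print(extract_entities(text, triples))
# → [('左下肺', 'anatomy'), ('感染', 'disease and diagnosis')]
```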
The dataset used in this study covers six predefined medical entity categories: disease and diagnosis, imaging examination, laboratory test, operation, drug, and anatomy. The annotated dataset was divided into a training set and a test set. The training set contains 1000 EMRs, and the test set contains 379 EMRs. The statistics for the six types of entities are presented in Table 1.
In this study, we choose the BIO (begin, inside, and outside) annotation scheme to annotate the training and test data, where the specific formats are B-X, I-X, and O. B represents the character at the start position of a medical entity, I represents a character in the remaining part of a medical entity, and O represents a character outside any medical entity. X represents the medical entity categories, that is, disease and diagnosis, imaging examination, laboratory test, operation, drug, and anatomy, which are denoted as DISE, IMAG, LABO, OPER, DRUG, and ANAT, respectively. Thus, these six types of medical entities yield a total of 13 different labels in this task, and the labeling symbols and examples for the various types of medical entities are presented in Table 2.
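The conversion from annotated spans to BIO labels can be sketched as follows; the spans chosen for the example are illustrative.

```python
def to_bio(text_len, entities):
    """Generate BIO labels from (start, end_exclusive, type) spans.

    B-X marks the first character of an entity of type X, I-X the
    remaining characters, and O any character outside an entity.
    """
    labels = ["O"] * text_len
    for start, end, etype in entities:
        labels[start] = "B-" + etype
        for i in range(start + 1, end):
            labels[i] = "I-" + etype
    return labels

# A 5-character text with an ANAT span (0,3) and a DISE span (3,5)
print(to_bio(5, [(0, 3, "ANAT"), (3, 5, "DISE")]))
# → ['B-ANAT', 'I-ANAT', 'I-ANAT', 'B-DISE', 'I-DISE']
```

With six categories this produces the 12 B-/I- labels plus O, matching the 13 labels used in this task.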
Although the CCKS 2019 dataset used in this paper was manually annotated by a professional medical team, problems inevitably exist, such as inconsistent entity category labeling and incorrect start and end index labeling, and these errors propagate and degrade the performance of named entity recognition. In the data preprocessing stage, we therefore manually correct the mislabeled entities, thereby improving the effectiveness of entity recognition.

Evaluation Criteria.
We use precision (P), recall (R), and F1 value (F1) to evaluate the performance of the model proposed in this paper. The calculations are as follows: P = TP / (TP + FP), (7) R = TP / (TP + FN), (8) F1 = 2 × P × R / (P + R). (9) In equations (7)-(9), TP indicates the number of named entities correctly recognized, FP indicates the number of spans incorrectly recognized as named entities, and FN indicates the number of true named entities that were not recognized.
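The three metrics in (7)-(9) can be computed from entity-level counts as below; the counts in the example are invented for illustration.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from entity-level counts,
    guarding against division by zero for empty predictions."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# e.g. 80 correct entities, 20 spurious, 20 missed
print(prf1(80, 20, 20))  # → precision 0.8, recall 0.8, F1 ≈ 0.8
```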

Parameter Setting.
The main focus of this study is the impact of the character and word vector fusion method on the performance of Chinese medical entity recognition models. During the experiments, it was found that changes in certain parameters also had a significant effect on the results. Therefore, the experiments in this paper first explored the effects of some model parameters on the results, and then we explored the research topic of this paper by comparing the experimental results of different models. In this section, we empirically use direct splicing of character and word vectors for all parameter tuning experiments. The number of epochs in a deep neural network requires manual tuning, and different tasks often call for different settings. Too few epochs can prevent the model from converging, resulting in underfitting. Conversely, too many epochs increase the training time and can cause overfitting, reducing the model's generalization ability. In order to explore the influence of the number of epochs on model performance, Figure 3 plots the evaluation indices of the proposed model against the number of iterations. As can be seen from the figure, the model converges quickly within the first seven iterations, after which its performance improves slowly as the number of iterations increases. After approximately 15 iterations, the performance varies only within a small range, so the final number of iterations selected in this paper is 25. The learning rate is also an important parameter that affects the performance of a deep learning model. A large learning rate prevents convergence, while a small learning rate leads to particularly slow convergence or a failure to learn.
In order to investigate the effect of learning rate on model performance, the learning rates of 0.0005, 0.0008, 0.001, 0.003, 0.005, and 0.008 were selected empirically for the experiments.
The experimental results are presented in Table 3, from which it can be observed that the model performs best when the learning rate is 0.005. So, the learning rate selected in this paper is 0.005.
After continuous experiments and a comprehensive evaluation of the results, the best parameter values for the experiments in this paper were finally determined. The parameters of the character vector-based and word vector-based recognition branches are set to the same values, and the specific parameter settings are presented in Table 4.

Fusion Strategy Test.
In order to explore the effect of fusion strategies on the experimental results, this section describes experiments on the three character and word fusion strategies mentioned above, after which the best one was chosen for subsequent experiments. The three fusion strategies to be tested are direct addition of character and word vectors (denoted as model-add), direct splicing of character and word vectors (denoted as model-con), and the use of a feed-forward neural network to weight the character and word vectors (denoted as model-wb).
The experimental results are presented in Table 5. According to the results, direct splicing of character and word vectors is the most effective of the three fusion strategies, so this strategy is used in the subsequent experiments.

Experimental Results.
In order to enable the model to achieve better performance, the optimal parameter combination obtained from the abovementioned experiments was used, and direct splicing of character and word vectors was selected as the fusion strategy. Table 6 presents the precision (P), recall (R), and F1 values (F1) of the six entity categories for the character and word vector fusion-based BiLSTM-CRF model (denoted CW-BiLSTM-CRF) proposed in this paper. From the table, we can observe that the recognition results differ across entity types, which is expected, because the number and length of each type of entity are different.
The F1 values on "Imaging examination" and "Drug" are smaller than those on the other types of entities. The F1 value on "Imaging examination" is only 79.65%, which may be caused by the small number of "Imaging examination" entities. The poor performance on the "Drug" entity is likewise caused by the small number of entities and may also be due to the fact that drug names often contain unusual and proprietary words. "Disease and diagnosis" has the highest number of entities among the six categories, but its precision, recall, and F1 value are relatively low, probably because entities in this category are generally long, for example, "慢性肾脏病5期" (stage 5 chronic kidney disease) and "两肺感染性病变" (infectious lesions in both lungs). The problem of incorrect boundary prediction therefore arises when predicting these entities, which leads to incorrect entity prediction.

Comparison with Other Models.
In order to evaluate the performance of the model proposed in this article, we conducted further experiments to compare our proposed model with several baseline models.

Comparison of Convergence Rate.
We monitored the change of F1 values for the five models with the number of epochs. The results are shown in Figure 4, from which it can be seen that the model proposed in this paper obtains higher F1 values at the beginning of training, converges faster than the other models, and generally outperforms them in F1 value. Each model converges quickly within the first seven epochs; after approximately 15 epochs, the performance fluctuates within a small range and remains essentially stable. Table 7 presents the experimental results of the model proposed in this paper alongside other baseline models. We can observe that the CW-BiLSTM-CRF model proposed in this paper achieves the best performance among all models. The C-LSTM-CRF model performs better than the C-LSTM model because the addition of the conditional random field provides better sequence features, which improves performance. The C-BiLSTM-CRF model performs better than the character vector-based C-LSTM-CRF model because the BiLSTM can capture contextual information from the input sequences. The model proposed in this paper improves the accuracy by 1.19% and the F1 value by 0.53% compared to the C-BiLSTM-CRF model, and its performance is also improved compared to that of the W-BiLSTM-CRF model. The proposed medical entity recognition model, which fuses both character and word information, can better represent the text, and its performance is improved to a certain extent compared to previous models that use a single character vector or word vector.

Comparison of Each Category.
For further comparison, we experimented with the six categories of entities separately. The final results are presented in Table 8. From the results in the table, it can be noted that the precision of the model proposed in this paper is the highest, except for "Disease and diagnosis" and "Operation." The only category on which its F1 value is not the highest is "Imaging examination," where its result is slightly lower than that of the C-BiLSTM-CRF model. The model proposed in this paper combines character and word vectors and thus needs more data to better extract the information in medical texts.
Owing to the small number of "Imaging examination" entities, the model proposed in this paper cannot perform well on them. However, it performs very well on the rest of the entities and is able to recognize medical entities more accurately, which leads to the conclusion that our model still has the best overall recognition effect compared to the rest of the models.

Conclusions
In this study, we propose a Chinese medical entity recognition model based on character and word vector fusion that has a significant impact on improving the accuracy of Chinese medical entity recognition. The key to this model is to express the Chinese text as character and word vectors separately in the input layer, then fuse the two feature vectors obtained after the BiLSTM layer, and use the fused feature vectors to obtain the final recognition results. This approach avoids the problems of missing information in character vectors and inaccurate segmentation in word vectors. On the CCKS 2019 named entity recognition task for Chinese EMRs, the model proposed in this study achieves better performance than other baseline models.
In the future, we will explore the effectiveness of the character and word vector fusion strategy in other advanced models, so as to further improve the performance of Chinese medical entity recognition. Alternatively, by incorporating information from other dimensions, the model may achieve better recognition performance on longer entities; the character and word vector fusion model can also be applied to datasets from other domains to verify its generality.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.