Research on Named Entity Recognition Based on Multi-Task Learning and Biaffine Mechanism

Commonly used nested entity recognition methods are span-based, focusing on learning the head and tail representations of entities. Such methods lack explicit boundary supervision, so correct candidate entities may fail to be predicted, leading to high precision but low recall. To address this problem, this paper proposes a named entity recognition method based on multi-task learning and a biaffine mechanism, dividing the task into two subtasks: entity span classification and boundary detection. The entity span classification task uses the biaffine mechanism to score the resulting spans and select the most likely entity class. The boundary detection task addresses the low recall caused by the lack of boundary supervision in span classification: it captures the relationship between adjacent words in the input text according to the context, indicates the boundary range of entities, and enhances the span representation through additional boundary supervision. Experimental results show that the proposed method improves the F1 value by up to 7.05%, 12.63%, and 14.68% over other methods on the GENIA, ACE2004, and ACE2005 nested datasets, respectively, verifying that it performs better on the nested entity recognition task.


Introduction
Named entity recognition tasks are mainly studied for flat entities and nested entities. In many named entity recognition corpora (e.g., GENIA [ref], ACE2004 [ref], and ACE2005 [ref]), entities may be nested, that is, one or more other entities occur inside an entity. As shown in Figure 1, the sentence "Note to exclude tuberculosis" contains only the flat entity "tuberculosis", while the entity "colon cancer" in the sentence "The patient has colon cancer" also includes the entity "colon", forming a nested structure. Because of their complex hierarchical structure, nested named entities are difficult for traditional sequence-labeling-based named entity models to handle directly and effectively. Therefore, more and more researchers have begun to pay attention to nested named entity recognition and have proposed models specifically suited to the task.
Sequence-based methods use traditional sequence labeling to learn nested structures. Ju et al. [1] proposed a stacked LSTM-CRF model that predicts nested named entities by dynamically stacking flat NER layers. Katiyar and Cardie [2] handled nested named entity recognition with a recurrent neural network-based method. Lu and Roth [3] introduced a hypergraph structure for learning nested named entities, and Wang and Lu [4] further proposed a neural segmental hypergraph for nested entity recognition. Span-based methods are another advanced approach to unified named entity recognition; the idea is to enumerate all possible spans and classify them. The span model of Li et al. [5] introduces a general framework in which several information extraction tasks share span representations using dynamically constructed span graphs. Sohrab and Miwa [6] enumerate all possible regions, or spans, of a latent entity and classify them with deep neural networks. Yu et al. [7] applied the idea of graph-based dependency parsing, providing the model with a global view of the input through a biaffine mechanism that scores pairs of start and end tokens in a sentence, exploring all spans so that the model can accurately predict named entities.
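The span enumeration idea behind these methods can be sketched in a few lines. This is an illustrative toy, not any cited model's code; the function name and `max_len` cutoff are assumptions for the example:

```python
def enumerate_spans(tokens, max_len=4):
    """List every candidate (start, end) span of up to max_len tokens.

    Span-based NER treats each such contiguous span as a candidate entity
    to be classified, which naturally covers nested entities.
    """
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_len, len(tokens))):
            spans.append((start, end, tokens[start:end + 1]))
    return spans

candidates = enumerate_spans(["the", "colon", "cancer"], max_len=2)
# nested candidates such as ["colon"] and ["colon", "cancer"] are both enumerated
```

Because every span is a candidate, nested mentions pose no structural problem; the cost is a quadratic number of candidates, which is why a length cutoff or classifier pruning is typically applied.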
However, these research methods each have problems. The traditional sequence labeling method identifies nested entities layer by layer, and errors in inner entity recognition directly lead to wrong identification of the outer entity, causing error propagation across nested entities. In hypergraph-based models, when the input sentence is too long or there are many entity categories, the hypergraph structure becomes complex, making parameter optimization difficult. The span-based method first identifies the head and tail spans of entities, constructs head-tail entity pairs, and then performs label classification: head-tail pairs are constructed from real labels during training, and the model predicts which words form head-tail pairs during testing.
This method easily detects nested entities in different subsequences. However, because it emphasizes learning head and tail representations, the model lacks explicit boundary supervision for entities and does not make effective use of entity boundary information. Many entity words are therefore not predicted, giving the model high precision but low recall and affecting the overall recognition effect. In addition, when an entity span is too long, the interaction information between the head and tail spans of the entity gradually decays, and this head-tail interaction problem is also ignored to a certain extent, which affects the recognition effect. Therefore, in view of the problems raised above, this paper proposes a named entity recognition model based on multi-task learning and the biaffine mechanism: (1) To enhance boundary supervision, in addition to using the biaffine model to classify the learned head and tail spans, the model adds an additional boundary detection task to predict which words are entity boundaries.
(2) The model captures the connection between adjacent words according to the context, trains the two tasks jointly under the framework of multi-task learning, and enhances the span representation through additional boundary supervision.
(3) The boundary detection module helps to generate high-quality span representations, more entity words are correctly predicted, and the recall rate of the model is improved, thereby improving the overall effect of the model.

Materials and Methods
In this work, we propose a model using Multi-Task Learning and a Biaffine Mechanism (MTL-BAM). In this model, a multi-task loss is applied to simultaneously train two parts, the boundary detection module and the entity span classification module. The MTL-BAM model consists of an embedding representation module, a shared feature representation module, and a multi-task learning module. The specific model structure is shown in Figure 2. The input of the model is a sentence, and the output is the entities in the sentence and the categories corresponding to those entities.
Next, the research content and implementation process are described in three parts: the design of the embedding representation module, the shared feature representation module, and the multi-task learning module.

Embedding Representation Module.
In the embedding representation module, in order to capture the features of the input text more comprehensively, three embedding methods are used: BERT, CharCNN, and FastText. The BERT [8] method obtains the contextual features of the sentence; the CharCNN [9] method obtains character-level text features; the FastText [10] method obtains word-level features of the sentence. The three embedding methods are described in detail below.

BERT Embedding.
BERT passes the words of each input sentence through the word embedding layer and converts them into vector representations. To keep the vector dimensions consistent, training texts of different lengths are padded to the same length before word embedding. In addition to word embedding, the input of BERT contains two further embedding layers: sentence embedding, which distinguishes whether the current word belongs to sentence A or sentence B, and position embedding, which encodes the relative position of each word and expresses the order in which words appear in the sentence. The input of BERT is the summation of these three embedding vectors, and the structure is shown in Figure 3. The output is a contextual representation that carries effective information, where X_lm_t represents the contextual embedding vector representation of the pretrained sentence at time t.
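The three-way summation that forms BERT's input can be sketched with toy lookup tables. This is a minimal numpy sketch with assumed toy sizes, not BERT's actual implementation; all table names and dimensions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, seq_len, dim = 100, 6, 8  # toy sizes, not the paper's settings

token_emb    = rng.normal(size=(vocab, dim))   # word embedding table
segment_emb  = rng.normal(size=(2, dim))       # sentence A / sentence B embedding
position_emb = rng.normal(size=(512, dim))     # position information embedding

token_ids   = np.array([5, 17, 42, 8, 0, 0])   # sentence padded to seq_len
segment_ids = np.zeros(seq_len, dtype=int)     # every token belongs to sentence A

# BERT's input is the element-wise sum of the three embedding vectors
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[:seq_len]
```

Each token position thus receives one vector that mixes word identity, sentence membership, and position before entering the transformer layers.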

CharCNN Character Embedding.
The character embedding layer uses CharCNN network encoding to map words into character-level vector representations. Specifically, the input text is converted into character encodings and fed to a one-dimensional CNN; after the one-dimensional convolution produces outputs of a specific width, max pooling is performed to obtain character vector representations of a specific dimension. After processing by the CharCNN network, the character-level vector representation X_char_t is obtained, which represents the character embedding vector of the sentence at time t.
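The convolution-plus-max-pooling step above can be sketched directly in numpy. This is an illustrative toy under assumed sizes (7 characters, window 3, 50 filters); the function name and shapes are this sketch's assumptions, not the paper's code:

```python
import numpy as np

def charcnn_word_vector(char_embs, kernel, width=3):
    """1-D convolution over a word's character embeddings, then max pooling."""
    n_chars, dim = char_embs.shape
    n_filters = kernel.shape[0]            # kernel: (n_filters, width, dim)
    windows = n_chars - width + 1
    conv = np.empty((windows, n_filters))
    for i in range(windows):
        # each filter responds to one window of `width` consecutive characters
        conv[i] = np.tensordot(kernel, char_embs[i:i + width],
                               axes=([1, 2], [0, 1]))
    return conv.max(axis=0)                # max over positions -> fixed-size vector

rng = np.random.default_rng(1)
chars = rng.normal(size=(7, 10))           # 7 characters, 10-dim char embeddings
filters = rng.normal(size=(50, 3, 10))     # 50 filters with window size 3
vec = charcnn_word_vector(chars, filters)  # 50-dim character-level word vector
```

Max pooling over positions is what makes the output dimension independent of word length, so words of any length map to the same-sized character vector.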

FastText Word Embedding.
The word vector model maps the sentence into a word-level vector representation. This paper uses word vectors pretrained by FastText to obtain the word representation of the sentence, where X_word_t represents the word vector representation of the sentence at time t.
After obtaining the character vector, word vector, and context representation, the mapped results are concatenated and sent to the next network. For a sentence consisting of t tokens, the output sequence vector is

X_t = [X_lm_t ; X_char_t ; X_word_t],

where t represents the current time step, [;] represents concatenation, and X_lm_t, X_char_t, and X_word_t, respectively, represent the contextual, character-level, and word-level embedding vectors described above [11].

Shared Feature Representation Module.
The shared feature representation module first passes the BERT and other output vectors through the BiLSTM layer to obtain more comprehensive semantic information. LSTM can effectively alleviate the vanishing gradient or exploding gradient phenomenon of recurrent neural networks [12]. BiLSTM is composed of a forward LSTM and a backward LSTM.
The LSTMs in the two directions are combined to obtain bidirectional word vector information. This paper adopts the BiLSTM structure to model contextual information: after obtaining the embedding vectors from the embedding layer, BiLSTM models the contextual interactions of the sentence, computing the left-to-right and right-to-left representations separately for each sentence.
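The per-token concatenation feeding this shared layer can be sketched as follows. The dimensions are illustrative toys (they are not the paper's 768/50/300 settings):

```python
import numpy as np

# toy per-token vectors; dimensions are illustrative, not the paper's settings
x_lm   = np.ones(4)            # contextual (BERT) embedding X_lm_t
x_char = np.full(3, 2.0)       # character-level (CharCNN) embedding X_char_t
x_word = np.full(5, 3.0)       # word-level (FastText) embedding X_word_t

# [;] in the text denotes concatenation: X_t = [X_lm_t ; X_char_t ; X_word_t]
x_t = np.concatenate([x_lm, x_char, x_word])
```

The resulting X_t for each time step is the vector the BiLSTM consumes; its dimension is simply the sum of the three component dimensions.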

Head and Tail Span Representation.
The head and tail span representation module builds on the output of the BiLSTM layer. Two MLPs are applied to each hidden state before the next layer, creating two different representations (s, e) as the start and end of the entity span: s carries information identifying the head of the entity, e carries information identifying the tail of the entity, and other redundant information is removed. The two MLP layers learn the head and tail representations of the span and are set to a lower dimension, alleviating the overfitting produced by the output of the BiLSTM network and capturing more features of the text. The encoded representation is fed into MLP and softmax classifiers, which detect whether a word is the beginning or end of an entity and generate span representations carrying entity head and tail information, respectively. s(t) and e(t), the head and tail span representations of the entity, are given by formulas (2) and (3):

s(t) = MLP_s(h_t), (2)
e(t) = MLP_e(h_t), (3)

where h_t represents the hidden-layer output of BiLSTM, and MLP_s and MLP_e represent the two multilayer perceptrons processing head and tail information, respectively. Each (s(t), e(t)) token pair is fed to the underlying network for the associated task.
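Formulas (2) and (3) can be sketched with two small single-layer perceptrons. This is a minimal sketch assuming ReLU activations and toy dimensions; the weight names are this example's assumptions:

```python
import numpy as np

def mlp(h, W, b):
    """Single-layer perceptron with ReLU, projecting to a lower dimension."""
    return np.maximum(0.0, h @ W + b)

rng = np.random.default_rng(2)
hidden = 8   # BiLSTM hidden size (toy)
low = 4      # lower MLP output dimension, as the text suggests to ease overfitting

h_t = rng.normal(size=hidden)                      # BiLSTM output at time t
Ws, bs = rng.normal(size=(hidden, low)), np.zeros(low)
We, be = rng.normal(size=(hidden, low)), np.zeros(low)

s_t = mlp(h_t, Ws, bs)   # head (start) span representation, cf. formula (2)
e_t = mlp(h_t, We, be)   # tail (end) span representation, cf. formula (3)
```

Using two separate weight sets is the point: the same hidden state h_t is projected into one view specialized for entity starts and another for entity ends.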

Multi-Task Learning Module.
There are two subtasks in the MTL-BAM model: the entity boundary detection module and the entity span classification module. The two tasks are described as follows.

Entity Span Classification Task.
The entity span classification task uses the biaffine mechanism. Unlike a traditional MLP, the biaffine attention mechanism uses a bilinear layer instead of two linear layers and one nonlinear layer, and is simpler than traditional MLP networks. After obtaining (s, e) above, they are input into the biaffine network to obtain a score matrix. Figure 4 shows the entity span matrix constructed for the entity span classification task. In "damage to the respiratory center", "damage" is the beginning of the entity and "center" is the end. The constituted entity is scored by the following formula against all entity types contained in the current data set; since the entity type with the highest category score is clinical manifestation, "damage to the respiratory center" is identified as a clinical manifestation entity. This is a fixed-category classification problem in which the prior probability of the head and tail spans must be considered together with the posterior probability that the head and tail words belong to a certain category:

r_m(i) = s(i)^T U^(1) e(i) + U^(2) (s(i) ⊕ e(i)) + b.

For entity fragment i, r_m(i) gives the score that the current fragment constitutes each named entity category, under the restriction that the entity's start position precedes its end position. Here s(i) and e(i) represent the head and tail representations of the i-th fragment, ⊕ represents vector concatenation, and s(i)^T represents the transpose of the s(i) vector. U^(1) captures the posterior probability that the current words are simultaneously the head and tail of an entity category, U^(2) captures the posterior probability that the current word is the head or the tail of an entity category, and b represents the prior probability when the entity class is unknown.
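The biaffine score above can be sketched per span in numpy. The formula shape follows the symbols defined in the text (U^(1) bilinear term, U^(2) on the concatenation, bias b); the tensor layouts and toy sizes here are this sketch's assumptions:

```python
import numpy as np

def biaffine_score(s_i, e_i, U1, U2, b):
    """r_m(i) = s(i)^T U^(1) e(i) + U^(2) (s(i) ⊕ e(i)) + b, one score per class."""
    concat = np.concatenate([s_i, e_i])                 # s(i) ⊕ e(i)
    bilinear = np.einsum('d,cde,e->c', s_i, U1, e_i)    # s(i)^T U^(1) e(i) per class c
    return bilinear + U2 @ concat + b

rng = np.random.default_rng(3)
d, n_classes = 4, 3                                     # toy sizes
s_i, e_i = rng.normal(size=d), rng.normal(size=d)
U1 = rng.normal(size=(n_classes, d, d))                 # one bilinear form per class
U2 = rng.normal(size=(n_classes, 2 * d))                # linear term on concatenation
b = np.zeros(n_classes)

scores = biaffine_score(s_i, e_i, U1, U2, b)
predicted_class = int(np.argmax(scores))                # highest-scoring entity type
```

The bilinear term is what lets every head representation interact multiplicatively with every tail representation, giving the model the global start-end view the text describes.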
To determine the category of the entity spanned by the head and tail, r_m(i) gives the category scores of all possible fragments that currently constitute a named entity, and the category with the highest score is taken as the prediction for each span:

y'(i) = argmax_c r_m(i)[c].

After predicting the category of each entity segment, the spans of all entity categories are sorted in descending order of score, and the following postprocessing protocol is adopted: for nested entities, it is judged whether different entities partially overlap, and if so, the entity with the highest score is retained. For the i-th and j-th entities, s_i and s_j, respectively, represent the starting positions and e_i and e_j the end positions. If the partial overlap satisfies s_i < s_j < e_i < e_j, the highest-scoring entity and its category are retained.
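The postprocessing protocol, greedy selection by score with the s_i < s_j < e_i < e_j crossing test, can be sketched as follows. The function and label names are illustrative, not the paper's code:

```python
def resolve_overlaps(spans):
    """Keep spans greedily by score; drop spans that *partially* overlap a kept one.

    spans: list of (score, start, end, label). Partial overlap means crossing
    boundaries, s_i < s_j < e_i < e_j; full nesting is allowed and kept.
    """
    kept = []
    for score, s, e, label in sorted(spans, reverse=True):   # descending score
        crossing = any(s < s2 < e < e2 or s2 < s < e2 < e
                       for _, s2, e2, _ in kept)
        if not crossing:
            kept.append((score, s, e, label))
    return kept

spans = [(0.9, 0, 3, "DIS"), (0.7, 2, 5, "DIS"), (0.6, 1, 2, "ANAT")]
kept = resolve_overlaps(spans)
# (2, 5) crosses (0, 3) and is dropped; nested (1, 2) inside (0, 3) is kept
```

Note that only crossing spans are removed; fully nested spans survive, which is exactly what a nested NER decoder requires.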
The learning objective of entity span classification is to assign a correct class (including the non-entity class) to each valid interval. It is therefore a multiclass classification problem, and the model is optimized with softmax cross-entropy. The loss is

loss_b = -Σ_{i=1}^{N} Σ_{c=1}^{C} y_ic log p_m(i_c),

where N represents the length of the sentence, C represents the number of entity label types, and y represents the actual label type of the current word: y_ic is 1 if the current category is c and 0 otherwise. p_m(i_c), the output of the neural network, is the probability that the category is c; this value is computed using the softmax mentioned above. Finally, the loss loss_b of the span classification module is obtained.
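The softmax cross-entropy objective can be checked numerically with a toy example. The logits and labels below are invented for illustration:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    ez = np.exp(z)
    return ez / ez.sum(axis=-1, keepdims=True)

def span_classification_loss(logits, labels):
    """Softmax cross-entropy over spans: loss_b = -Σ_i Σ_c y_ic log p_m(i_c)."""
    p = softmax(logits)                               # p_m(i_c)
    n = logits.shape[0]
    # with one-hot y_ic, the double sum reduces to picking the gold-class prob
    return float(-np.log(p[np.arange(n), labels]).sum())

logits = np.array([[2.0, 0.5, -1.0],                  # toy scores: 2 spans, 3 classes
                   [0.0, 3.0, 0.0]])
labels = np.array([0, 1])                             # gold classes (0 = non-entity)
loss_b = span_classification_loss(logits, labels)
```

Because y_ic is one-hot, the inner sum over c collapses to the log-probability of the gold class, which is the usual implementation shortcut.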

Entity Boundary Detection Task.
When using the biaffine mechanism for entity span classification, the introduction of head and tail span information makes it easy to identify nested entities. However, learning only head and tail spans leaves the model without clear boundary supervision, which reduces the number of accurate candidate entities learned and weakens entity recognition. Even for nested entities, the recognition of head and tail spans does not connect internal and external entities, and the lack of accurate external boundary information causes internal entity recognition errors and reduces the recognition effect. Therefore, a multi-task learning method is introduced, and a boundary detection module is constructed to assist entity category prediction. The boundary detection model is shown in Figure 5. After obtaining the head and tail span representations, the shared feature representation of multi-task learning is used as input and passed through a ReLU activation function and a Softmax classifier to predict boundary labels; training is faster because Softmax also incorporates the mutual exclusion information between classes. For each token in the sentence, the calculation is

d(t) = Softmax(U[s(t), e(t)] + b),

where U and b are trainable parameters, s and e represent the span representations of head and tail information, and "," represents the concatenation of vectors. d(t) is the output of the Softmax layer, indicating the probability that the current token is "O" or "I". We compute the cross-entropy loss (10) between the true boundary distribution y(t) and the predicted distribution d(t):

loss_d = -Σ_t y(t) · log d(t), (10)

where y(t) is the one-hot true boundary label of token t. Since the model shares the same entity boundaries when performing entity boundary detection and entity class judgment, the losses of the two tasks of entity boundary detection and entity span classification are jointly trained.
In the training phase, the real entity boundary labels of the data are input into the model to train the entity boundary detection classifier to avoid the classifier being affected by false boundary detection during training. During the testing phase, the output of the boundary detection classifier is used to indicate which entity fragments should be considered when predicting the classification labels.

Multi-Task Learning Loss.
Therefore, the total loss of the model is the sum of the losses of entity boundary detection and entity category judgment:

Loss = loss_b + α · loss_d.

Computational Intelligence and Neuroscience
α is a hyperparameter used as a mixing ratio to control the relative importance of the two losses of entity boundary detection and entity category judgment. The boundary detection module obtains the boundary information of the entity, and thereby the internal and contextual information of the current entity; richer features are learned through multi-task learning, the head and tail span representations are optimized through back-propagation of the loss function, and entity span classification becomes more accurate. Acquiring the external information of an entity improves the recognition of inner entities, and inner entity information is in turn transferred to the boundary detection model through the multi-task model, promoting the boundary detection effect.
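The mixing is a one-line combination of the two sub-task losses; the numbers and the α value below are purely illustrative:

```python
# Multi-task objective: Loss = loss_b + α * loss_d, with α the mixing ratio
loss_b, loss_d = 0.42, 0.18            # illustrative sub-task loss values
alpha = 0.5                            # hypothetical mixing hyperparameter
total_loss = loss_b + alpha * loss_d   # single scalar back-propagated to both tasks
```

Back-propagating this single scalar is what couples the tasks: the shared BiLSTM and the (s, e) projections receive gradient from both terms, weighted by α.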
By implementing the above process, the algorithm flow of this model is shown in Table 1.

Experimental Environment Parameter Settings.
This paper uses the Windows system for experiments, based on the Python platform, with PyCharm as the development tool. The model is constructed using the open-source deep learning framework TensorFlow, which is developed and maintained by Google Brain, Google's artificial intelligence team. It can be deployed on various servers, PC terminals, and web pages, and supports GPU high-performance numerical computing.
The named entity recognition model based on multi-task learning and the biaffine mechanism (MTL-BAM) is evaluated on two flat entity datasets, JNLPBA [13] and CoNLL2003, and three nested entity datasets, GENIA [14], ACE2005, and ACE2004 [15]. It is compared with BAM, which uses only the biaffine mechanism for entity recognition without the multi-task learning framework, and the experimental results are also compared with other models that have achieved results on named entity recognition tasks. The evaluation indicators used are the precision rate P, the recall rate R, and the F1 value. The experimental parameter settings are shown in Table 2. The convolution kernel window sizes of the convolutional layer of the CharCNN network are set to 3, 4, and 5, and the character vector dimension is 50. The public pretrained FastText word vectors are used, with the word vector dimension set to 300. BiLSTM uses 3 layers with an output dimension of 200, and the dimension of the two fully connected layers is set to 150, reducing the output dimension of the LSTM to prevent overfitting. Also to prevent overfitting, the dropout [16] of the BiLSTM layer and the MLP layer is set to 0.4 and 0.2, respectively. The Adam optimizer [17] is used to update the model parameters during training, with the learning rate set to 0.001.
Table 1: Algorithm flow of the MTL-BAM model.
2: X_lm_t is obtained using BERT
3: X_char_t is obtained using CharCNN network encoding
4: The corresponding word vector X_word_t is obtained through FastText
5: Concatenate X_lm_t, X_char_t, and X_word_t to get X_t
6: While (not traversed all X_t) do
7: Input the BiLSTM layer to get the output h_t
8: Use two multilayer perceptrons on h_t to get s(t), e(t)
9: While (NER model parameters have not converged) do
10: While (not traversed all s(t), e(t)) do
11: Input the biaffine network and train to get loss_b
12: Input the boundary detection module and train to get loss_d
13: Multi_Loss = loss_b + α · loss_d

The experimental results on the GENIA dataset are shown in Table 3; the best results in each group of experiments are shown in bold. Figure 6 is a histogram of the data distribution corresponding to Table 3.
It can be seen from the figure and table that the recognition result for the RNA entity type is the highest: RNA boundaries in the data set are generally indicated by mRNA and RNA, so the model has learned the boundary information of RNA and recognizes it best. It follows that entity boundary information plays an important role in accurate entity identification. Except for a 0.05% drop on the DNA type, the F1 value of all other entity types improves to some extent: the cell_type entity type improves least, by 0.06%, the cell_line entity type improves most, by 1.44%, and the overall F1 value increases by 0.22%, indicating that the MTL-BAM model has a certain effect compared with the BAM model. As the bold entries in the table show, after adding the multi-task learning model the overall recall rate improves. The reason is that the boundary detection module enhances entity boundary supervision and, by obtaining entity boundary information, captures entity context representation information; the span classification module is provided with a boundary representation, enabling the model to extract more correct entity segments. At the same time, in terms of precision, the results for RNA and cell_line, the two rarest types, improve, indicating that the boundary detection module strengthens the connection between the inside and outside of nested entities so that sparse entities learn more internal and external features. For the other entity types, the precision shows a downward trend, possibly because of the features extracted; the model should also adopt more effective multi-task learning methods to strengthen the connection between internal and external entities and improve precision. Overall, the multi-task model improves on the single-task biaffine mechanism.

Analysis of Experimental Results of ACE2004 and ACE2005 Datasets.
In the experiment, the performance of the MTL-BAM model on seven different entity categories of the two nested datasets ACE2004 and ACE2005 was further verified.
The experimental results on the ACE2004 and ACE2005 datasets are shown in Tables 4 and 5.
It can be seen from the table that the F1 value of the MTL-BAM model on the ACE2005 dataset is 0.39% higher than that of the BAM model; except for the FAC entity category, where the recall rate is unchanged, the recall rates of the other six entity labels improve, and the overall recall rate improves by 0.79%. On the ACE2004 dataset, the overall F1 value of the model increases by 0.33%; except for an unchanged recall rate on the FAC entity category, the other recall rates improve, and the overall recall rate increases by 2.08%. The recall rate of the model in this chapter is higher than that of the BAM model on both datasets, and the overall F1 value is also higher, which proves the effectiveness of the MTL-BAM model on multitype nested datasets.

Analysis of Nested Dataset Comparison Experiment Results.
The MTL-BAM model is compared with several existing neural-network-based nested named entity recognition models. The experimental comparison results are shown in Table 6. The precision and recall of the experimental results are shown, but the final comparison with the other entity recognition models uses only the F1 result.
In the table, Ju et al. [1] and Zheng et al. [18] are methods based on sequence annotation; Katiyar and Cardie, and Wang and Lu, are methods based on hypergraphs; Luan et al. and Sohrab and Miwa are span-based methods. Straková et al. [19] use a linear model for nested label encoding. The method proposed in this chapter exceeds all of the above methods in recall rate and F1 value. Its precision rate on the GENIA dataset is lower than that of Sohrab and Miwa, but the large gap between that model's precision and recall shows that span-based models suffer from high precision and low recall due to incomplete span detection. Compared with Sohrab and Miwa, the recall rate of this paper improves by 16.68%, and the overall result is 3.64% higher. For the ACE2004 dataset, the F1 value is the highest result among current models, 0.63% higher than the Luan model. The F1 value of the model in this chapter is also the highest on the ACE2005 dataset, 0.88% higher than that of Straková. These comparisons verify the effectiveness of the model on nested datasets: the proposed multi-task framework enables the boundary detection module to enhance entity boundary supervision, obtain entity context representation information from entity boundary information, and provide a boundary representation for the entity span classification module, improving entity recognition. At the same time, it shows that the model in this chapter makes an important contribution to improving the recall rate and balancing the F1 value.

Analysis of the Experimental Results of the JNLPBA Dataset.
The experimentally compared models for the JNLPBA dataset include the models of Wang et al. [20] and Song [21]. The former proposes sharing character- and word-level information between related biomedical entities across different labeled corpora. The latter uses BioBERT, a domain-specific language representation model pretrained on a large-scale biomedical corpus with the same principles as the BERT model. Some of the models mentioned above are also included. The comparison results of the MTL-BAM and BAM models on the JNLPBA dataset, under the same experimental environment, are shown in Table 7.
On the JNLPBA dataset, the F1 value of the model in this paper has increased by 0.35%, the recall rate has increased by 1.25%, and the accuracy has decreased to a certain extent.
The experimental results verify that the model is equally effective on flat entities and nested entities. Table 8 shows the experimental comparison results between the MTL-BAM model and the other entity recognition models mentioned above. Compared with the other entity recognition models, MTL-BAM is 1.15% higher than BioBERT, which uses a biomedical corpus as a pretraining model, verifying that the MTL-BAM model is effective on the flat-entity JNLPBA dataset.

Analysis of Experimental Results of CoNLL2003 Dataset.
For the CoNLL2003 dataset, the following models are used for experimental comparison: the sequence annotation model proposed by Lample et al. [22] uses the BiLSTM-CRF model to recognize flat entities. Strubell [23] proposed an iterated dilated convolutional neural network (ID-CNN) for entity recognition, which has better large-context and structured prediction capabilities than traditional CNNs. Devlin et al. [24] proposed fine-tuning the BERT pretraining model, pretrained on a large corpus, to improve entity recognition. Akbik et al. [25] exploit the internal states of a trained character language model to generate a novel word embedding that enhances contextual representation and improves entity recognition. The comparison results on the CoNLL2003 dataset against the BAM model, under the same experimental environment, are shown in Table 9.
On the CoNLL2003 dataset, the F1 value of the model in this paper has increased by 0.2%, the recall rate has increased by 0.66%, and the accuracy has decreased to a certain extent.
The experimental results verify that the model is equally effective on the flat-entity CoNLL2003 dataset and on nested entities. Table 10 shows the experimental comparison results between the MTL-BAM model and the other entity recognition models mentioned above; only the F1 value is compared on the CoNLL2003 dataset.
It can be seen from the table that the MTL-BAM model is 0.3% higher than the other entity recognition models. The model proposed in this chapter thus also has a certain effect on the two flat datasets, which shows that the named entity recognition model based on multi-task learning and the biaffine mechanism is versatile across datasets. However, on the CoNLL2003 dataset the MTL-BAM model still has a certain gap compared with the current SOTA method [26], mainly because the method proposed in this paper targets nested named entities.

Conclusions
The various experimental results show that the multi-task learning framework improves final performance to a certain extent, whether over the baseline model or over other previously studied models, and that the method is also effective on the flat entity recognition task. With no performance penalty, it is a general framework that can be used for both nested and flat entity recognition tasks.
In addition, the model in this paper still has many shortcomings. The current use of multi-task learning to build a boundary detection model positively promotes the entity classification module. However, for nested entities, while the outer boundary can supervise the boundary of the inner entity, the information of the inner entity does not yet guide the recognition of the outer entity, resulting in only a small improvement for the model as a whole. Secondly, this paper only studies flat and nested entities, while more complex entity types exist in practical applications, such as discontinuous entities. Therefore, future work can improve the two-way interaction between the two tasks and study using this model, or an improved version of it, to recognize various complex entity types and improve recognition across entity types.

Data Availability
The GENIA dataset is provided by the GENIA website (https://www.geniaproject.org/genia-corpus). The ACE2004 dataset is provided by the LDC website (https://catalog.ldc.upenn.edu/LDC2005T09). The ACE2005 dataset is provided by the LDC website.