MAF-CNER : A Chinese Named Entity Recognition Model Based on Multifeature Adaptive Fusion

Named entity recognition (NER) is a subtask of natural language processing, and its accuracy greatly affects the effectiveness of downstream tasks. Aiming at the problem of insufficient expression of latent Chinese-character features in NER tasks, this paper proposes a multifeature adaptive fusion Chinese named entity recognition (MAF-CNER) model. The model uses a bidirectional long short-term memory (BiLSTM) neural network to extract stroke and radical features and adopts a weighted concatenation method to fuse the two sets of features adaptively. This method integrates the two sets of features more effectively, thereby improving the model's entity recognition ability. To fully test the entity recognition performance of this model, we compared it with the basic model and other mainstream models on the Microsoft Research Asia (MSRA) dataset and the "China People's Daily" dataset from January to June 1998. Experimental results show that this model outperforms the others, with F1 values of 97.01% and 96.78%, respectively.


Introduction
Word representation learning has received wide attention as a basic problem in the field of natural language processing. Unlike traditional one-hot representations, low-dimensional distributed vocabulary representations (also called word embeddings) represent words as low-dimensional dense real-valued vectors, which better capture the associations between natural language words. This form of representation is very useful in downstream natural language processing tasks, for example, text classification [1], NER [2,3], relation extraction [4,5], and sentiment analysis [6,7]. Therefore, how to obtain better semantic representations of words is crucial.
In recent years, NER models have mainly been based on deep learning, and with the development of deep learning, increasingly remarkable results have been achieved. The main basic model framework for English NER is the Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) [8], which uses word embeddings as the basic unit for predicting labels. English uses a phonetic alphabet, whereas Chinese characters carry meaning directly, so these research methods cannot be applied to Chinese without change. Unlike English, Chinese sentences do not have obvious separators between words. Therefore, when processing Chinese NER tasks, word segmentation tools are first used to segment sentences, and a sequence tagging model is then applied to the segmented words. This approach results in poor performance because CNER faces the following difficulties: (1) The quality of sentence segmentation has a great impact on NER performance. For example, "武汉市长江大桥" (Wuhan Yangtze River Bridge) is a single location entity, but after segmentation by a word segmentation tool it may be split into "武汉" (Wuhan), "市长" (mayor), and "江大桥" (river bridge). When these segments are used as input to the NER model, they will be recognized as three different named entities. (2) To avoid such word-level segmentation errors, character-level embedding is widely used in NER tasks, but it still has shortcomings. As shown in Figure 1, "人" (person), "八" (eight), and "乂" (yi) are semantically unrelated, yet all three share the stroke sequence "丿" (left-falling stroke) followed by "乀" (right-falling stroke). Chinese characters with the same stroke sequence can have completely different semantics, so the stroke sequence of a Chinese character cannot uniquely identify it. Similarly, using the radical feature alone encounters the same problem. To solve this problem, we can introduce another internal characteristic of Chinese characters, the radical: the radicals of "人", "八", and "乂" are "人", "八", and "丿", respectively. Combining the stroke and radical features can distinguish Chinese characters well (Figure 1). Integrating the internal characteristics of Chinese characters is effective for learning Chinese word embeddings [9]. For example, Yin et al. used Convolutional Neural Networks (CNNs) to extract radical features, aiming to capture the intrinsic correlations of characters; experimental results show that their model achieved good performance on Chinese clinical NER [10]. Chinese characters have rich internal structural features. How to better learn and use these features to improve the quality of Chinese character embeddings is very important, and how to better combine character-level features with the internal characteristics of Chinese characters deserves further study.
This article designs a multifeature adaptive fusion (MAF) method to fuse the stroke features and radical features of Chinese characters. This method adaptively calculates the weights used to fuse the stroke and radical features. The main contributions of this article can be summarized as follows:

Related Works
The traditional solutions to the NER problem mainly include three methods: rule-based, statistics-based, and dictionary-based [11]. Methods based on rules and dictionaries require professional linguists to write rules by hand, cost a great deal of time, and port poorly across domains. Among statistical methods for NER, Conditional Random Fields (CRF) and Hidden Markov Models (HMM) are mainly used [12,13]. Although their accuracy improves on rule- and dictionary-based methods, they still have disadvantages such as long training times.
With the continuous development of deep learning, researchers began to apply it to NER tasks. Compared with traditional models, neural network models can learn deeper semantic feature information with almost no feature engineering [14] or domain knowledge.
These models further improve the accuracy of entity recognition; the BiLSTM-CRF model [15,16] in particular can significantly improve the performance of NER tasks.
The standard model for solving NER problems in the English domain is the BiLSTM-CRF model proposed by Huang et al. [17], which is more robust and less dependent on word embeddings. Based on this structure, Lample et al. proposed using BiLSTM to extract word representations from character-level embeddings. Cho et al. proposed a deep-learning NER model that effectively represents biomedical word tokens through a combinatorial feature embedding, enhanced by integrating two different character-level representations extracted from a CNN and a BiLSTM [18]. In the Chinese field, CNER is more challenging [19]. Wang et al. proposed a CNN model based on a gating mechanism (GCNN) [20]. Cao et al. used Chinese character strokes as features and proposed the stroke n-gram model, which not only mines the feature information of Chinese character strokes but also uses the semantic information of Chinese characters more effectively to train word vectors [21]. Cao et al. also proposed a novel adversarial transfer learning framework to make full use of the boundary information shared between tasks and to prevent interference from the task-specific features of Chinese word segmentation [22]. Xu et al. proposed a simple and effective neural network framework, ME-CNER (Multiple Embeddings for Chinese Named Entity Recognition), which embeds rich semantic information at multiple levels, from radicals and characters to words [23]. Wu et al. proposed a radical-based CNER model, RCBC (R-CNN-BiLSTM-CRF). The RCBC model uses CNNs to automatically extract the semantics of the radicals of Chinese characters and combines the word vectors and radical vectors into a joint vector; this method reduces the semantic deviation of radical features and captures semantic information more accurately [24]. Ye et al. proposed a CNER model based on character-word vector fusion, which reduces the dependence on the accuracy of word segmentation algorithms and effectively utilizes the semantic features of words [25]. In order to address the ambiguity of Chinese words and the lack of word boundaries, Wu et al. proposed a novel fine-grained character-level representation method to capture the semantic information of Chinese characters [26]. Although the above methods have achieved good results, none of them explores the internal characteristics of Chinese characters in more depth, and the fusion methods between multiple features can be studied further.

The proposed model is divided into three layers: a character, stroke, and radical multifeature vector fusion layer; a BiLSTM layer; and a CRF layer. The radical and stroke feature representations are calculated by BiLSTM networks, merged using the weighted concatenation method, and concatenated with the character vector to form the final input vector. The BiLSTM layer then extracts the context features of the input vectors. The input of the CRF layer is the output vector of the BiLSTM layer; the CRF layer decodes this information to obtain the best tag sequence. We introduce the components of the MAF-based Chinese NER model from bottom to top, as follows.

Character, Stroke, and Radical Multifeature Vector Fusion Layer.
For a given sentence sequence x = (c_1, c_2, ..., c_n), the embedding vector of each character is composed of three parts: the character feature e^c ∈ R^{d_c}, the radical feature e^r ∈ R^{d_r}, and the stroke feature e^s ∈ R^{d_s}. As shown in Figure 3, the embedding vector of each character c_i can be expressed as

e_i = e_i^c ⊕ m · e_i^r ⊕ n · e_i^s, (1)

where ⊕ denotes vector concatenation and m and n are the adaptive fusion weights.
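As a concrete illustration, the weighted concatenation of formula (1) can be sketched as follows. This is a minimal sketch: the dimensions of the radical and stroke vectors and the values of m and n are placeholder assumptions, not the paper's settings.

```python
import numpy as np

def fuse_features(e_c, e_r, e_s, m, n):
    """Weighted concatenation e_i = e_c ⊕ m·e_r ⊕ n·e_s from formula (1)."""
    return np.concatenate([e_c, m * e_r, n * e_s])

# Placeholder dimensions: d_c = 100 (as pretrained), d_r = d_s = 50 (assumed).
e_c, e_r, e_s = np.ones(100), np.ones(50), np.ones(50)
e_i = fuse_features(e_c, e_r, e_s, m=0.7, n=0.3)
print(e_i.shape)  # (200,)
```

The fused vector is simply longer than the character vector; the weights m and n rescale the auxiliary features before concatenation rather than mixing them element-wise.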

Character Embedding.
Character-level embedding has been widely used in natural language processing. Research shows that pretraining character embeddings on a domain-specific corpus can improve system performance. For example, adding character-level features in neural machine translation [27,28] improves translation performance, and text classification [29,30] and NER also use character-level representations. Therefore, pretrained character embeddings outperform randomly initialized ones. This article uses the Chinese Wikipedia corpus of May 2020 to pretrain Chinese character embeddings with Word2Vec. After preprocessing, about 171M of training corpus is obtained. The pretraining of character embeddings is implemented with the Python version of Word2Vec in Gensim, and the dimension of the feature vector is set to 100.
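The pretraining pipeline described above can be sketched as follows. The helper below is our own illustration, not the paper's code: it splits raw corpus lines into per-character token lists, which is the input format Gensim's `Word2Vec` expects for character-level training; the Gensim call itself is shown only in a comment.

```python
def to_char_sentences(corpus_lines):
    """Turn raw sentences into per-character token lists for
    character-level Word2Vec pretraining."""
    return [[ch for ch in line if not ch.isspace()] for line in corpus_lines]

sentences = to_char_sentences(["武汉市长江大桥", "人民 日报"])
print(sentences[1])  # ['人', '民', '日', '报']

# With Gensim (not executed here; vector_size=100 matches the paper):
# from gensim.models import Word2Vec
# model = Word2Vec(sentences, vector_size=100, min_count=1)
```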

Radical Features.
A Chinese character is a kind of pictograph, and the radical is the first stroke or graphical component under which a character is indexed. One of the most notable features of Chinese characters is that they contain rich semantic information at the radical level, and the radicals of Chinese characters have a very important impact on their semantics. For example, in "胖" (fat), "胸" (chest), and "肺" (lung), the shared radical "月" (moon) is a simplified form of "肉" (flesh), which stands for meat, indicating that these characters are related to organs. A total of 228 radicals, such as "鹿" (deer), "卤" (halogen), and "丶" (dot), are numbered from 1 to 228. However, research with traditional models mainly focuses on semantics at the phrase level.
This article uses a BiLSTM network to extract the semantic information of the radicals of Chinese characters. Figure 4 shows the overall structure of the model in detail. The expression is as follows:

h_t = [→h_t ; ←h_t], (2)

where →h_t and ←h_t are the forward and backward hidden-layer vectors obtained by training the BiLSTM network.
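The bidirectional concatenation in formula (2) can be illustrated with a toy NumPy sketch. A plain tanh recurrence stands in for each LSTM direction here, which is a simplification of the paper's BiLSTM:

```python
import numpy as np

def rnn_pass(xs, W, U):
    """Toy tanh recurrence standing in for one LSTM direction."""
    h = np.zeros(W.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return states

def bilstm_states(xs, W_f, U_f, W_b, U_b):
    """h_t = [→h_t ; ←h_t]: concatenate forward and backward states."""
    fwd = rnn_pass(xs, W_f, U_f)
    bwd = rnn_pass(xs[::-1], W_b, U_b)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

With a 5-dimensional hidden state per direction, each h_t is 10-dimensional; the backward pass is run over the reversed sequence and re-reversed so that forward and backward states align per time step.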

Stroke Characteristics. A stroke usually refers to the uninterrupted dots and lines of various shapes that compose Chinese characters, such as the horizontal ("一"), vertical ("丨"), left-falling ("丿"), and dot ("丶") strokes. It is the smallest continuous writing unit that constitutes a Chinese character.
As shown in Table 1, we divide the strokes into five types, numbered 1 to 5. The Chinese writing system prescribes a stroke order for each Chinese character. With this stroke information, we can decompose Chinese characters into strokes in a specific stroke order.
This sequence information can be exploited when learning the internal semantic information of Chinese characters.
Therefore, this article uses a BiLSTM network to extract the contextual semantic information of Chinese character strokes; Figure 4 shows the model structure. This method can learn more of the graphical features of Chinese characters. The stroke sequence of each character is fed through the BiLSTM in the same way as in formula (2), yielding H_{i,j}, the j-th stroke feature vector of the i-th Chinese character.
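The five-category numbering of Table 1 can be sketched as a simple lookup. The category glyphs below follow the standard five-stroke classification (horizontal 1, vertical 2, left-falling 3, dot/right-falling 4, turning 5), which is an assumption and may differ in detail from the paper's table:

```python
# Assumed standard five-category stroke classification (cf. Table 1):
STROKE_IDS = {
    "一": 1,  # horizontal (heng)
    "丨": 2,  # vertical (shu)
    "丿": 3,  # left-falling (pie)
    "丶": 4,  # dot / right-falling (dian/na)
    "乛": 5,  # turning (zhe)
}

def encode_strokes(stroke_seq):
    """Map an ordered stroke sequence to its numeric ids."""
    return [STROKE_IDS[s] for s in stroke_seq]

print(encode_strokes(["一", "丨", "丿", "丶"]))  # [1, 2, 3, 4]
```

The resulting integer sequence is what the stroke-level BiLSTM would consume, after each id is mapped to a trainable stroke embedding.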

Adaptive Feature Fusion.
As shown in Figure 4, this article takes the stroke feature as the main feature, calculates its similarity with the character vector obtained by Word2Vec training, and determines its weight m according to formula (4).
Here e^c is the character vector and e^s is the stroke vector. The radical feature is used as an auxiliary feature; its importance is calculated according to formulas (5) and (6) to determine its weight n, and the weighted concatenation method is used to fuse the two sets of features.
This method can not only learn more graphical features of Chinese characters but also keep the combination of the two features balanced.
In formula (5), a and b are trainable parameters. The adaptively fused feature is then the weighted concatenation given in formula (1).
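Since formulas (4)-(6) are not reproduced here, the sketch below illustrates only one plausible reading: m as the cosine similarity between the character and stroke vectors (assumed to be projected to a common dimension), and n as a sigmoid gate over the radical feature with the trainable scalars a and b. Both functional forms are assumptions, not the paper's exact definitions.

```python
import numpy as np

def stroke_weight(e_c, e_s):
    """Assumed form of formula (4): cosine similarity between the character
    vector and the stroke vector (projected to the same dimension)."""
    return float(e_c @ e_s / (np.linalg.norm(e_c) * np.linalg.norm(e_s)))

def radical_weight(e_r, a, b):
    """Assumed form of formulas (5)-(6): a sigmoid gate with trainable
    scalars a and b scoring the radical feature's importance."""
    return float(1.0 / (1.0 + np.exp(-(a * np.linalg.norm(e_r) + b))))
```

Both weights land in a bounded range (m in [-1, 1], n in (0, 1)), which is what keeps the weighted concatenation of the two auxiliary features balanced against the character vector.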

CRF Layer.
Compared with HMM, CRF does not impose HMM's strict independence assumptions and can effectively use both the internal information of the sequence and external observations, avoiding the label bias problem by directly modeling the label sequence discriminatively. CRF can also capture dependencies between labels: for example, an "I-LOC" tag cannot follow "B-PER" [20]. In CNER, the input of the CRF layer is the context feature vector learned by the BiLSTM layer. For an input sentence, let P_{i,j} denote the probability score of the j-th label for the i-th Chinese character. For a prediction sequence y = (y_1, y_2, ..., y_n), the CRF score can be defined as

s(x, y) = Σ_{i=0}^{n} M_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i},

where M is the transition matrix, M_{i,j} represents the transition score from label i to label j, and y_0 and y_{n+1} represent the start and end tags, respectively. Finally, we use the softmax function to calculate the probability of the sequence y:

p(y | x) = exp(s(x, y)) / Σ_{ỹ ∈ Y_x} exp(s(x, ỹ)).

During training, we maximize the log probability of the correct label sequence:

log p(y | x) = s(x, y) − log Σ_{ỹ ∈ Y_x} exp(s(x, ỹ)).

In the decoding stage, we predict the output sequence with the maximum score:

y* = argmax_{ỹ ∈ Y_x} s(x, ỹ).

In the prediction stage, the Viterbi dynamic programming algorithm is used to solve for the optimal sequence.
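The decoding step can be made concrete with a minimal Viterbi sketch over the emission scores P and transition matrix M described above (start and end transitions are omitted for brevity):

```python
import numpy as np

def viterbi(P, M):
    """Find argmax_y sum_i P[i, y_i] + sum_i M[y_i, y_{i+1}].
    P: (n, k) emission scores; M: (k, k) transition scores."""
    n, k = P.shape
    score = P[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        total = score[:, None] + M + P[t][None, :]  # (k, k) candidate scores
        back[t] = total.argmax(axis=0)              # best predecessor per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

P = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
M = np.zeros((2, 2))
print(viterbi(P, M))  # [1, 1, 0]
```

With zero transition scores the best path reduces to the per-step argmax; a nonzero M is what lets the decoder forbid sequences such as "I-LOC" after "B-PER".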

Experimental Data and Evaluation Indicators.
To evaluate the proposed model on the CNER task, we conducted experiments on two widely used datasets: the MSRA dataset and the "China People's Daily" dataset from January to June 1998. Table 2 shows the statistics of the datasets used in this article.

MSRA.
It is a general-purpose dataset for CNER. The dataset contains three types of named entities: PER (person), LOC (location), and ORG (organization).
The training set contains 46,364 sentences, and the test set contains 4,365 sentences.
This article uses the ternary tag set {B, I, O}: B marks the first character of an entity, I the remaining characters of the entity, and O non-entity characters.
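The {B, I, O} scheme can be sketched in a few lines, reusing the "武汉市长江大桥" example from the introduction (the span indices are illustrative):

```python
def bio_tags(n_chars, entities):
    """entities: list of (start, end) character spans, end exclusive."""
    tags = ["O"] * n_chars
    for start, end in entities:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

chars = list("武汉市长江大桥很美")
print(bio_tags(len(chars), [(0, 7)]))
# ['B', 'I', 'I', 'I', 'I', 'I', 'I', 'O', 'O']
```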

China People's Daily.
The China People's Daily corpus, covering January to June 1998, was released by the Institute of Computational Linguistics of Peking University. The entity categories are PER (person), LOC (location), and ORG (organization), also labeled with the ternary tag set {B, I, O}. This article uses the data from January to May 1998 as the training and validation sets; the validation set is 1/5 of the January-May data. The data from June 1998 is used as the test set.

To fully evaluate the performance of the model, we use Precision (P), Recall (R), and the harmonic mean F1 score (F1) as the evaluation criteria, defined as P = TP / (TP + FP), R = TP / (TP + FN), and F1 = 2PR / (P + R).
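These definitions can be checked with a few lines of Python; the counts below are made-up illustrative numbers, not results from the paper:

```python
def precision_recall_f1(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(p, r, round(f1, 4))  # 0.9 0.75 0.8182
```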

Model Building and Parameter Setting.
The model in this article is built with PyTorch. PyTorch was launched by the Facebook Artificial Intelligence Research lab (FAIR) in January 2017, is based on Torch, and is widely used in applications such as NLP. The experimental parameters are set as follows: the embedding dimension (embedding_dim) is 300; the input length max_length is 80; the training batch_size is 100 for the China People's Daily dataset and 128 for the MSRA dataset. The learning rate is set to 0.001. To prevent overfitting during training, the weight decay factor weight_decay is set to 5e-4 and dropout is applied with a rate of 0.5.

Experimental Results.
To objectively evaluate the performance of this model on the MSRA dataset and the "China People's Daily" dataset, the LSTM, BiLSTM, and BiLSTM-CRF models are used for comparison.
The experimental results are shown in Table 3.
In Table 3, the comparison of the experimental results of LSTM and BiLSTM shows that the latter performs better. This verifies that the BiLSTM network can better capture the context information of serialized text and has stronger learning ability than LSTM. Comparing BiLSTM with BiLSTM-CRF, after adding the CRF module the BiLSTM-CRF model outperforms BiLSTM in every respect, mainly because CRF considers the global label information of the sequence during decoding, which improves the performance of the model. Our model introduces the stroke and radical features on top of character-level embedding, and its test results on the two datasets achieve the best performance.
To verify the effectiveness of this method, it is compared with other mainstream NER methods; the specific results are shown in Tables 4 and 5. In Table 4, a word-level CRF layer was then used to identify named entities, and the F1 value on the MSRA dataset reached 86.51% [32]. Zhou et al. treated CNER as a joint recognition and classification task based on a global linear model [33]; that model used the rich manual features proposed in the literature [41] to greatly improve the performance of CNER. Dong et al. proposed another BiLSTM-CRF neural network model [40]. This article uses BiLSTM-CRF as the basic model and introduces two kinds of internal semantic information of Chinese characters, strokes and radicals; the model's F1 increased to 96.78%.

Conclusion
In view of the insufficient representation of the latent features of Chinese characters, this article uses BiLSTM networks to learn the internal stroke and radical semantic information of Chinese characters and combines them with the BiLSTM-CRF model to construct an adaptive multifeature fusion embedding CNER model. The evaluation was conducted on the MSRA corpus and the "China People's Daily" corpus from January to June 1998. Compared with other mainstream methods, the model in this article achieves the best results on both corpora. The biggest advantage of this model is that the weighted concatenation method adaptively fuses two kinds of semantic information within Chinese characters, whereas previous research either stopped at word-level embedding or used only one kind of internal semantic feature of Chinese characters. That leaves the embedding layer insufficiently expressive, reduces model performance, and causes named entities to be identified incorrectly. Combining the two internal features represents Chinese character features more fully, avoids the problem that a single feature cannot correctly distinguish Chinese characters, and, through weighting, balances the proportion of the two kinds of semantic information to achieve the best combined effect.
This section introduces the network-layer organization of the multifeature adaptive fusion Chinese named entity recognition model, as shown in Figure 2.

Figure 4: Stroke and radical feature calculation and fusion.
TP (True Positive) denotes the number of positive samples correctly predicted as positive, FP (False Positive) the number of negative samples incorrectly predicted as positive, and FN (False Negative) the number of positive samples incorrectly predicted as negative.

Table 1: Stroke type and number.
σ represents the sigmoid activation function and tanh the hyperbolic tangent. x_t represents the unit input at time t; i_t, f_t, and o_t represent the input gate, forget gate, and output gate at time t, and W and b represent the corresponding weights and biases of those gates. c_t represents the cell state, c̃_t the candidate update, and h_t the output at time t. In order to use character context information from both directions, the model in this article uses BiLSTM, a combination of a forward LSTM and a backward LSTM, to obtain the context vector of each character. For a given sentence x = (x_1, x_2, ..., x_n), →h_t denotes the hidden state of the forward LSTM at time t, and ←h_t that of the backward LSTM.
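The gate equations described above can be sketched as a single NumPy LSTM step. This is a generic textbook LSTM cell, not the paper's exact parameterization; W, U, and b are assumed to stack the four transforms row-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; rows of W, U, b stack the input, forget, and output
    gates and the candidate transform (4h rows in total)."""
    h = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i_t = sigmoid(z[0:h])               # input gate
    f_t = sigmoid(z[h:2 * h])           # forget gate
    o_t = sigmoid(z[2 * h:3 * h])       # output gate
    c_tilde = np.tanh(z[3 * h:4 * h])   # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde  # new cell state
    h_t = o_t * np.tanh(c_t)            # new hidden state / output
    return h_t, c_t
```

Running this cell left-to-right gives →h_t; running it over the reversed sequence gives ←h_t, and formula (2) concatenates the two.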
Chen et al. used a CRF based on character features, and the F1 value was 86.20% [31]. The model of Zhou et al. was a multistage model; they first used a character-level CRF model to segment the sequence.
The F1 value of the model in [36] was close to 90.95%; this model used both character-level and radical-level representations in its input [34]. Zhang et al. used a lattice LSTM model for CNER, which encodes the input character sequence together with all candidate words matching a dictionary; its F1 value reached 93.18%, although the authors did not use a development dataset when training the lattice LSTM [35]. Zhao et al. used a pretrained language model to encode the input sequence as a contextual representation and designed a new model that combines neural networks with BERT; its F1 value reaches 95.28% [36]. Using the model in this article, however, the F1 value reaches 97.01%. Johnson proposed a comprehensive embedding that takes characters, words, and positions into account, has an effective structure, and captures useful information; on the MSRA dataset its F1 value reached 92.99%. Compared with the above models, our model performs best. Table 5 shows the test performance on the China People's Daily dataset. Collobert et al. used a feedforward neural network, combined with preprocessing, affix, and capitalization features, and achieved an F1 of 88.50% [38]; Lample et al. fed character-level word vectors into the BiLSTM-CRF model and achieved an F1 value of 90.08% [19]; and Chiu et al. combined BiLSTM with a CNN model.

Table 2: Dataset summary table.

Table 4: Experimental results on the MSRA dataset.

Table 5: Experimental results on the China People's Daily dataset.

Table 3: Comparison results between the model in this article and the basic models.