A Unified Model Using Distantly Supervised Data and Cross-Domain Data in NER

Named entity recognition (NER) systems are often realized by supervised methods that require large amounts of hand-annotated data. When hand-annotated data is limited, distantly supervised (DS) data and cross-domain (CD) data are usually used separately to improve performance. Distantly supervised data provides in-domain dictionary information, while cross-domain data provides hand-annotated out-of-domain information. These two types of information are complementary. However, two problems must be solved before using these data directly. First, distantly supervised data may contain a lot of noise. Second, directly using cross-domain data may degrade performance due to the distribution mismatch problem. In this paper, we propose a unified model named PARE (PArtial learning and REinforcement learning) that can simultaneously use distantly supervised data and cross-domain data as external data. The model uses the partial learning method with a new label strategy to better handle the noise in distantly supervised data, and the reinforcement learning method to alleviate the distribution mismatch problem in cross-domain data. Experiments on three datasets show that our model outperforms other baseline models. Moreover, our model can be used in the situation where no hand-annotated in-domain data is provided.


Introduction
Named entity recognition [1], as a fundamental natural language processing (NLP) task, has received significant attention.
An NER system tries to label each word in a sentence with predefined types, such as brand and product. The results of NER can be used in many downstream NLP tasks, such as coreference resolution [2], relation extraction [3], and question answering [4]. Supervised methods are often used to build NER systems [5][6][7]. However, supervised methods often require large amounts of annotated data, and obtaining hand-annotated data can be a very expensive and laborious process. When only limited hand-annotated data is accessible, distantly supervised data and cross-domain data are widely used as external resources [8][9][10][11]. Distantly supervised data is obtained by a dictionary matching method. An example is shown in Figure 1: the word "Samsung" is matched by the entity dictionary. More in-domain data can be obtained through distant supervision. Cross-domain data can provide complementary information: cross-domain learning methods use information from a high-resource domain to improve performance in a low-resource domain [10]. It is therefore natural to combine these two types of data.
However, there are some problems with directly using the two types of data. First, in distantly supervised data, when a word is not matched by the entity dictionary, the word has no clear label, and the matching method can introduce many errors. In Figure 1 (we use Chinese as a case study; for readability, the figure shows English translations instead of the original Chinese characters), the label of the word "mobile" should be product. Previous work uses a word-level method and a sentence-level method to alleviate these errors [8]: the word-level method applies partial learning, and the sentence-level method applies reinforcement learning. In the traditional partial learning method, when a word is not contained in the entity dictionary, its candidate labels are the whole label set. For example, the word "mobile" has 5 candidate labels ({"B-Brand," "I-Brand," "B-Product," "I-Product," "O"}). However, this traditional label strategy does not consider the label constraints of NER; for example, the label of "mobile" cannot be "I-Brand." Second, in cross-domain learning, parameter sharing is a widely used method [10,12]. However, directly sharing parameters may reduce performance due to distribution mismatch [13]. Previous works mainly focus on using different network architectures to capture private features and shared features [14]. In addition, these works cannot handle the situation where no hand-annotated in-domain data is provided.
In this paper, we explore a PARE model that can simultaneously use distantly supervised data and cross-domain data as external data. For distantly supervised data, we use a new label strategy that considers the label constraints of NER; the new label strategy reduces redundant labels in the partial learning method. For cross-domain data, the reinforcement learning method is used to alleviate the distribution mismatch problem in the parameter sharing method. We evaluate our PARE model on three datasets and show that it outperforms other baseline models. Besides, our PARE model can be used in the situation where no hand-annotated in-domain data is provided: using only distantly supervised data and cross-domain data as inputs, it achieves competitive results. The contributions of our work can be summarized as follows: (i) We propose the PARE model, which can use distantly supervised data and cross-domain data as external data simultaneously in Chinese NER. (ii) We propose an improved partial learning method for distantly supervised NER data, which better handles the noise in distant supervision. (iii) We use the reinforcement learning method to process cross-domain NER data, which alleviates the distribution mismatch problem in the parameter sharing method.

PARE Model
The architecture of the PARE model is shown in Figure 2. The model can be divided into three parts: the core NER part, the DS data selector part, and the CD data selector part. The improved partial learning method is used in the core NER part to reduce the noise from distant supervision. The DS data selector part and the CD data selector part are similar to those in [8,13].
The DS data selector removes noisy sentences through reinforcement learning. The CD data selector selects relevant sentences from the cross-domain dataset, which reduces the domain distribution mismatch. Through reinforcement learning, we can use distantly supervised data and cross-domain data as external data simultaneously.

Core NER.
The core NER is based on the LSTM-CRF model [7], which is widely used in named entity recognition. The core NER part contains three subparts: the embedding layer, the Bi-LSTM layer, and the CRF layer. The input of the model is a sentence x = (x_1, x_2, ..., x_n), where x_i is a word; x can come from hand-annotated data, distantly supervised data, or cross-domain data.
The output of the model is y = (y_1, y_2, ..., y_n), the label sequence of sentence x. To better process the distantly supervised data, a new label strategy is proposed in the core NER.

Embedding Layer.
We use the embedding layer as the first step of the neural network model. The embedding layer maps each word into a low-dimensional dense vector that captures the semantics of the word. The embedding vector e_i is obtained as

e_i = E(x_i; θ_e),  (1)

where x_i is the input word and θ_e is the embedding table. The embedding layer can also use an additional contextualized word embedding such as BERT [15]; in that case, e_i is concatenated with the output of BERT.
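As a rough sketch of the embedding lookup described above (the vocabulary, dimension, and helper name are hypothetical; a real system would load pretrained word2vec vectors):

```python
import random

# Toy sketch of the embedding lookup: each word x_i is mapped to a dense
# vector e_i via the embedding table theta_e. Vocabulary and dimension are
# hypothetical; a real system would load pretrained word2vec vectors.
random.seed(0)
vocab = {"<unk>": 0, "Samsung": 1, "mobile": 2}
dim = 4
theta_e = [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in vocab]

def embed(sentence):
    """Return [e_1, ..., e_n] for the words of a sentence."""
    return [theta_e[vocab.get(w, vocab["<unk>"])] for w in sentence]

vectors = embed(["Samsung", "mobile"])
print(len(vectors), len(vectors[0]))  # 2 4
```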

Feature Extractor.
The feature extractor uses the output of the word embedding. We use the Bi-LSTM as the feature extractor. The LSTM handles the gradient vanishing/exploding problems well, as shown in previous work [16]. A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The Bi-LSTM concatenates the forward LSTM output and the backward LSTM output to capture information from both contexts [17]. The output of the Bi-LSTM, h_i, can be represented as

h_i = Bi-LSTM(e_i; θ_l),  (2)

where θ_l denotes the parameters of the Bi-LSTM.

[Figure 1: An example of the improved label strategy in the partial learning method. The NER system tries to recognize brands and products in e-commerce data. The brown blocks are impossible tags. "B-Brand" means beginning of a brand, "I-Brand" means inside of a brand, "B-Product" means beginning of a product, "I-Product" means inside of a product, and "O" means outside of any entity.]

CRF.
We use Conditional Random Fields (CRF) to predict the label sequence, because the neighborhood information between labels can be considered in CRF. For example, the CRF learns that "I-Product" cannot follow "B-Brand." For a sentence x and one possible label sequence y, we define the score to be

score(x, y) = Σ_i A_{y_i, y_{i+1}} + Σ_i P_{i, y_i},  (3)

where A is the matrix of transition scores: A_{i,j} represents the score of a transition from tag i to tag j. P is the matrix of output scores, obtained by applying a feed-forward neural network to the Bi-LSTM output. The probability of the sequence y is

p(y | x) = e^{score(x, y)} / Σ_{y' ∈ A(y)} e^{score(x, y')},  (4)

where A(y) is the set of all candidate label sequences. The traditional CRF can process hand-annotated data and cross-domain data well because every word in those data has an explicit label.
For distantly annotated data, words matched by the dictionary have explicit labels. In the traditional label strategy, if a word is not in the entity dictionary, its candidate labels are the whole label set; for example, the word "mobile" has 5 candidate labels. The partial learning method can handle such label sets. For an input sentence x, there is a set of possible label sequences C(y), because every word may have several candidate labels. We compute the probability of the set C(y) by summing the probabilities of its member sequences:

p(C(y) | x) = Σ_{y ∈ C(y)} e^{score(x, y)} / Σ_{y' ∈ A(y)} e^{score(x, y')}.  (5)
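These CRF quantities can be sketched by brute force (hypothetical toy scores and a 3-tag set; real implementations use the forward algorithm and Viterbi decoding rather than enumeration):

```python
import math
from itertools import product

# Hypothetical 3-tag example: score(x, y) sums transition scores A and
# emission scores P; p(y | x) normalizes exp(score) over all sequences.
tags = ["B-Brand", "I-Brand", "O"]
n = 2  # sentence length
P = [{"B-Brand": 1.0, "I-Brand": 0.1, "O": 0.5},   # emission scores (toy)
     {"B-Brand": 0.2, "I-Brand": 1.2, "O": 0.4}]
A = {(a, b): 0.0 for a in tags for b in tags}       # transition scores (toy)
A[("B-Brand", "I-Brand")] = 1.0
A[("O", "I-Brand")] = -5.0  # a CRF can penalize unlikely transitions

def score(y):
    return (sum(P[i][t] for i, t in enumerate(y))
            + sum(A[(y[i], y[i + 1])] for i in range(len(y) - 1)))

all_seqs = list(product(tags, repeat=n))
Z = sum(math.exp(score(y)) for y in all_seqs)   # partition function
best = max(all_seqs, key=score)                 # brute-force "decoding"
p_best = math.exp(score(best)) / Z
print(best)  # ('B-Brand', 'I-Brand')
```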
However, the previous label strategy does not consider the label constraints of NER. The whole label set can introduce many redundant labels, and these redundant labels may harm model performance. In this paper, we use a new label strategy that removes impossible labels. The label constraints are as follows: (i) The first label of a sentence cannot be "I-XXX," where "XXX" represents an entity type such as "Brand." For example, the first word in Figure 1 cannot be labeled "I-Brand" or "I-Product." (ii) The label of punctuation is "O." (iii) The label "I-YYY" cannot follow the label "B-XXX" or "I-XXX," where "YYY" represents a different entity type. For example, the label of "mobile" cannot be "I-Product" in Figure 1.
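A minimal sketch of this pruning (the tag set, example sentence, and helper names are hypothetical; the real model applies the constraints inside the CRF lattice):

```python
import string

# Sketch of the improved label strategy: start from the full label set for
# unmatched words and prune labels that violate the three constraints.
TYPES = ["Brand", "Product"]
FULL = ["O"] + [p + "-" + t for t in TYPES for p in ("B", "I")]

def candidate_labels(words, dict_labels):
    """dict_labels[i] is the fixed label of a dictionary-matched word, else None."""
    cands = []
    for i, w in enumerate(words):
        if dict_labels[i] is not None:
            cands.append({dict_labels[i]})          # dictionary match: explicit label
            continue
        labels = set(FULL)
        if i == 0:                                  # (i) the start label cannot be I-XXX
            labels -= {"I-" + t for t in TYPES}
        if all(c in string.punctuation for c in w): # (ii) punctuation is "O"
            labels = {"O"}
        cands.append(labels)
    for i in range(1, len(cands)):                  # (iii) I-YYY needs B-YYY/I-YYY before it
        for t in TYPES:
            if "I-" + t in cands[i] and not cands[i - 1] & {"B-" + t, "I-" + t}:
                cands[i] = cands[i] - {"I-" + t}
    return cands

cands = candidate_labels(["Samsung", "mobile", "!"], ["B-Brand", None, None])
print(cands[1])  # "I-Product" is pruned; "I-Brand" may still follow "B-Brand"
```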
By using the new label strategy, the number of possible label sequences (|C(y)|) is reduced, and performance may increase. During training, the log probability of the correct sequence set, log p(C(y) | x), is maximized. During decoding, the output sequence is the label sequence with the highest score(x, y).
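The partial-learning objective can be sketched by brute force (hypothetical emission scores; transitions are omitted for brevity, and real implementations use a constrained forward algorithm instead of enumeration):

```python
import math
from itertools import product

# p(C(y) | x): sum exp(score) over sequences consistent with the candidate
# label sets, normalized over all sequences of the full tag set.
tags = ["B-Brand", "I-Brand", "O"]
emit = [{"B-Brand": 1.0, "I-Brand": 0.0, "O": 0.2},   # toy emission scores
        {"B-Brand": 0.1, "I-Brand": 0.8, "O": 0.3}]

def seq_score(y):
    return sum(emit[i][t] for i, t in enumerate(y))

cands = [{"B-Brand"}, {"I-Brand", "O"}]  # second word unmatched, pruned set
num = sum(math.exp(seq_score(y)) for y in product(*cands))
den = sum(math.exp(seq_score(y)) for y in product(tags, repeat=2))
p_partial = num / den          # maximized (via its log) during training
print(round(p_partial, 4))
```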

Data Selector.
The data selector uses the reinforcement learning method to select proper sentences. The overall training procedure is shown in Algorithm 1. In each epoch, we first select the cross-domain data and then select the distantly supervised data. Before the selection process, the hand-annotated data is merged with the distantly supervised data and the cross-domain data, respectively. To obtain more feedback, we divide the merged data into many random-size bags, and every sentence in a bag gets a state. During the selection process, for each sentence from the distantly supervised data, the DS data selector takes an action that decides whether to select the sentence; likewise, the CD data selector takes an action that decides whether to select each cross-domain sentence. Sentences from the hand-annotated dataset are always selected. After the selection process in a bag, different selections lead to different rewards. The DS reward serves as the feedback that updates the DS data selector, and the CD data selector is updated through the domain reward. The goal of each data selector is to maximize the reward of its actions. After the selection process over the whole dataset, the selected sentences are used to train the core NER, which in turn provides the new states and rewards for reinforcement learning. After training, the core NER can be used as the NER system, combining the information from distantly supervised data and cross-domain data.
Some details of the reinforcement learning method are as follows. We use the superscript t to distinguish the different data and data selectors: t ∈ {ds, do}, where ds and do denote distantly supervised and cross-domain, respectively.

State.
The state s^t is the input of the selector network. For the different data inputs, the model obtains the state in the same way. The state s^t of a sentence contains the following information: (a) the serialized feature representation, which is extracted by the Bi-LSTM; (b) the label score, which is equal to P in equation (3).

Policy Network.
For each state s^t, the action space a^t is {0, 1}: "1" means selecting the sentence, and "0" means not selecting it. For DS annotated data, the action a^ds_i is given by the DS data selector; for cross-domain data, the action a^do_i is given by the CD data selector. The data selector π^t_θ(s^t_i, a^t_i) is a multilayer perceptron with parameters θ^t. A logistic function is used as the policy function:

π^t_θ(s^t_i, a^t_i) = a^t_i σ(W^t s^t_i + b^t) + (1 − a^t_i)(1 − σ(W^t s^t_i + b^t)),  (6)

where W^t and b^t are the parameters of the selector and σ is the sigmoid function.
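A sketch of this logistic policy (the state dimension and weights are hypothetical; in the model the state comes from the Bi-LSTM features and label scores):

```python
import math
import random

# Logistic policy: select a sentence (action 1) with probability
# sigma(W s + b); pi(s, a) is the probability of the taken action.
random.seed(1)
dim = 4
W = [random.uniform(-0.5, 0.5) for _ in range(dim)]
b = 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def policy(s, a):
    p_select = sigmoid(sum(w * x for w, x in zip(W, s)) + b)
    return p_select if a == 1 else 1.0 - p_select

s = [0.2, -0.1, 0.4, 0.3]
action = 1 if policy(s, 1) >= 0.5 else 0  # greedy action for illustration
print(round(policy(s, 1) + policy(s, 0), 6))  # 1.0 (a proper distribution)
```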

Reward.
The reward is used to evaluate the quality of the selection. The reinforcement learning method for cross-domain data has a different reward from that for distant supervision because the goal of the selection is different: the DS data selector aims to select sentences with less noise, while the CD data selector aims to select sentences that are relevant to the hand-annotated data.
For distantly supervised data, when all the selections in the current merged data bag are finished, a delayed average reward is obtained. The selected distantly supervised sentences and the hand-annotated sentences are used to compute the reward:

r^ds = (1/|X^ds|) Σ_{x^ds ∈ X^ds} log p(C(y) | x^ds) + (1/|X^ha|) Σ_{x^ha ∈ X^ha} log p(y | x^ha),  (7)

where X^ds is the set of selected distantly supervised sentences, |X^ds| is the number of selected sentences, X^ha is the set of hand-annotated sentences, and |X^ha| is the number of hand-annotated sentences. log p(C(y) | x^ds) and log p(y | x^ha) are computed by equation (5).
The goal of the CD data selector is to select relevant sentences that fit the distribution of the target domain. For cross-domain data, we also obtain a delayed average reward. The reward is the negative distance between the selected domain data and the hand-annotated data. We use a simple distance in this paper; more complex and better distances will be explored in the future. The reward is set as

r^do = −D(X^do, X^ha),  (8)

where the negative sign means that the reward is high when the selected sentences are similar to the hand-annotated sentences. We use the maximum mean discrepancy to measure the distance:

D(X^do, X^ha) = ‖Φ(X^do) − Φ(X^ha)‖,  (9)

where Φ(X^do) is the element-wise average of the selected cross-domain sentences' states s^do and Φ(X^ha) is the element-wise average of the hand-annotated sentences' states s^ha.
Selector training: the policy gradient method [18] is used to optimize the policy network. The agents obtain the reward when all the actions have been taken. For each bag, the feedback of every action, r^t(a^t_i), is equal to the average reward r^t. The selector is updated as follows:

θ^t ← θ^t + α Σ_{i=1}^{|X^t|} r^t(a^t_i) ∇_{θ^t} log π^t_θ(s^t_i, a^t_i),  (10)

where |X^t| is the number of selected sentences from the distantly supervised data or the cross-domain data.
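The domain reward and the policy-gradient update can be sketched as follows (toy two-dimensional states; Phi is the element-wise average described above, and the update uses the log-derivative of a logistic policy):

```python
import math

# Phi(X): element-wise average of a bag of state vectors.
def phi(states):
    n, dim = len(states), len(states[0])
    return [sum(s[i] for s in states) / n for i in range(dim)]

# r_do = -||Phi(X_do) - Phi(X_ha)||: the reward is high when the selected
# cross-domain sentences are close to the hand-annotated sentences.
def domain_reward(selected_cd, hand):
    d = [a - b for a, b in zip(phi(selected_cd), phi(hand))]
    return -math.sqrt(sum(x * x for x in d))

# REINFORCE-style update: each action in the bag gets the bag's average
# reward, and theta moves along reward * grad log pi(s, a).
def update_selector(W, b, bag, reward, lr=0.01):
    for s, a in bag:                      # (state, action) pairs
        z = sum(w * x for w, x in zip(W, s)) + b
        p = 1.0 / (1.0 + math.exp(-z))    # probability of action 1
        g = (1.0 - p) if a == 1 else -p   # d/dz of log pi(s, a)
        W = [w + lr * reward * g * x for w, x in zip(W, s)]
        b += lr * reward * g
    return W, b

cd = [[0.1, 0.2], [0.3, 0.1]]             # toy selected cross-domain states
ha = [[0.1, 0.2]]                         # toy hand-annotated states
r = domain_reward(cd, ha)
W, b = update_selector([0.0, 0.0], 0.0, [(cd[0], 1), (cd[1], 0)], r)
print(round(r, 4))  # -0.1118
```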

No Hand-Annotated In-Domain Data.
Our model can be used in the situation where only distantly supervised data and cross-domain data are provided. The motivation is that cross-domain data can be used in place of in-domain hand-annotated data. The details are shown in Algorithm 2. First, we use the distantly supervised data to select cross-domain data, because the cross-domain data selection does not require hand-annotated in-domain data to obtain the reward. Then, we use the selected cross-domain data to select the distantly supervised data. Finally, we use the selected cross-domain data and the selected distantly supervised data to train the core NER part.

Datasets.
Three hand-annotated in-domain datasets are used to evaluate our methods: the e-commerce (EC) dataset, the news dataset, and the broadcast conversation (BC) dataset. The numbers of sentences are shown in Table 1. The EC dataset and the news dataset are from [8] (the only difference is that we consider three types of entities (Brand, Product, and Model) in the EC dataset, because the cross-domain EC data shares these label types). The cross-domain data for the EC dataset is the Taobao dataset in [19]. For the news data, the cross-domain data is the Chinese web text domain data from OntoNotes [20]. We also use a new dataset to evaluate our methods: for the BC data, we randomly select 3000 sentences as training data, 500 sentences as development data, and 1000 sentences as test data from the OntoNotes Chinese broadcast conversation dataset [20]. The distantly supervised data is obtained from the rest of the OntoNotes Chinese broadcast conversation dataset; the construction method is similar to that in [8]. The cross-domain data is from the OntoNotes Chinese web data [20], from which 6000 cross-domain sentences are used.

Parameter Settings.
Pretrained embeddings have been shown to be helpful in previous works [7]. In the core NER, the embeddings are pretrained using word2vec [21] on a large user-generated text corpus. The embedding dimension is 100, and the dimension of the LSTM is set to 100. The optimization method is RMSprop [22] with an initial learning rate of 0.001. The dropout rate is 0.2, the minibatch size is 128, and the maximum number of epochs is 500. For the DS data selector and the CD data selector, the dimension of the hidden layer in the multilayer perceptron is 100, the optimization method is Adam [23], and the initial learning rate is 0.001.

Results.
For evaluation, the entity-level metrics of precision (P), recall (R), and F1 score are used in this paper.
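Entity-level scoring counts an entity as correct only when both its boundary and its type match exactly. A minimal sketch over BIO tags (for simplicity, a stray I- tag with no matching B- is treated as O here):

```python
# Decode BIO tags into (type, start, end) spans, then count exact matches.
def spans(tags):
    out, start, typ = [], None, None
    for i, t in enumerate(tags + ["O"]):       # sentinel flushes the last span
        if t.startswith("B-") or t == "O" or (t.startswith("I-") and typ != t[2:]):
            if typ is not None:
                out.append((typ, start, i))
            if t.startswith("B-"):
                typ, start = t[2:], i
            else:
                typ, start = None, None        # stray I- treated as O
    return set(out)

def f1(gold, pred):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)                            # exact span-and-type matches
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["B-Brand", "O", "B-Product", "I-Product"]
pred = ["B-Brand", "O", "B-Product", "O"]
print(f1(gold, pred))  # 0.5: one of the two gold entities matched exactly
```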
Input: hand-annotated data, cross-domain data, and distantly supervised data
Output: trained PARE model
(1) for each epoch do
(2)   Merge hand-annotated data and cross-domain data
(3)   Divide the merged data into many small bags (bag1s)
(4)   for each bag1 in bag1s do
(5)     for each sentence in bag1 do
(6)       Obtain the sentence state s^t
(7)       if the sentence is in the hand-annotated data then
(8)         a^t = 1
(9)       else
(10)        Select cross-domain data through π^t_θ(s^t_i, a^t_i)
(11)      end if
(12)    end for
(13)    Obtain reward r^do
(14)    Optimize the CD data selector through equation (10)
(15)  end for
(16)  Merge hand-annotated data and distantly supervised data
(17)  Divide the merged data into many small bags (bag2s)
(18)  for each bag2 in bag2s do
(19)    for each sentence in bag2 do
(20)      Obtain the sentence state s^t
(21)      if the sentence is in the hand-annotated data then
(22)        a^t = 1
(23)      else
(24)        Select distantly supervised data through π^t_θ(s^t_i, a^t_i)
(25)      end if
(26)    end for
(27)    Obtain reward r^ds
(28)    Optimize the DS data selector through equation (10)
(29)  end for
(30)  Train the core NER using the selected data
(31) end for
ALGORITHM 1: The training procedure of the PARE model.

We compare our model with the following baselines:
(i) LSTM-CRF (Hand) [7]: the model uses only the hand-annotated in-domain data as input.
(ii) DS-PA-RL (Hand + DS) [8]: the model uses distantly supervised data as external data. The traditional partial learning method is used to process the distantly supervised data, and the reinforcement learning method is used to delete the noisy sentences. The code is the same as in [8].
(iii) CD-SHA (Hand + CD) [24]: the model uses cross-domain data as external data. We mix the cross-domain data with the hand-annotated in-domain data as input and share all model parameters during training.
(iv) CD-ADV (Hand + CD) [14]: the model uses cross-domain data as external data. An adversarial network is used to separate the private and shared information between the target domain and the source domain.
(v) BERT-CRF (Hand) [15]: the model uses the hand-annotated in-domain data as input. BERT is used as a contextual embedding encoder, and a CRF is used as the decoder.
(vi) SoftLexicon (Hand) [25]: the model uses the hand-annotated in-domain data as input and captures lexicon information through a segmentation label set.
(vii) SCDL (Hand + DS) [26]: the distantly supervised data is used as external data, and a self-collaborative denoising learning method is used to handle label noise in the distantly supervised data.
(viii) Multicell (Hand + CD) [27]: the cross-domain data is used as external data, and different label types are processed by different cells in the Multicell LSTM.
Four systems can be built based on our method: DS-IPA-RL (Hand + DS), CD-RL (Hand + CD), PARE (Hand + DS + CD), and PARE-BERT (Hand + DS + CD, with BERT embeddings).

Input: cross-domain data and distantly supervised data
Output: trained PARE model
(1) Merge distantly supervised data and cross-domain data
(2) for each epoch do
(3)   Divide the merged data into many small bags (bag1s)
(4)   for each bag1 in bag1s do
(5)     for each sentence in bag1 do
(6)       Obtain the sentence state s^t
(7)       if the sentence is in the distantly supervised data then
(8)         a^t = 1
(9)       else
(10)        Select cross-domain data through π^t_θ(s^t_i, a^t_i)
(11)      end if
(12)    end for
(13)    Obtain reward r^do
(14)    Optimize the CD data selector through equation (10)
(15)  end for
(16)  Merge the selected cross-domain sentences and the distantly supervised data
(17)  Divide the merged data into many small bags (bag2s)
(18)  for each bag2 in bag2s do
(19)    for each sentence in bag2 do
(20)      Obtain the sentence state s^t
(21)      if the sentence is in the cross-domain data then
(22)        a^t = 1
(23)      else
(24)        Select distantly supervised data through π^t_θ(s^t_i, a^t_i)
(25)      end if
(26)    end for
(27)    Obtain reward r^ds
(28)    Optimize the DS data selector through equation (10)
(29)  end for
(30)  Train the core NER using the selected data
(31) end for
ALGORITHM 2: The training procedure with no in-domain hand-annotated data.

The results of the models are shown in Table 2. We first analyze the models using traditional word embeddings. From Table 2, the models using external data yield improvements over the LSTM-CRF model that uses only hand-annotated data as input. This shows that using external data can be very helpful when hand-annotated data is limited. Different ways of using external resources lead to different effects. The DS-IPA-RL model outperforms the DS-PA-RL model, showing that the new label strategy is helpful in the partial learning method. The CD-ADV model obtains worse results than the CD-SHA model; the reason may be that the CD-ADV model cannot exploit CRF information sharing when the output labels of the source domain and the target domain are the same.
The CD-RL model achieves the best performance among the cross-domain models, and the PARE model achieves the best performance of all models. This indicates that our PARE model can utilize different external data well.
Pretrained language models (PLMs) have achieved competitive performance on the NER task, and our model is orthogonal to PLMs. The results of the models using BERT embeddings are shown in Table 2. We observe that our model also improves performance in the BERT setting, and PARE-BERT outperforms the other baseline models that use BERT. However, the improvement is smaller compared with the models without BERT. The reason may be that the language model already contains much information, so the information gain brought by the unified model is relatively small.
To better analyze the different parts of the model, we evaluate its components separately. First, we analyze the effect of the new label strategy in the partial learning method. Then, we evaluate the reinforcement learning method in cross-domain learning. Finally, we show the results in the situation where no hand-annotated data is provided.

Improved Partial Learning.
We compare the new label strategy with two other label strategies, without considering the influence of reinforcement learning. The first is the FA label strategy: for a word that is not contained in the entity dictionary, the label is "O." To utilize the FA data, we use the simple LSTM-CRF model [7]. The second is the PA label strategy: for a word that is not contained in the entity dictionary, the candidate labels are the whole label set. We use the LSTM-CRF-PA model to utilize the PA data [8]. The results are shown in Table 3. The results show that the models using external data do not always obtain better results than the model using only hand-annotated data, and different label strategies achieve different results. The model using the IPA label strategy obtains the best results among the models using hand-annotated and distantly supervised data. These facts show that the label strategy is very important when distantly supervised data is used, and that our new label strategy is more effective than the other label strategies.

Cross-Domain Learning.
In Table 4, we explore the performance of the model compared with the CD-SHA model. The CD-SHA model uses different amounts of randomly selected domain data. From the table, the CD-RL model outperforms the CD-SHA model using all domain data. This indicates that the reinforcement learning method can select relevant sentences and reduce negative transfer. The table also shows that the CD-RL model achieves the best results compared with the CD-SHA model using any amount of domain data. These facts show that reinforcement learning selection is better than random selection.

No Hand-Annotated In-Domain Data.
Our PARE model can be extended to the situation with no hand-annotated data. The model uses only distantly supervised data and cross-domain data as input, which greatly reduces the time of manual annotation. The training method is shown in Section 2.3. We compare our methods with three baselines:
(i) LSTM-IPA: the model uses only distantly supervised data as input. The improved label strategy is used to process the distantly supervised data.
(ii) SHA: the model uses cross-domain data as input, and we directly transfer the model from the source domain to the target domain.
(iii) SHA-IPA: the model uses distantly supervised data and domain data as input. The model is the same as that in [8], and we use domain data instead of hand-annotated in-domain data.
The results on the three datasets are shown in Table 5. From the table, we find that directly using distantly supervised data achieves very low performance. The reason may be that the data obtained only from the distantly supervised method contains a lot of noise, especially low coverage.
The results of the SHA model are slightly better than those of the LSTM-IPA model, and SHA-IPA achieves the best results among all models. These facts show that our methods work well in the situation where no hand-annotated data is provided.

Error Analysis
To show where our PARE model outperforms the baseline model, we perform some error analysis. We consider five error types: (i) Containing: the gold entity boundary contains the predicted entity boundary. (ii) Be-contained: the gold entity boundary is contained by the predicted entity boundary. (iii) Cross: the gold entity boundary partially overlaps the predicted entity boundary. (iv) Type-error: the gold entity boundary is the same as the predicted entity boundary, but the gold entity has a different type from the predicted entity. (v) No-cross: there is no boundary overlap between the gold entity and the predicted entity; that is, the model predicts an unlabeled entity or misses a gold entity.
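Given gold and predicted entities as (type, start, end) spans (end exclusive; the helper and example data are hypothetical), the five error types can be sketched as:

```python
# Classify one predicted span against the gold spans; returns None when the
# prediction is fully correct (same boundary and same type).
def error_type(pred, golds):
    pt, ps, pe = pred
    for gt, gs, ge in golds:
        if (gs, ge) == (ps, pe):
            return None if gt == pt else "type-error"
        if gs <= ps and pe <= ge:
            return "containing"        # gold contains predicted
        if ps <= gs and ge <= pe:
            return "be-contained"      # predicted contains gold
        if ps < ge and gs < pe:
            return "cross"             # boundaries partially overlap
    return "no-cross"                  # no overlap with any gold entity

golds = [("Product", 2, 5)]
print(error_type(("Product", 2, 4), golds))  # containing
print(error_type(("Brand", 2, 5), golds))    # type-error
print(error_type(("Product", 7, 8), golds))  # no-cross
```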
Some cases from the EC dataset are shown in Figure 3, and the rates of the different error types for the LSTM-CRF model are shown in Figure 4. The main error comes from the no-cross type, accounting for 40%, so it can be a good choice for a model to focus on no-cross errors. In Figure 5, we show the performance of the PARE model across the different error types: the model incorporating distantly supervised data and cross-domain data greatly reduces the number of no-cross errors.

Related Work
Recently, most NER systems are based on supervised methods, such as the conditional random field model and neural network models. The conditional random field model relies heavily on hand-crafted features [5,28], while the features of the sequence can be extracted automatically by neural network models [6,7,25,[29][30][31]. Recently, methods using pretrained language models have become the mainstream for named entity recognition: pretrained language models [15,[32][33][34] can better capture the context information of the current token. In this paper, we build our models based on the LSTM-CRF model [7] and BERT [15], which are prevalent neural network models.
In the named entity recognition task, distantly supervised data is a widely used source of external data [8]. Many methods focus on reducing the noise in distantly supervised NER data [8,9,19,35]. Yang et al. introduced a partial learning method to process distantly supervised data [8]. Peng et al. explored a positive-unlabeled method to process distantly supervised NER data in English [9]. Jie et al. proposed a cross-training model that can handle incomplete annotations [19]. A structural causal model is introduced to solve the dictionary bias problem in [35]. Zhang et al. proposed a self-collaborative denoising learning method, which jointly trains two teacher-student networks in a mutually beneficial manner to iteratively refine noisy labels [26]. However, none of these methods considers the NER label constraints. In this paper, we build our model based on [8] and propose a new label strategy that uses the NER label constraints.
Parameter sharing is a widely used method in cross-domain learning [10,12,27,36]. Directly transferring cross-domain information may lead to performance decline [37,38], and data selection can prevent negative transfer from irrelevant sentences [39,40]. Besides, Jia and Zhang proposed the Multicell Compositional LSTM for domain adaptation [27]. Motivated by [13], we introduce cross-domain data selection to NER and use distantly supervised data as additional external data.
In recent years, the reinforcement learning method has been widely used in natural language processing [8,13,41]. Feng et al. used instance selection as a reinforcement learning process in relation classification [41]. Yang et al. used the reinforcement learning method to select sentences from distantly supervised data in NER [8]. Liu et al. used the reinforcement learning method for domain adaptation in POS tagging, dependency parsing, and sentiment analysis [13]. In this paper, we explore a PARE model that uses distantly supervised data and cross-domain data simultaneously through the reinforcement learning method.

Conclusion
In this paper, we propose a PARE model that can simultaneously use distantly supervised data and cross-domain data as external data in NER. In the PARE model, we first introduce a new label strategy into traditional partially annotated learning; the strategy reduces redundant label paths and improves performance. We then introduce reinforcement learning to reduce the noise in distantly supervised data and the distribution differences in cross-domain data. Finally, the model can be used in the situation where no hand-annotated data is provided. Experiments on three datasets show that our method works well.

Conflicts of Interest
The authors declare no conflicts of interest.