A Multitask Deep Learning Framework for DNER

Over the years, the explosive growth of drug-related text has created a heavy workload for manual data processing, yet the domain knowledge hidden in this text is believed to be crucial to biomedical research and applications. In this article, the multi-DTR model, which accurately recognizes drug-specific names by jointly modeling DNER and DNEN, was proposed. Character features were extracted from the input text by CNN, and context-sensitive word vectors were obtained using ELMo. Next, these features, together with pretrained biomedical word embeddings, were fed into BiLSTM-CRF, and the output labels of the two tasks were exchanged to update the task parameters so that DNER and DNEN support each other. The proposed method achieved better performance on the DDI2011 and DDI2013 datasets.


Introduction
The rapid development of biomedicine and the exponential growth of publications have made it hard to extract the large amount of drug-related information they contain. Extracting valuable information is essential if we want to make the best of medical text. Medicines are a class of chemical substances that are highly associated with biological research, so it is of vital significance to accurately capture the entity information contained in medical text. A drug may be referred to by a chemical name, a generic term, or a brand name. A chemical name is usually complex, and a brand name may no longer uniquely identify a drug once the relevant patents expire. For example, the drug "quetiapine" is associated with the brand name "Seroquel XR." Therefore, a special generic term, which needs to be explicitly defined for drug approval, should be designed for standard scientific reports and labels. Drug-specific names are subject to tight control by WHO (World Health Organization) and some organizations in the USA and elsewhere. For example, the European Medicines Agency (EMA) finalized a naming scheme fit to drug function for ease of pronunciation and translation and developed some criteria that differentiate a drug name from others so as to avoid any transcription and replication error in the R&D process [1]. This would justify the automatic extraction of potential medical information from massive biomedicine-related publications as a crucial part of biomedical research and industrial medicine manufacturing.
Drug-Named Entity Recognition (DNER), which is intended to identify the drugs referred to in unstructured drug texts, is an underlying task of recognizing the span and type of named entities belonging to predefined semantic types. Unlike ordinary Named Entity Recognition (NER), DNER generally involves long label sequences and plenty of synonyms and alternate spellings of entities, which makes drug dictionaries inefficient and entity boundaries hard to detect. In this regard, Drug-Named Entity Normalization (DNEN) is also believed to be a crucial task.
DNEN, which is intended to map the entities recognized by DNER to a controlled vocabulary, is usually considered a task subsequent to DNER. Both DNEN and DNER can be deemed sequence labeling problems. Figure 1 illustrates an example of the DNER and DNEN tasks: the input text contains the drug-specific name "Omeprazole" and the R&D organization "Astra Pharmaceuticals", and the output is the label of each word in the text together with its entity ID.
As the naming scheme, evaluation criteria, and cross-border synchronization have been developing dynamically for many years, and there is no definitive dictionary or grammar applicable to drug names, DNER and DNEN processes are subject to many challenges: (1) the rapid updates of drug-related knowledge make it hard for a handmade dictionary to meet actual needs; (2) language tends to be complex and there is a scarcity of high-quality labeled texts; (3) modeling DNER and DNEN separately does not allow the two processes to support each other.
It is intended that the proposed model can capture more resourceful semantic features and identify the representation of polysemous and ambiguous words in drug sequence, thus accurately recognizing drug names. A multitask deep learning model multi-DTR (Multi-Drug Tip Recognition) was proposed, and the principal contributions of this work were that text information can be exploited by extracting the character-level representations of words, embedding words based on biomedicine pretraining, and extracting the features by context-sensitive word embedding after ELMo (Embeddings from Language Models) training. To make the best of the training data, a multitask learning strategy was taken, which allows for the explicit feedback of DNER and DNEN and makes different tasks support each other.
This article is structured as follows: in Section 2, some related works on DNER and DNEN are presented; in Section 3, the proposed neural network framework is described; in Section 4, relevant datasets and parameter setups are briefed; in Section 5, the results of the assessment are reported in detail; in Section 6, a conclusion is drawn.

Related Works
NER is one of the underlying tasks in NLP, but there are a limited number of related works on DNER [2][3][4]. The access to some large-scale biomedical corpora [5][6][7] has enabled some generic NER models to be widely used in DNER. Common methods applicable to DNER can be roughly categorized into rule-based methods [8], dictionary-based methods [9], and machine learning-based methods [10]. Rule-based methods require considerable manual labor to lay down rules, yet they overlook the ambiguity and variability of terms. If the target text is complex, rule-based methods show a low recognition rate [11]. Tsuruoka et al. [12] made use of logistic regression to learn string similarity measures from the dictionary and performed soft character matching to avoid the missed associations caused by exact string matching. Hettne et al. [13] developed a rule-based method for term filtering and disambiguation, then merged dictionaries to recognize small molecules and drugs contained in the text. Eriksson et al. [14] created a Danish dictionary to recognize Adverse Drug Events (ADE) that may potentially occur in unstructured clinical narrative text. Despite this, actual application needs can hardly be met due to the lack of dictionaries and the rapid update of biomedical terms. Machine learning-based NER is currently a prevailing research interest. Cocos et al. [15] used an RNN coupled with pretrained word embedding to recognize ADE on Twitter. Zeng et al. [16] performed automatic learning of word- and character-level features in drug texts on an LSTM-CRF (Long Short-Term Memory-Conditional Random Field) structure. To date, BERT (Bidirectional Encoder Representations from Transformers) [17] is one of the most popular models in Natural Language Processing (NLP).
In the case of BERT, a transformer encoder was used and the upper and lower layers of the model are fully connected by a self-attention mechanism so that text information can be better processed. Lee et al. [18] ran a large-scale pretraining in respect of BERT (treated as a basic model) on PubMed and PMC and then developed the BioBERT (Biomedical Bidirectional Encoder Representations from Transformers) model. Despite the extraordinary properties, this model caused an enormous consumption of hardware resources in the training process.
DNEN, also a key part of information extraction, is generally listed as a subtask [19,20] in some biomedicine-related NLP assessment tasks. Kang et al. [21] normalized disease-specific names by constructing a symptom text model and performing a comparative analysis. Lee et al. [22] used a dictionary to look up and standardize the entity. Lou et al. [23] proposed a transition-based model for joint disease entity recognition and normalization, but this model relies heavily on handmade features and task types.

Neural Network Framework
In this article, the character feature representations of an input word (e.g., amidopyrine, aminophenazone, and aminopyrine) were extracted through Convolutional Neural Networks (CNN). Next, the extracted character features and the word embeddings were input to BiLSTM (Bidirectional Long Short-Term Memory). The bidirectional LSTM (Long Short-Term Memory) was used to capture two separate hidden states (forward and backward) of each sequence, obtain the context-sensitive information, and then concatenate the two hidden states to generate the final output. In the final step, the output vector of BiLSTM was fed to CRF for jointly modeling the label sequence. Through the outputs of the two tasks, DNER and DNEN can give feedback to each other, which reduces the load of calculations and realizes a mutual enhancement effect.

Embedded Layer.
For deep mining of drug-related information in the input text, the features were extracted by pretrained word embedding, context-sensitive word embedding, and character embedding.

Pretrained Word Embedding.
The rapid development of deep learning technology has led to an extensive use of word embedding, which offers a numerical representation of text (such as Word2Vec [24] and GloVe [25]). Yu et al. [26] found that word embeddings pretrained on unlabeled data significantly improve many NLP tasks. Inspired by GloVe [25], we used the word representation method based on global word frequency statistics to pretrain word vectors on the PMC (PubMed Central) and PubMed biomedical corpora and fed the pretrained word vectors into the model.

Character Representation.
Evidence has shown that character information is crucial for sequence labeling tasks [16,27]. Collobert et al. [28] suggested that whole words can be used as the units to be labeled, with local features extracted by CNN used to construct the feature vectors. Ling et al. [29] used a character-level bidirectional LSTM for POS labeling, and their experiments indicate that its performance closely resembles that of CNN at a heavier computational cost. Santos et al. [30] were the first to suggest using CNN to learn character-level representations of words and to associate them with ordinary word representations. A number of subsequent works [31,32] supported that word-level information (such as prefixes and suffixes) can be exploited by character-based word representation. Zhao et al. [33] exploited attention-based CNN to capture the association between context-sensitive information and discontinuous words. Strubell et al. [34] proposed ID-CNN (Iterated Dilated Convolutional Neural Network), a dilated CNN architecture that improves computational efficiency. Chiu et al. [35] used CNN to extract character vectors of a specific length from the characters of each word, concatenate them with the encoded features, and then pass them through the convolutional layer and the max layer.
In this article, CNN was used to acquire the character-level representation of a word. As is seen from Figure 2, the feature encoding step of Chiu et al. [35] was removed and a Dropout layer was added to prevent overfitting of CNN, which finally yields a word-specific character vector.
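The character-level convolution followed by a max layer can be illustrated with a minimal pure-Python sketch. The embedding dimension, filter width, number of filters, and random initialization below are illustrative assumptions rather than the settings of the proposed model, and the Dropout layer is omitted for brevity.

```python
import random

def build_char_embeddings(alphabet, dim, rng):
    # Randomly initialized character vectors (illustrative values only).
    return {ch: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for ch in alphabet}

def char_cnn(word, char_emb, filters, width=3):
    """Convolve over character windows, then max-pool over positions."""
    dim = len(next(iter(char_emb.values())))
    zero = [0.0] * dim
    pad = width // 2
    # Unknown characters fall back to the zero vector in this sketch.
    rows = [zero] * pad + [char_emb.get(ch, zero) for ch in word] + [zero] * pad
    features = []
    for filt in filters:  # each filter holds width x dim weights
        responses = []
        for start in range(len(rows) - width + 1):
            window = rows[start:start + width]
            responses.append(sum(w * x
                                 for w_row, x_row in zip(filt, window)
                                 for w, x in zip(w_row, x_row)))
        features.append(max(responses))  # max over all window positions
    return features

rng = random.Random(0)
emb = build_char_embeddings("abcdefghijklmnopqrstuvwxyz", dim=4, rng=rng)
filters = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
           for _ in range(5)]
vec_a = char_cnn("aminopyrine", emb, filters)
vec_b = char_cnn("amidopyrine", emb, filters)
```

Each word is mapped to a fixed-length vector (one value per filter) regardless of its length, which is what allows the character vector to be concatenated with the word embeddings.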

Context-Sensitive Word Embedding.
ELMo, a feature-based language model, can model words given their context. Unlike Word2Vec and other word vectors that use a simple lookup table to obtain a unique representation, the word vector in ELMo is a function of the internal network state, so even for the same word, the word vector changes dynamically with context. ELMo first adopts a bidirectional LSTM for pretraining, and its bidirectional concept is reflected in the network structure, which comprises a forward LSTM model and a backward LSTM model. The construction of the model is shown in Figure 3.
ELMo comes with a task attribute and is a linear combination of the intermediate layer representations of the biLM. With respect to a given word $k$, a biLM of $L$ layers yields $2L + 1$ representations:

$$R_k = \left\{ x_k^{LM},\ \overrightarrow{h}_{k,j}^{LM},\ \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \right\}, \qquad \mathrm{ELMo}_k = \sum_{j=0}^{L} w_j\, h_{k,j}^{LM},$$

where $h_{k,0}^{LM} = x_k^{LM}$, $h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]$ for $j \geq 1$, $w$ denotes the softmax-normalized weights, $x_k^{LM}$ denotes the initial input word vector, $\overrightarrow{h}_{k,j}^{LM}$ denotes the forward LSTM output, and $\overleftarrow{h}_{k,j}^{LM}$ denotes the backward LSTM output. The context-sensitive dynamic word embedding obtained in this way can more accurately reflect the complex semantic and grammatical features of the text.
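The softmax-weighted combination of biLM layers can be sketched as follows. This is a toy illustration and not the pretrained ELMo model: the layer vectors are made-up numbers, and the scaling factor `gamma` comes from the commonly used ELMo formulation and is an assumption here (it is left at 1.0).

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of raw scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layers, raw_weights, gamma=1.0):
    """Combine L+1 layer vectors of one token with softmax-normalized weights.

    layers[0] is the input token vector; layers[1:] are biLM layer outputs
    (forward and backward states are assumed to be concatenated already).
    """
    weights = softmax(raw_weights)
    dim = len(layers[0])
    return [gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
            for d in range(dim)]

layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy values for L = 2
mixed = elmo_mix(layers, raw_weights=[0.0, 0.0, 0.0])
```

With equal raw weights the result is the plain average of the layers; during training the raw weights would be learned so that each task can emphasize the layers it finds most useful.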

Sequence Labeling.
Some deficiencies of the character-level model include the multiplied effective sequence length and a lack of inherent meaning in the characters. RNN can process sequence data of any length using neurons with self-feedback. However, it was reported [36] that, in practice, RNN is usually biased toward the most recent inputs of the sequence and cannot process long-term dependencies. Certain variants of recurrent neural networks, such as Gated Recurrent Unit (GRU) and LSTM, have shown extraordinary performance. Yang et al. [37] used GRUs at the character and word levels to encode morphological and context-sensitive information. Huang et al. [38] were the first to use BiLSTM for sequence labeling, and their results showed that this model is less dependent on word embedding and can robustly capture the two hidden states (forward and backward) of each sequence.
Both DNER and DNEN can be seen as sequence labeling tasks. In this work, BiLSTM was used to model the input character-level information, pretrained word embedding, and contextualized word embedding. It takes a vector sequence of $n$ words $(x_1, x_2, \ldots, x_n)$ as input, calculates the hidden state sequence $(h_1, h_2, \ldots, h_n)$, and outputs the labels $(o_1, o_2, \ldots, o_n)$. The LSTM unit is updated as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i),\\
c_t &= (1 - i_t) * c_{t-1} + i_t * \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c),\\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o),\\
h_t &= o_t * \tanh(c_t),
\end{aligned}
$$

where $\sigma$ is the elementwise sigmoid function, $*$ is the elementwise product, $x_t$ denotes the input vector at time $t$, $h_t$ is the hidden vector (also referred to as the "output vector"), $i_t$ denotes the value of the memory gate, $c_t$ denotes the cell state, $o_t$ denotes the value of the output gate, $W_{xi}$, $W_{xc}$, and $W_{xo}$ denote the weight matrices of the different gates for the input $x_t$, $W_{hi}$, $W_{hc}$, and $W_{ho}$ are the weight matrices for the hidden state $h_{t-1}$, and $b_i$, $b_c$, and $b_o$ denote the offset vectors. The final output vector can then be obtained.
After the training of BiLSTM, the entity labels of unlabeled words can be predicted from the output $h_t$. In the DNER task, however, some impossible combinations may appear in the predicted labels. For example, the label "I-BRAND" must not immediately follow the label "B-DRUG", which means that the labels of neighboring words have to be considered. CRF is an undirected graphical model that scores labels at the sentence level instead of at each position independently. Therefore, it can rule out such impossible combinations.
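The label constraint mentioned above can be made concrete with a small check. The sketch below assumes the usual BIO convention, in which an I- label may only continue an entity of the same type; the function names are illustrative.

```python
def is_valid_transition(prev_label, label):
    """An I-X label may only follow B-X or I-X of the same entity type X."""
    if not label.startswith("I-"):
        return True  # O and B-X may follow any label
    entity_type = label[2:]
    return prev_label in ("B-" + entity_type, "I-" + entity_type)

def is_valid_sequence(labels):
    prev = "O"  # the sentence start is treated like an outside label
    for label in labels:
        if not is_valid_transition(prev, label):
            return False
        prev = label
    return True

ok = is_valid_sequence(["B-DRUG", "I-DRUG", "O"])
bad = is_valid_sequence(["B-DRUG", "I-BRAND"])
```

A CRF layer does not hard-code such rules; it learns transition scores that make invalid label pairs very unlikely.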
With respect to the input sequence $Y = \{y_1, y_2, \ldots, y_n\}$, $y_i$ denotes the $i$th input word vector, $Z = \{z_1, z_2, \ldots, z_n\}$ is the label sequence of the input sequence $Y$, and $P$ is the score matrix output by BiLSTM. $P$ is of size $n \times k$, where $k$ denotes the number of distinct labels and $P_{i,j}$ denotes the score of the $j$th label of the $i$th word. The score of a label sequence can be defined as follows:

$$s(Y, Z) = \sum_{i=0}^{n} A_{z_i, z_{i+1}} + \sum_{i=1}^{n} P_{i, z_i},$$

where $A$ is the transition score matrix, $A_{i,j}$ denotes the transition score from the label $i$ to the label $j$, and $z_0$ and $z_{n+1}$ are the start and end labels of a sentence. They are added to the set of possible labels. Thus, $A$ is a square matrix of size $k + 2$.
The loss function of CRF is composed of the actual path score and the total score of all possible paths; both scores are given as follows:

$$P_{\mathrm{RealPath}} = e^{s(Y, Z)}, \qquad P_{\mathrm{total}} = \sum_{\tilde{Z}} e^{s(Y, \tilde{Z})},$$

where $e^{s(Y, \tilde{Z})}$ denotes the score of a possible path in which the label sequence $\tilde{Z}$ is generated for the words $Y$, and $e$ is the base of the natural logarithm. In the training course, the log probability of the correct label sequence is maximized:

$$\log P(Z \mid Y) = \log \frac{P_{\mathrm{RealPath}}}{P_{\mathrm{total}}}. \tag{5}$$

The loss function of CRF is computed by formula (5), where $x_{i y_i}$ denotes the emission score of the label $y_i$ for the word at index $i$ and $t_{y_i y_{i+1}}$ denotes the transition score from the label $y_i$ to the label $y_{i+1}$ (here $y_i$ indexes the label of the $i$th word). Then, we can search for the optimal path under the Markov assumption using the Viterbi algorithm.
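The path score, the total score over all paths, and Viterbi decoding can be illustrated with a small pure-Python sketch. All numbers below are made up for illustration, the `start` and `end` vectors stand in for the rows and columns of the transition matrix that belong to the added start and end labels, and the total score is computed by brute-force enumeration, which is only feasible for toy inputs (a practical implementation would use the forward algorithm).

```python
import math
from itertools import product

def path_score(emissions, transitions, start, end, path):
    """Sum of emission scores and transition scores, including start/end."""
    score = start[path[0]] + end[path[-1]]
    score += sum(emissions[i][z] for i, z in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score

def log_partition(emissions, transitions, start, end):
    """Log of the total score of all paths, by enumeration (toy sizes only)."""
    n, k = len(emissions), len(emissions[0])
    scores = [path_score(emissions, transitions, start, end, p)
              for p in product(range(k), repeat=n)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def viterbi(emissions, transitions, start, end):
    """Return the highest-scoring label path and its score."""
    n, k = len(emissions), len(emissions[0])
    best = [start[j] + emissions[0][j] for j in range(k)]
    back = []
    for i in range(1, n):
        step, new_best = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + transitions[p][j])
            step.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[i][j])
        back.append(step)
        best = new_best
    last = max(range(k), key=lambda j: best[j] + end[j])
    path = [last]
    for step in reversed(back):  # follow back-pointers to the first word
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last] + end[last]

LABELS = ["O", "B-DRUG", "I-DRUG"]
NEG = -1e4  # large penalty standing in for a forbidden transition
transitions = [[0.0, 0.0, NEG],   # from O
               [0.0, 0.0, 0.5],   # from B-DRUG
               [0.0, 0.0, 0.3]]   # from I-DRUG
start = [0.0, 0.0, NEG]           # a sentence cannot start with I-DRUG
end = [0.0, 0.0, 0.0]
emissions = [[0.2, 1.5, 0.1], [0.1, 0.3, 1.0], [1.2, 0.1, 0.4]]

best_path, best_score = viterbi(emissions, transitions, start, end)
log_likelihood = best_score - log_partition(emissions, transitions, start, end)
```

Here `log_likelihood` is the log probability of the decoded path under this toy model; training would maximize the same quantity for the gold label path instead.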

Multitask Learning Strategy.
Multitask Learning (MTL) is a kind of joint learning through which the differences and connections between tasks can be effectively analyzed and modeled. Hard sharing, soft sharing, and hierarchical sharing are currently the structures most used by MTL. Hard sharing stacks a given task on top of the sharing layer [39]. Soft sharing supports each task with separate models and parameters, and the internal information contained in each model can be accessed [40], but it may also lead to the inefficiency of parameters. Hierarchical sharing puts different tasks in different network layers [41], but it relies on a handmade hierarchical shared structure. For DNER, since the same entity has a number of synonyms and various forms of representation, exact matching or fuzzy matching as dictionary lookup methods may make it very difficult to detect entity boundaries. However, this can be avoided by adding the DNEN task. Specifically, an output of DNER such as "B-DRUG" is an explicit signal indicating the start of a drug entity, so that the search space of DNEN can be reduced, and vice versa. Therefore, two explicit feedback strategies were incorporated as a part of the multitask learning framework to simulate the reciprocal enhancement effect among different tasks.
A multitask learning framework resembling that proposed by Zhao [42] was used to enable DNER and DNEN to support each other and to enhance the generalization ability of the model. In the first step, prior to the training process, the training set was divided into subsets applicable to the $T$ tasks: $D_1, \ldots, D_T$. In the training process, a task $t$ was chosen and a random training instance $(w_{1:n}, y^t_{1:n}) \in D_t$ was acquired, where $w_i \in W$ and $W$ denotes the input set, and $y^t_i \in L^t$ and $L^t$ denotes the label set. The output layer specific to the task $t$ was used to predict the label $y^t_i$, and the updated parameters were then fed back to the model for asynchronous training of DNER and DNEN, with the particular equations shown in Figure 4, where $\mathrm{DNER}(w_{1:n}, i)$ and $\mathrm{DNEN}(w_{1:n}, i)$ denote the DNER and DNEN functions with the word sequence $w_1, w_2, \ldots, w_n$ and the index $i$ as inputs, $y_i^{DNER}$ is the output of entity recognition over the named entity labels, $y_i^{DNEN}$ is the output of entity normalization over the entity vocabulary labels, $v_i^{DNER}$ is the input of the DNER multiclass classification function, which combines the output of BiLSTM-CNN and the explicit feedback of DNEN, and $v_i^{DNEN}$ is the input of the DNEN multiclass classification function, which combines the output of BiLSTM-CNN and the explicit feedback of DNER. $U$ is the matrix mapping from DNEN to DNER, and $V$ is the matrix mapping from DNER to DNEN.
In this article, a fully shared mode was adopted to make the BiLSTM-CNN layer shared among tasks, which means that all parameters contained in the model would be shared, except for the output layers applicable to DNER and DNEN. This construction enables the proposed model to capture feature representations of different tasks and interactively give feedback to generate prediction sequences.
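The explicit feedback between the two task heads can be sketched as follows. The exact equations are those of Figure 4; this sketch assumes that the feedback is added to the shared representation after being mapped by U or V, and the additive form, the function names, and all numbers are assumptions made only for illustration.

```python
def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def feedback_inputs(shared, dnen_out, dner_out, U, V):
    """Build the inputs of the two task-specific classifiers.

    shared   : shared BiLSTM-CNN representation of word i
    dnen_out : DNEN output for word i, mapped into DNER's input by U
    dner_out : DNER output for word i, mapped into DNEN's input by V
    """
    v_dner = add(shared, matvec(U, dnen_out))  # assumed additive feedback
    v_dnen = add(shared, matvec(V, dner_out))
    return v_dner, v_dnen

shared = [1.0, 2.0]
dnen_out = [1.0, 0.0, 0.0]             # toy distribution over 3 entity IDs
dner_out = [0.0, 1.0]                  # toy distribution over 2 labels
U = [[0.5, 0.0, 0.0], [0.0, 1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
v_dner, v_dnen = feedback_inputs(shared, dnen_out, dner_out, U, V)
```

Because the shared representation enters both inputs, only the mapping matrices and the two output layers are task specific, which matches the fully shared mode described above.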

Network Training
In this section, we provide particular information in relation to training the neural networks, including corpus, hyperparameters, optimizer, and assessment criteria. PyTorch was used to implement the proposed model, which was run on an Nvidia GTX 1080.

Datasets and Preprocessing.
Data were obtained from the DDI2011 and DDI2013 challenge corpora to construct the dataset for training the deep learning model, and the dataset was preprocessed as follows. The dataset was randomly divided into T subsets, where T is an integer greater than or equal to 2. Four alphabets (word, character, label, and feature) were established for each subset. Each alphabet is a dictionary storing {key: instance, value: index} pairs, where the key is the stored instance (e.g., a word) and the value is its index. Based on the four alphabets of each subset, two lists were established for each subset, each containing four columns of data. The four columns of data in the first list are [words, chars, labels, features], and the four columns of data in the second list are [words_Ids, chars_Ids, labels_Ids, features_Ids].
In the experiment, the DDI2011 Challenge Corpus from the drug-drug interaction task was used. The minidom module of Python was used to extract <sentence> and <entity> elements, get the essential text and entity information, create a list, and match and annotate the entity and text. Next, all training datasets were collected as training data, and all test datasets were collected as test data. In this work, the sample was preprocessed using BIO labeling, where B denotes the first token of an entity in the sample, I denotes a token inside the entity, and O denotes a token that does not belong to any entity. Table 1 lists the distribution of documents, sentences, and drugs contained in the training and test sets of DDI2011 [6]. Since there is only one type of entity name (DRUG) in this corpus, the text would be only labeled as "B/I-DRUG" or "O".
For further performance evaluation of the proposed model, the SemEval-2013 dataset of the drug name recognition and classification task was used. Table 2 shows the numbers of annotated entities in the DDI2013 training set and test set. The dataset contains four entity types: Drug, Brand, Group, and Drug_n [43]. Drug denotes any chemical agent used to treat, cure, prevent, or diagnose human diseases. Brand is characterized by a trade name or brand name. Group denotes any term that specifies the chemical or pharmacological relations between a group of drugs mentioned in the text, and Drug_n describes a kind of chemical agent that has not been approved for human medical use.
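The alphabets and the two parallel lists described above can be sketched as follows. The feature column is omitted, indices are assumed to start at 0 in order of first appearance (an actual implementation might reserve indices for padding or unknown tokens), and the example sentence is made up.

```python
def build_alphabet(instances):
    """Map each distinct instance to an index, in order of first appearance."""
    alphabet = {}
    for instance in instances:
        if instance not in alphabet:
            alphabet[instance] = len(alphabet)
    return alphabet

words = ["Omeprazole", "is", "a", "drug"]
labels = ["B-DRUG", "O", "O", "O"]

word_alphabet = build_alphabet(words)
char_alphabet = build_alphabet(ch for w in words for ch in w)
label_alphabet = build_alphabet(labels)

# First list: the raw instances; second list: the corresponding indices.
instance_list = [words, [list(w) for w in words], labels]
id_list = [
    [word_alphabet[w] for w in words],
    [[char_alphabet[ch] for ch in w] for w in words],
    [label_alphabet[label] for label in labels],
]
```

Keeping both lists side by side makes it easy to map predictions (indices) back to readable words and labels.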
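The extraction and BIO labeling steps can be sketched with `xml.dom.minidom` from the Python standard library. The sample document, the attribute names (`text`, `charOffset`, `type`), the inclusive single-span offsets, and the whitespace tokenization are assumptions made for illustration; the actual corpus files may differ (for example, discontinuous entities are not handled here).

```python
from xml.dom import minidom

SAMPLE = """<document id="d0">
  <sentence id="d0.s0" text="Omeprazole may interact with warfarin sodium.">
    <entity id="d0.s0.e0" charOffset="0-9" type="drug" text="Omeprazole"/>
    <entity id="d0.s0.e1" charOffset="29-43" type="drug" text="warfarin sodium"/>
  </sentence>
</document>"""

def bio_tag(sentence_text, spans):
    """Whitespace-tokenize and assign B-/I-/O labels from inclusive char spans."""
    tokens, labels, pos = [], [], 0
    for token in sentence_text.split():
        start = sentence_text.index(token, pos)
        end = start + len(token) - 1
        pos = end + 1
        label = "O"
        for span_start, span_end, etype in spans:
            if start <= span_end and end >= span_start:  # token overlaps entity
                label = ("B-" if start <= span_start else "I-") + etype
                break
        tokens.append(token)
        labels.append(label)
    return tokens, labels

def parse_document(xml_text):
    """Return one (tokens, labels) pair per <sentence> element."""
    dom = minidom.parseString(xml_text)
    results = []
    for sentence in dom.getElementsByTagName("sentence"):
        spans = []
        for entity in sentence.getElementsByTagName("entity"):
            first, last = entity.getAttribute("charOffset").split("-")
            etype = entity.getAttribute("type").upper()
            spans.append((int(first), int(last), etype))
        results.append(bio_tag(sentence.getAttribute("text"), spans))
    return results

tokens, labels = parse_document(SAMPLE)[0]
```

In this sketch punctuation stays attached to the preceding token; a real pipeline would typically use a proper tokenizer before aligning tokens with entity offsets.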

Pretrained Embedding.
In this work, GloVe (Pennington et al. [25]) was used to initialize the word embedding obtained from the pretraining on PMC and PubMed, and the context-sensitive word vectors were acquired using ELMo. The character embedding was randomly initialized from a uniform distribution over [−…]. Table 3 lists the hyperparameters used in the course of the experiment. The dimensions of pretrained word embedding, character embedding, and contextualized word embedding were set to 30, 100, and 1024, respectively. In the training process, the parameters were updated using Minibatch Stochastic Gradient Descent (SGD) with a decaying learning rate. The initial learning rate, Dropout rate, and batch size of the proposed model were set to 0.015, 0.5, and 10, respectively.

Criteria for Evaluation.
In the experiment, the system performance was evaluated by precision, recall rate, and F1. Precision represents the correctly predicted entities as a percentage of all predicted entities. Recall rate represents the correctly predicted entities as a percentage of all entities contained in the dataset. F1 represents the harmonic mean of precision and recall rate, with the following equations:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$

where TP denotes the number of true-positive samples, TN denotes the number of true-negative samples, FP denotes the number of false-positive samples, and FN denotes the number of false-negative samples. Two out of the four criteria for evaluation available in the DDI2013 [43] Challenge Corpus were used: type matching (a predicted drug name is correct if it overlaps with a gold drug name of the same category) and strict matching (a predicted drug name is correct only if its boundary and category are both the same as those of the gold drug name).
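The two matching criteria and the resulting precision, recall rate, and F1 can be sketched as follows. Entities are represented as (start, end, type) tuples with inclusive offsets, each gold entity can be matched at most once, and the example entities are made up; this is an illustrative simplification rather than the official evaluation script.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from entity-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def count_matches(gold, predicted, strict=True):
    """Return (TP, FP, FN) under strict or type matching."""
    def matches(p, g):
        if p[2] != g[2]:
            return False  # the category must agree under both criteria
        if strict:
            return p[0] == g[0] and p[1] == g[1]  # identical boundaries
        return p[0] <= g[1] and p[1] >= g[0]      # any overlap
    unmatched_gold = list(gold)
    tp = 0
    for p in predicted:
        hit = next((g for g in unmatched_gold if matches(p, g)), None)
        if hit is not None:
            unmatched_gold.remove(hit)
            tp += 1
    return tp, len(predicted) - tp, len(gold) - tp

gold = [(0, 9, "drug"), (29, 43, "drug")]
predicted = [(0, 9, "drug"), (29, 36, "drug"), (50, 55, "brand")]

strict_scores = prf(*count_matches(gold, predicted, strict=True))
type_scores = prf(*count_matches(gold, predicted, strict=False))
```

The partially overlapping prediction (29, 36) counts as an error under strict matching but as a hit under type matching, which is why type-matching scores are never lower than strict-matching scores for the same predictions.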

Experiment and Analysis
The multi-DTR model described here was evaluated on DDI2011 and DDI2013, two representative biomedical corpora. Table 4 gives the performance comparison of multi-DTR with the works done by other teams. Next, the impact of each architecture component (e.g., different embedded layers, different optimization methods, and the multitask mutual feedback framework) of the proposed model on the experiment was assessed. The findings of the comparison suggest that the architectures of the proposed model perform well in the experiment.

Table 1: Distribution of documents, sentences, and drugs in the DDI2011 training and test sets.
Set         Documents   Sentences   Drugs
Training    435         4267        11260
Final test  144         1539        3689
Total       579         5806        14949

Performance Comparison with Available Methods.
The results were compared with those of the works done by other teams. For the sake of fairness and rationality of the experiment, the hyperparameters of the proposed model were configured according to the optimal parameters referred to in the article. As is seen from Table 4, the dictionary-based and rule-based methods proposed earlier, including those of Tsuruoka et al. [12] and Hettne et al. [13], yielded reasonable results, as did the subsequent deep learning models. For example, LASIGE [43] combined CRF with a list of dictionary terms for DNER collected from databases in order to recognize and classify entities. Zeng et al. [16] used the BiLSTM-CRF structure to identify drug entities without the aid of any external dictionary, with good results attained. Yang et al. [37] used a hierarchical recurrent network for cross-language transfer learning. The model proposed by Liu et al. [44] combines word embedding trained on biomedical text with the semantic features of three drug dictionaries and shows an impressive performance on DDI2013. The precision of our proposed model is 0.90% lower than that of Liu et al. [44], but its recall rate and F1 are 6.23% and 2.43% higher, respectively.
For the evaluation on the DDI2013 dataset, Table 5 provides a summary of the performance of the proposed model in recognizing each entity type of DDI2013.
Despite good performance in type recognition, the proposed model may fail to distinguish Drug_n from the other entity types because Drug_n entities account for only a small percentage (<4%) of the dataset. As a result, the recognition accuracy of the proposed model for Drug_n would be lower than that for any other entity type.

Performance Comparison of Different Statements.
This work proposed using pretrained word embedding, character representation, and context-sensitive word embedding to obtain additional feature information, as given in Table 6. To test the impact of different input representations on the proposed model, the three kinds of embedding information were input into the model individually and in combination. According to the results, combined representations are better than a single representation, and combining all three representations attains the best performance.

Comparison of Optimization Methods.
Different optimizers, including SGD, AdaGrad, Adadelta, RMSProp, and Adam, were compared here. SGD calculates gradients and updates parameters on randomly drawn training samples of a fixed size, which helps it avoid saddle points and poor local optima. AdaGrad adapts the learning rate of each parameter and is suitable for processing sparse gradients, but its learning rate may shrink until the updates vanish. Adadelta is an extension of AdaGrad and simplifies the computational process. RMSProp relies on a global learning rate and is suitable for processing nonstationary targets. Adam adjusts the parameter-specific learning rate using first-order and second-order moment estimation, but it is vulnerable to generalization and convergence problems. According to the experimental results, as given in Figure 5, SGD is significantly better than any other optimizer.

Performance Comparison in Case of Dropout.
The effectiveness of Dropout was evaluated here, with all of the other hyperparameters in the model identical to those in Table 3. As given in Table 7, the performance of the proposed model on DDI2011 and DDI2013 was slightly improved after Dropout was used, which suggests that Dropout plays a part in reducing overfitting.

Performance Comparison between Multitask Learning and Single-Task Learning.
The effectiveness of the multitask learning strategy was also examined. As seen from Table 8, jointly modeling DNER and DNEN by using two explicit feedback strategies significantly improves the model performance, partly because multitask learning provides a general representation of both tasks and partly because the proposed method converts hierarchical tasks into a parallel multitask setting and retains mutual support between different tasks.

Conclusion
Drug text mining is a key interdisciplinary field of computer science and biomedicine. In this work, a multitask learning framework was tailored for DNER, with an impressive performance on DDI2011 and DDI2013. Through detailed analysis, the main gains of the proposed model can be attributed to character sharing between drug entities, pretrained word embedding, and context-sensitive word embedding information. The conflict of entity boundary and type can be generally resolved by the positive feedback between DNER and DNEN. According to the experimental results, the proposed method performs well without the aid of any drug dictionary or manually created features, so an efficient DNER system was constructed.

Data Availability
The experimental datasets used in this work are publicly available, and the bundled data and code of this work are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Computational Intelligence and Neuroscience