Detecting Personal Medication Intake in Twitter via Domain Attention-Based RNN with Multi-Level Features

Personal medication intake detection aims to automatically detect tweets that show clear evidence of personal medication consumption. It is a research topic that has attracted considerable attention in drug safety surveillance. This task inevitably depends on medical domain information, yet the current mainstream models for this task do not explicitly consider domain information. To tackle this problem, we propose a domain attention mechanism for recurrent neural networks (LSTMs) with a multi-level feature representation of Twitter data. Specifically, we utilize a character-level CNN to capture morphological features at the word level. Subsequently, we feed them together with word embeddings into a BiLSTM to get the hidden representation of a tweet. An attention mechanism is introduced over the hidden states of the BiLSTM to attend to specific medical information. Finally, classification is performed on the weighted hidden representation of the tweet. Experiments on a publicly available benchmark dataset show that our model can exploit the domain attention mechanism to take medical information into account and improve performance. For example, our approach achieves a precision of 0.708, a recall of 0.694, and an F1 score of 0.697, significantly outperforming multiple strong and relevant baselines.


Introduction
Social media (Twitter, Facebook, etc.) encourages users to frequently express their thoughts, opinions, and other personal information of their lives. Existing studies have demonstrated that social media messages can provide knowledge for tracking public political opinion [1], detecting news events [2], and tracking the spread of infectious diseases [3]. Some research has also shown that social media can be used as a resource for mining public health information [4][5][6], especially in cases where the health data from official institutions is not readily available.
In this work, we focus on the task of medication intake detection, which is an individual-level surveillance of drug safety [7] in the public health domain. The goal of the task is to automatically detect tweets that express clear evidence of personal medication consumption. For example, some messages indicate that the author has taken a medicine, while others mention a medicine without indicating any intake.
There are two types of existing methods for this emerging task in the public health domain. The first involves applying traditional classification algorithms with hand-crafted features. For example, Kiritchenko et al. [8] exploited an SVM classifier with a variety of features for this task. The second avoids the time-consuming effort of designing hand-crafted features by using deep neural networks and directly feeding word embeddings as the input. Several methods based on CNNs (Convolutional Neural Networks) have achieved good performance [9][10][11].
Both types have limitations. Traditional classification methods lack nonlinear mapping ability, although they can make full use of domain knowledge (e.g., drug lexicons and domain word clusters) [8]. The neural methods based on pre-trained word embeddings [12,13] are domain-independent and are not very effective for a specific task [14,15] such as our medicine intake classification task. In addition, it is not always possible to train task-specific word embeddings due to limitations on training resources. Although some efforts have addressed domain relevance at the feature level, for example, in sentiment analysis of Nepali COVID-19 tweets [16], we have found no prior work that considers domain information for our task.
To deal with the aforementioned limitations, we introduce a domain attention mechanism for recurrent neural networks with multi-level inputs to learn an informative representation of tweets. The attention mechanism enables the model to learn a domain-specific (medicine) representation, which automatically weights the words in the text for the medication intake detection task. Meanwhile, the proposed model takes both word-level and character-level features as inputs to the network. Character-level representations are beneficial for many text analysis tasks [17][18][19], especially for informal text [20,21] such as tweets.
In particular, the proposed model generates word representations using a character-level CNN, which are fed to a highway network. We then concatenate them with pre-trained word embeddings before feeding them to a BiLSTM network. Subsequently, as previously mentioned, the BiLSTM is equipped with an attention mechanism that attends differently to different words while learning the representation of the text. The attention-based BiLSTM also learns higher-level features over the whole text sequence of a tweet. Finally, softmax is applied to the final tweet representation for the classification task. We compared the experimental results obtained using our method with several strong and relevant baselines. We observe that our approach, with a micro-averaged F-score of 0.697 for Classes 1 and 2, achieves better performance than all other methods except ensemble approaches, which tend to be more effective than single-model approaches. Altogether, this work introduces a novel attentional RNN framework with multi-level features that can effectively be applied to the personal medication intake detection task.

Related Work
Personal medication intake detection is a short text classification task. Representative methods for this task include statistical machine learning methods and deep learning methods. The vast majority of the first category is based on the vector space model, which is a typical method for tweet classification [22,23]. Wang et al. [24] developed an SVM-based text classification algorithm. Chen et al. [25] and Jiang et al. [26] exploited the Naive Bayes (NB) approach and KNN for this task, respectively. Wan et al. [27] implemented a new document classification method by integrating KNN and SVM, while Rogati et al. [28] investigated a large number of feature selection methods for text classification. However, these methods depend heavily on feature engineering, which cannot represent the grammatical and deep semantic information of words well.
Deep learning methods can automatically select features and have therefore become the mainstream methods for text classification in recent years. The first step is to learn word representations using related methods [29][30][31]. Based on them, researchers initially adopted CNN-based methods to classify texts [32,33]. Collobert et al. [33] extracted local features by using a convolutional layer. Kim [34] constructed a single-layer convolution network for sentence classification. Kalchbrenner et al. [35] proposed a CNN model with multi-layer dynamic k-max pooling, taking random low-dimensional word vectors as input. Er et al. [36] developed an attention-based pooling component, which can capture more semantic information. Yin et al. [37] developed a multi-channel variable-size CNN, which supports multiple pre-trained word embeddings and variable-size convolution kernels to obtain multi-granularity phrase features. More recently, RNN-based models have shown good performance. Lee et al. [38] exploited a convolutional recurrent neural network to process long text sequences. Lai et al. [39] proposed a bi-directional recurrent structure that can utilize the context information of words to classify text.
In addition, the participating systems of the SMM4H shared task are related to our method. These systems can also be divided into traditional statistical methods [8,40] and neural network methods [10,41]. Because evaluation tasks reward high performance scores, most of them used ensemble techniques. More details can be found in [9].

Personal Medication Intake Detection.
The primary objective of the personal medication intake detection task is the automatic classification of tweets mentioning medication intake, which is an emerging research topic in the public health domain based on social media. This is a three-class text classification task. Each tweet that mentions a medicine needs to be assigned to one of three categories: definite intake, possible intake, and non-intake. The details of these categories are as follows.
(i) Definite intake (Class 1): The user expresses clear evidence of personal medication consumption, for example, "Benadryl and Tylenol are the only things saving me at night these last few nights."
(ii) Possible intake (Class 2): The tweet suggests that the poster may have taken the medication, but there is no clear evidence, for example, "I would love to intravenously pump Motrin and caffeine into my body immediately."
(iii) Non-intake (Class 3): There is no evidence that the user has consumed the medication; the tweet only mentions medication names, for example, "stay out of the heat, only drink water, and stay off your feet for a day or two. Tylenol is all you can take for pain."

Character Convolutional Neural Network (CNN).
A Character Convolutional Neural Network (C-CNN) [17,42] takes characters as input, rather than the words used by a traditional CNN.
Given an input word $w$, which can be seen as a character sequence $C = c_1, c_2, \ldots, c_n$, where $n$ is the length of the word, the C-CNN applies the convolution operation on the character sequence to generate the feature map $V_1$ as follows:
$$V_1 = \mathrm{Conv}(W, C) + b \quad (1)$$
where Conv denotes the convolution kernel and $W$ and $b$ are learnable parameters. In practice, different convolution kernels are used to capture various features. A pooling layer, which compresses the feature map and retains the crucial features for the next layer, is usually applied after the convolution layer. The computing process can be written as
$$V_2 = \mathrm{Pooling}(V_1) \quad (2)$$
There are two common pooling operations: max pooling and mean pooling. For example, max pooling chooses the maximum value in a pooling window as the output of the pooling process. Several combinations of convolution and pooling layers can be used in practice for specific tasks.
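The convolution-then-pooling computation described above can be illustrated with a minimal pure-Python sketch. It assumes a single kernel, no activation function, and toy 2-dimensional character embeddings; the function name `char_conv_max_pool` and all values are hypothetical and are not taken from the paper's implementation.

```python
def char_conv_max_pool(char_vectors, kernel, bias=0.0):
    """Slide one kernel over a word's character vectors, then max-pool.

    Assumes the word has at least as many characters as the kernel is wide.
    """
    width = len(kernel)  # the kernel spans `width` consecutive characters
    feature_map = []
    for start in range(len(char_vectors) - width + 1):
        window = char_vectors[start:start + width]
        # Dot product between the kernel and the character window, plus bias.
        value = sum(
            k * c
            for k_row, c_row in zip(kernel, window)
            for k, c in zip(k_row, c_row)
        ) + bias
        feature_map.append(value)
    # Max pooling over the whole feature map keeps the strongest response.
    return feature_map, max(feature_map)


# Toy input: a 4-character word with 2-dimensional character embeddings.
word = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # one kernel of width 2
feature_map, pooled = char_conv_max_pool(word, kernel)
```

In a full model, many kernels of several widths would be applied, and their pooled values concatenated into one vector per word.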

Bi-Directional Long Short Term Memory (BiLSTM).
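The Long Short Term Memory (LSTM) network was introduced by Hochreiter et al. [43] and was refined and popularized by many works [44][45][46]. The LSTM addresses the long-term dependency problem in the RNN model [47]. Given a sequence $X = x_1, x_2, \ldots, x_n$ as input, the operations performed by the LSTM units, in their standard form, are as follows:
$$f_t = \delta(W_f [h_{t-1}, x_t] + b_f)$$
$$i_t = \delta(W_i [h_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \delta(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $h_t$ is the output of the LSTM at time step $t$, $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, and $c_t$ is the cell state. $W$ and $b$ are the weights and biases, respectively, and $\delta$ is a sigmoid layer. In many NLP tasks, a bi-directional LSTM is used to obtain forward and backward information of words in a sequence. A BiLSTM concatenates the forward and backward hidden states as its output:
$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$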
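To make the gate computations and the bi-directional concatenation concrete, the following is a small sketch of a standard LSTM cell with a scalar input and a scalar hidden state. The scalar simplification, the parameter layout (`params[gate] = (w_x, w_h, b)`), and all function names are illustrative assumptions rather than the configuration used in the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One standard LSTM step for a scalar input and scalar hidden state."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def run_lstm(xs, params):
    """Run the cell over a sequence and collect the hidden state at each step."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


def run_bilstm(xs, fwd_params, bwd_params):
    """Pair each position's forward state with its backward state."""
    forward = run_lstm(xs, fwd_params)
    backward = run_lstm(xs[::-1], bwd_params)[::-1]
    return list(zip(forward, backward))


gates = ("forget", "input", "output", "candidate")
params = {gate: (0.5, 0.5, 0.0) for gate in gates}
states = run_bilstm([1.0, 2.0, 1.0], params, params)
```

Each element of `states` is the pair (forward state, backward state) for one position, which corresponds to the concatenated BiLSTM output for that word.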

Methods
Our proposed model combines the C-CNN and BiLSTM. It also introduces an attention mechanism in the BiLSTM for the personal medication intake detection task. The model consists of a character-level word embedding component, a word-level feature representation component (Character Language Model, CLM) that uses a C-CNN, a sentence-level feature representation component using a BiLSTM, and a Domain Attention Component (DAC). An overview of our model is shown in Figure 1.
Since tweets are mostly informal, traditional word embeddings cannot represent them well. Therefore, we use a Character Language Model (CLM) to capture morphological features at the word level. First, a character embedding $e_c$ is created for each character in a word. Our model then converts the character embedding sequence into a vector using the CLM, which is a kind of C-CNN. The structure of the CLM is shown on the right in Figure 1.
Specifically, for every word $w$ in a sentence, after passing it through the convolutional and max pooling layers, our model utilizes a highway network [42,48] to regulate the information flow:
$$C = t \odot H(V_2) + (1 - t) \odot V_2$$
where $H$ is a nonlinear function, $V_2$ is calculated by equation (2), and $t$ and $(1 - t)$ are called the transform gate and carry gate, respectively.
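The role of the transform and carry gates can be seen in a short sketch. It assumes element-wise (diagonal) gate weights and tanh as the nonlinear function $H$ for brevity, whereas a real highway layer would normally use full weight matrices; the names `highway`, `w_t`, and `b_t` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def highway(v, w_t, b_t, nonlinearity=math.tanh):
    """Blend a transformed copy of the input with the input itself.

    Element-wise gate weights are an assumption made to keep the sketch short.
    """
    output = []
    for x, w, b in zip(v, w_t, b_t):
        t = sigmoid(w * x + b)  # transform gate; (1 - t) is the carry gate
        output.append(t * nonlinearity(x) + (1.0 - t) * x)
    return output


pooled = [0.5, -1.0, 2.0]  # stands in for the pooled C-CNN features
# A strongly negative gate bias makes the layer carry its input through.
carried = highway(pooled, w_t=[0.0] * 3, b_t=[-20.0] * 3)
# A strongly positive gate bias makes the layer apply the nonlinearity.
transformed = highway(pooled, w_t=[0.0] * 3, b_t=[20.0] * 3)
```

With intermediate gate values, the layer learns per-dimension how much of the pooled character features to pass on unchanged.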
After obtaining $C_i$, the character-level representation of the $i$-th word, we concatenate it with its word embedding $e_i$ to generate the final representation of the word, $[e_i; C_i]$, which for brevity is again written as $e_i$. Subsequently, we feed a sentence $s = (e_1, e_2, e_3, \ldots, e_n)$ into a BiLSTM network to get the hidden states $h = (h_1, h_2, h_3, \ldots, h_n)$. In our experiments, we treat each tweet as one sentence and still achieve good results, since most of the tweets in our dataset are short and contain only one or two sentences.
At this stage, the model performs only general processing of tweets. Therefore, for the medicine intake detection task, we introduce a DAC to attend to the domain-specific information that is relevant for detection. The DAC aims to assign high weights to the words that are informative for medicine intake. First, each hidden state $h_k$ is passed through a single-layer perceptron, and the result $\tilde{h}_k$ is used as the hidden representation of $h_k$. The weight of a word is determined by the similarity between $\tilde{h}_k$ and a parameter $D$, which can be seen as a domain context vector. After normalization with a softmax function, we obtain the attention weights, which indicate the weight of each word in a sentence. Finally, the tweet representation $u$ is calculated as the weighted sum of the hidden states of its words. The output is computed as follows:
$$\tilde{h}_k = \tanh(W_a h_k + b_a)$$
$$a_k = \frac{\exp(\tilde{h}_k^\top D)}{\sum_j \exp(\tilde{h}_j^\top D)}$$
$$u = \sum_k a_k h_k$$
where $W_a$ and $b_a$ are the weight and bias, respectively, and $a_k$ is the attention value of the $k$-th word, which measures the weight of that word in the sentence. The tweet vector $u$, which represents the whole text sequence of a tweet, is a higher-level representation and can be used directly as a feature for medicine intake detection:
$$\hat{y} = \mathrm{softmax}(W_s u + b_s)$$
where $W_s$ and $b_s$ are the weight and bias of the output layer. The final optimization objective is to minimize the negative log likelihood of the correct labels:
$$L = -\sum_{d} \log \hat{y}_{d,p}$$
where the sum runs over the training tweets $d$ and $p$ represents the ground-truth label of the tweet.
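The attention steps described above (perceptron projection, similarity with the domain context vector $D$, softmax normalization, and weighted sum) can be sketched in plain Python as follows. The tanh activation, the dot-product similarity, and the function name `domain_attention` are assumptions made for illustration; the toy vectors are not from the paper.

```python
import math


def domain_attention(hidden_states, W_a, b_a, D):
    """Return (attention weights, tweet vector) for a list of hidden states."""
    scores = []
    for h in hidden_states:
        # Single-layer perceptron; tanh is an assumed activation.
        projected = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(W_a, b_a)
        ]
        # Similarity with the domain context vector, assumed to be a dot product.
        scores.append(sum(p * d for p, d in zip(projected, D)))
    # Numerically stable softmax over the words of the tweet.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Tweet vector: attention-weighted sum of the hidden states.
    dim = len(hidden_states[0])
    u = [sum(a * h[j] for a, h in zip(weights, hidden_states)) for j in range(dim)]
    return weights, u


hidden = [[1.0, 0.0], [0.0, 1.0]]      # two words, 2-dimensional states
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
uniform_w, uniform_u = domain_attention(hidden, identity, zero_bias, D=[0.0, 0.0])
focused_w, focused_u = domain_attention(hidden, identity, zero_bias, D=[10.0, 0.0])
```

With a zero context vector every word receives the same weight, while a context vector aligned with the first dimension shifts almost all of the weight to the first word.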

Dataset.
Our experiments were conducted on a publicly available dataset from the 2nd SMM4H (Social Media Mining for Health) Shared Task at AMIA 2017. We collected the tweets using the Twitter download script and the tweet dataset description file provided by the organizers; we could not collect all of the tweets since some of them are no longer available. Table 1 summarizes the statistics for the dataset. Classes 1, 2, and 3 stand for personal medication intake, possible medication intake, and no medication intake, respectively. Our training dataset is a combination of the originally provided training and validation datasets. We utilize 10-fold cross-validation when training our model.
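As an illustration of 10-fold cross-validation, the sketch below splits sample indices into ten folds and yields a training and validation split for each. It assumes contiguous, unshuffled folds; the paper does not state whether the folds were shuffled or stratified, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of k folds.

    Folds are contiguous and unshuffled, which is an assumption of this sketch.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one validation fold, so every tweet is used for validation once and for training in the other nine runs.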

Model Configuration and Training.
We use pre-trained word embeddings to initialize the input of the neural network model. This is highly useful for NLP tasks [49,50]. In our work, we use 400-dimensional embeddings trained by a word2vec model on Twitter data [51]. For character embeddings, we use random initialization since there are no publicly available character embeddings in this case.
Our experiments involve two types of settings: hyper-parameters and other training settings. Specifically, the character embedding dimension is 15, the dimension of the hidden layer is 300, and the CLM has filters of widths [1, 2, 3, 4] and sizes [15, 30, 45, 60], for a total of 180 filters. Additionally, the batch size, the learning rate, the dropout rate, and the L2 regularization factor are set to 100, 3e-4, 0.3, and 5e-7, respectively. In our training process, we used early stopping with a patience value of 40.
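The settings listed above can be gathered in a configuration dictionary, and the early stopping rule can be sketched as a function over per-epoch validation scores. The key names and the function `early_stopping` are hypothetical, and the sketch assumes that a higher validation score is better and that patience counts epochs without improvement.

```python
# Values are taken from the text; the key names are illustrative.
CONFIG = {
    "word_embedding_dim": 400,
    "char_embedding_dim": 15,
    "hidden_dim": 300,
    "clm_filter_widths": [1, 2, 3, 4],
    "clm_filter_sizes": [15, 30, 45, 60],
    "batch_size": 100,
    "learning_rate": 3e-4,
    "dropout": 0.3,
    "l2_factor": 5e-7,
    "patience": 40,
}


def early_stopping(validation_scores, patience=CONFIG["patience"]):
    """Return (best_epoch, best_score, stop_epoch) for a list of scores.

    Training stops once `patience` epochs pass without a new best score.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(validation_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, best_score, epoch
    return best_epoch, best_score, len(validation_scores) - 1


# Toy run with a short patience so the stop is easy to follow.
result = early_stopping([0.1, 0.5, 0.4, 0.3, 0.6], patience=2)
```

In the toy run, the best score so far is at epoch 1, and training stops at epoch 3 after two epochs without improvement, so the later score of 0.6 is never reached.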

Baselines.
We conducted comparative experiments with several baseline methods on the dataset, including traditional machine learning algorithms, neural network methods, and state-of-the-art systems for this task. In the first category, we choose the NB and SVM algorithms: NB is a Naive Bayes classifier in which n-grams (n = 1, 2, 3) are used as features. SVM is a Support Vector Machine classifier with n-gram (n = 1, 2, 3) features.
Neural network models are currently the dominant methods for text processing. We chose the following representative methods: BiLSTM uses a traditional bi-directional LSTM model for medicine intake detection, representing a sentence with the hidden state of its last word. CharCNN [18] is a classical model that performs text classification using a character-level convolutional network. AttRNN [52] concatenates the last hidden state and the first hidden state of an RNN with an attentive representation of the hidden state sequence as the features of a text sequence.
The third group consists of the top three systems from the SMM4H Shared Task: InfyNLP [10] is the top-ranked system in the 2nd SMM4H Shared Task at AMIA 2017. It uses a stacked ensemble of shallow CNNs as the classifier for this task. UKNLP [41] is the second-ranked system in the Shared Task and utilizes a CNN with a self-attention component. NRC-Canada [8], the third-ranked system in the Shared Task, exploited an SVM classifier with a variety of hand-crafted features.
Table 2 presents the performance of the different methods. Following the setting of the SMM4H shared task, we report the micro-averaged precision, recall, and F1 scores over Class 1 (personal medication intake) and Class 2 (possible medication intake). The best results are shown in bold text. The results marked with "#" are copied from the original papers. The proposed model achieves the best F1 score among the strong baselines and the top systems of the SMM4H shared task. Compared with other neural classification methods, the proposed domain attention component and CLM improve performance on this task. CharCNN and AttRNN perform worse than our method, and BiLSTM performs the worst in this group, because these models were proposed for general text classification tasks, for example, topic classification. The former two methods perform better than BiLSTM because they introduce character-level information and an attention component, respectively. NB and SVM, as representatives of traditional machine learning methods, perform poorly because they cannot capture the semantic information of text as fully as neural network models. The performance of the top three systems is also not as good as that of our method.
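The micro-averaged scores over Class 1 and Class 2 can be computed by pooling true positives, false positives, and false negatives across the two classes before taking the ratios. The following is a minimal sketch of that calculation; the function name `micro_prf` and the toy labels are illustrative, and the official shared task scorer may differ in details.

```python
def micro_prf(gold, predicted, target_classes=(1, 2)):
    """Micro-averaged precision, recall, and F1 restricted to target classes."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == g and p in target_classes:
            tp += 1                      # correct prediction of a target class
        else:
            if p in target_classes:
                fp += 1                  # predicted a target class wrongly
            if g in target_classes:
                fn += 1                  # missed a gold target class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels: 1 = definite intake, 2 = possible intake, 3 = non-intake.
gold = [1, 2, 3, 1, 3]
predicted = [1, 3, 2, 2, 3]
precision, recall, f1 = micro_prf(gold, predicted)
```

Correct predictions of Class 3 do not contribute to any count, so a classifier cannot raise these scores by predicting non-intake well.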

Ablation Test.
In this subsection, we discuss the impact and contribution of the different components of our model. Specifically, we tested three settings. In the first, we remove the CLM only. In this case, the model does not capture character-level features. In the second, we remove the DAC only. In this case, the model does not take domain information into account. Finally, we remove both the CLM and the DAC. In this case, the model degenerates to a BiLSTM that uses only word-level features. Table 3 reports the results of this ablation study.
It is clear that both the CLM and the DAC are critical to the performance of our model. Removing one or both of them causes performance degradation. In particular, we observe that the CLM seems to be less important than the DAC; that is, removing the DAC causes a larger performance drop than removing the CLM.

Conclusion and Future Works
Personal medication intake detection, which aims to automatically detect tweets that express clear evidence of personal medication consumption, is an essential research topic in the surveillance of drug safety. In this work, we proposed a domain attention component for recurrent neural networks, for example, LSTMs, with a multi-level feature representation of text from Twitter. Through experiments on the public benchmark dataset, we validated the performance of our model. Our model obtains its best performance with an F1 score of 0.697. Compared with multiple strong baselines, it showed a significant performance improvement.
Our method still has limitations in domain-specific knowledge representation due to the representation ability of the neural network model itself. Thus, it would be interesting to combine a knowledge base, for example, a knowledge graph, with our model to obtain richer domain information for this task.

Data Availability
The data used to support the findings of this study have been deposited on the website https://healthlanguageprocessing.org/sharedtask2/smm4h-sharedtask-2017/

Conflicts of Interest
The authors declare that they have no conflicts of interest to report regarding the present study.