Interactive Dual Attention Network for Text Sentiment Classification

Text sentiment classification is an essential research field of natural language processing. Recently, numerous deep learning-based methods for sentiment classification have been proposed and have achieved better performance than conventional machine learning methods. However, most of the proposed methods ignore the interactive relationship between contextual semantics and sentimental tendency while modeling their text representation. In this paper, we propose a novel Interactive Dual Attention Network (IDAN) model that aims to interactively learn the representation between contextual semantics and sentimental tendency information. Firstly, we design an algorithm that utilizes linguistic resources to obtain sentimental tendency information from text and then extract word embeddings from the BERT (Bidirectional Encoder Representations from Transformers) pretraining model as the embedding layer of IDAN. Next, we use two Bidirectional LSTM (BiLSTM) networks to learn the long-range dependencies of contextual semantics and sentimental tendency information, respectively. Finally, two types of attention mechanisms are implemented in IDAN. One is multihead attention, which follows the BiLSTM layer and is used to learn the interactive relationship between contextual semantics and sentimental tendency information. The other is global attention, which makes the model focus on the important parts of the sequence and generates the final representation for classification. These two attention mechanisms enable IDAN to interactively learn the relationship between semantics and sentimental tendency information and improve the classification performance. Extensive experiments on four benchmark datasets show that our IDAN model is superior to competitive methods. Moreover, both the result analysis and the attention weight visualization further demonstrate the effectiveness of our proposed method.


Introduction
Sentiment analysis has been a hot topic in the field of Natural Language Processing (NLP) in recent years. With the rapid development of social networks and e-commerce, a large amount of text data with user sentiments has been generated on the Internet. Sentiment analysis for these data has significant application value [1][2][3]. Text sentiment classification is a subtask of sentiment analysis which aims to identify the sentiment polarity (e.g., positive and negative) of a text [4].
Traditional machine-learning-based sentiment classification methods mainly focus on manually designing a set of features, such as sentiment lexicon or bag-of-words features, to train classifiers [5]. However, this type of method is usually time-consuming and laborious.
In contrast, deep learning methods can learn feature representations automatically instead of relying on hand-crafted features and have been used in various NLP tasks such as machine translation [6], reading comprehension [7], and sentiment classification [8][9][10]. Word2Vec [11] and GloVe [12] are word embedding techniques that are often used in deep neural networks for word feature representation. However, Word2Vec and GloVe produce static, context-independent word vectors and therefore cannot adequately represent the semantics of words in different contexts.
To address this limitation, the BERT pretraining language model was proposed, which can generate context-aware dynamic word embedding representations and can model contextual semantics better [15].
Although context-aware semantic representation can be obtained through the BERT language model, the expression of sentimental tendency is still insufficient. Some studies have integrated linguistic resources (e.g., sentiment lexicon) into models to improve the sentimental tendency expression ability of neural networks [16][17][18]. Nevertheless, these studies have not adequately considered the possible interaction between contextual semantics and sentimental tendency.
This paper proposes a novel model called Interactive Dual Attention Network (IDAN), which is intended to utilize the interaction between contextual semantics and sentimental tendency information for sentiment classification.
First, we design an algorithm combining a sentiment lexicon, intensity words, and negation words to extract sentimental tendency information from text. The context-aware dynamic word embedding representation obtained through the BERT pretraining model is used as the embedding layer of IDAN. Next, we use two Bidirectional LSTM [19] (BiLSTM) networks to learn the long-range dependencies of contextual semantics and sentimental tendency information, respectively. Since the attention mechanism allows the network to focus on the important parts of the text sequence [20,21], two types of attention mechanisms are implemented in IDAN. One is multihead attention [22], which follows the BiLSTM layer and is used to learn the interactive relationship between contextual semantics and sentimental tendency information. The other is global attention [21], which makes the model focus on the important parts of the sequence and generates the final representation for the classifier. The main contributions of this paper are as follows: (i) An architecture of Interactive Dual Attention Network (IDAN) is proposed, which aims to implement interactive learning between contextual semantics and sentimental tendency information for sentiment classification. (ii) An algorithm to extract sentimental tendency information is proposed. (iii) IDAN is extensively evaluated on four benchmark datasets, and experimental results demonstrate that IDAN outperforms the competitive methods. The rest of this paper is organized as follows. Section 2 introduces related work on sentiment classification. Section 3 presents the details of the IDAN architecture and its implementation. Section 4 gives the experimental results and analysis. Finally, we conclude our research in Section 5.

Related Work
In this section, we will briefly introduce traditional methods for sentiment classification and focus on reviewing deep learning methods.

Traditional Methods.
Traditional lexicon-based methods use existing resources such as sentiment lexicons and some linguistic rules to identify the sentiment polarity of text [23,24]. However, these methods rely heavily on the construction of sentiment lexicons; thus there are few methods that only use lexicons for sentiment classification.
The key of sentiment classification methods based on traditional machine learning is to manually design suitable features for classifiers. Pang et al. [5] first proposed a standard machine learning method to solve sentiment classification problems, in which they attempted to construct different features for three classifiers: Naive Bayes (NB), Maximum Entropy (ME), and Support Vector Machine (SVM). Their experimental results show that SVM combined with unigram features is better than the NB and ME algorithms. Furthermore, lexicon information was integrated with SVM to improve the performance of sentiment classification [25].

Deep Learning Methods.
Due to their powerful representation ability, deep learning models have achieved remarkable results in numerous fields. For NLP, the Recurrent Neural Network (RNN) [19] is quite popular because it can handle variable-length sequences well. Thus, RNN is usually used as the basic network structure for sentiment classification [26]. On the other hand, CNN has achieved excellent results in the field of computer vision [27]. In addition, Kim [28] also used CNN for sentiment classification, showing that unsupervised pretraining of word vectors may be an important ingredient for NLP.
Furthermore, Wang et al. [29] proposed an architecture that combines CNN and RNN for sentiment classification.
This architecture makes use of the local features captured by CNN and the long-distance dependencies learned through LSTM or Gated Recurrent Unit (GRU). Tang et al. [30] proposed a model that encodes the intrinsic relations of sentences in semantic meaning, which uses LSTM or CNN to obtain sentence representations and then uses gated recurrent neural networks to aggregate them into document representations. Recently, the attention mechanism has been successfully applied in sentiment classification tasks. Yan and Guo [31] proposed a method for text classification using contextual sentences and the attention mechanism. Yang et al. [32] proposed a Hierarchical Attention Network (HAN) for document sentiment classification, in which the model can selectively focus on important single words or sentences when constructing the document representation.
In order to enhance sentimental tendency expression, some studies have integrated linguistic resources or some external knowledge into models to enable the network to learn sentiment-specific expressions. Tang et al. [33] encoded sentiment information into the continuous representation of words to learn Sentiment-Specific Word Embeddings (SSWE), which is more suitable for sentiment classification tasks. Qian et al. [16] proposed linguistically regularized LSTM for sentence-level sentiment classification, in which the proposed model addressed the sentimental shifting issue of the sentiment, negation, and intensity words. Besides, some studies also incorporated external knowledge (e.g., sentiment lexicons) into deep learning models for sentiment classification [17,18,34].
More recently, Lei et al. [35] proposed a hierarchical sequence classification model based on BERT and applied it to microblog sentiment classification. However, these methods have not considered the possible interaction between contextual semantics and sentimental tendency. Therefore, our proposed IDAN method uses the context-aware word embedding as the embedding layer and combines it with BiLSTM as well as attention mechanisms, aiming to conduct semantic modeling for a specific context and learn the interactive representation between contextual semantics and sentimental tendency information.

The Proposed Approach
In this section, we first briefly introduce the overall architecture of IDAN and then describe the details of the proposed method. The overall architecture of the IDAN model is shown in Figure 1. The model contains two input parts, context and sentimental tendency information, which model contextual semantics and sentimental tendency, respectively. The hierarchical structure of the model is divided into five layers. The first is the embedding layer, which converts the text sequence into a word embedding matrix. Then comes the BiLSTM layer, which models the semantic representation of long sequences. The third layer is the interaction layer, which learns the interactive representation of contextual semantics and sentimental tendency information. The fourth layer is the global attention layer, which combines the last output of the BiLSTM to capture the important sentiment polarity information in the sequence after interactive learning. The last layer is the output layer with a softmax classifier.

Sentimental Tendency Information Extraction.
The text sentimental tendency information elements are combinations of words or phrases with sentimental tendency. In order to extract these elements, external resources such as sentiment, intensity, and negation lexicons are utilized.
Here, we denote the sets of sentiment, intensity, and negation words by S, I, and N, respectively. Consider a dataset C containing K texts, in which c_i represents the i-th text. We scan each text in order and define, for the j-th word w_j, a continuous word sequence s(w_j) = w_{j−2} w_{j−1} w_j. The corresponding sentimental tendency element e_j is obtained by the following extraction criteria:

e_j = w_{j−2} w_{j−1} w_j, if (w_{j−2}, w_{j−1}, w_j) ∈ (N ⊗ I ⊗ S) ∪ (I ⊗ N ⊗ S);
e_j = w_{j−1} w_j, if (w_{j−1}, w_j) ∈ (N ∪ I) ⊗ S and w_{j−2} ∈ N̄ ∩ Ī;
e_j = w_j, if w_j ∈ S and w_{j−1} ∈ N̄ ∩ Ī;
e_j = ∅, otherwise, (1)

where ⊗ denotes the Cartesian product of two sets and N̄ and Ī denote the complements of sets N and I, respectively. The pseudocode of the extraction procedure is given in Algorithm 1.

Remark 1.
Since N ∩ I = ∅, N ∩ S = ∅, and I ∩ S = ∅, there is no conflict in the extraction criteria.

Remark 2.
We regard the sentiment word as the most important tendency information; thus, each element e must contain a word from set S.

Remark 3.
Grammatically, both intensity and negation words modify the sentiment word, so the sentiment word usually occupies the last position of each element e.
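The extraction criteria and remarks above can be illustrated with a minimal Python sketch. The lexicon sets below are toy placeholders (real experiments would load HowNet-style resources), and the window handling is a simplification of equation (1): each element ends with a sentiment word, optionally preceded by one or two negation/intensity modifiers.

```python
# Toy lexicons standing in for the real sentiment (S), intensity (I),
# and negation (N) resources used in the paper.
S = {"good", "bad", "great", "terrible"}
I = {"very", "extremely", "quite"}
N = {"not", "never", "no"}

def extract_tendency(tokens):
    """Scan a token list and collect sentiment-bearing elements.

    Each element contains a sentiment word in its last position
    (Remarks 2 and 3), mirroring the w_{j-2} w_{j-1} w_j window.
    """
    elements = []
    for j, w in enumerate(tokens):
        if w not in S:          # every element must contain a sentiment word
            continue
        prev1 = tokens[j - 1] if j >= 1 else None
        prev2 = tokens[j - 2] if j >= 2 else None
        if prev2 in N | I and prev1 in N | I:
            elements.append((prev2, prev1, w))   # e.g. "not very good"
        elif prev1 in N | I:
            elements.append((prev1, w))          # e.g. "very good"
        else:
            elements.append((w,))                # bare sentiment word
    return elements

print(extract_tendency("the room was not very good".split()))
# -> [('not', 'very', 'good')]
```

The sketch demonstrates why no conflicts arise (Remark 1): each word is tested against the disjoint sets S, I, and N in a fixed order.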

Embedding Layer.
Compared with context-independent static word embeddings, BERT can generate context-aware dynamic word embedding representations. Thus, we use BERT to obtain word embedding representations for the context and the sentimental tendency information. Here, w ∈ R^d denotes a real-valued word vector, where d is the dimension of the word embedding. Suppose that the context consists of n words; its corresponding word embedding matrix is denoted as [w^c_1, w^c_2, ..., w^c_n], where the superscript c refers to the context. Similarly, if the sentimental tendency information has m words, its corresponding word embedding matrix is denoted as [w^s_1, w^s_2, ..., w^s_m]. As shown in Figure 1, these two matrices are the inputs of the IDAN architecture.

Bidirectional LSTM Layer.
Because the words in a sentence depend strongly on their context, we use BiLSTM [36] in this layer. The BiLSTM includes a forward LSTM that reads the sentence from beginning to end and a backward LSTM that reads it in the opposite direction. Compared with LSTM, the BiLSTM can capture richer information. Therefore, we utilize two BiLSTM networks to learn the hidden states of the context and the sentimental tendency information, respectively.
An LSTM cell contains an input gate i, a forget gate f, an output gate o, and a memory cell c. In general, at each time step t, given the input word embedding w_t, the previous cell state c_{t−1}, and the hidden state h_{t−1}, the current cell state c_t and hidden state h_t in the LSTM networks are updated as

f_t = σ(W_f · [h_{t−1}; w_t] + b_f),
i_t = σ(W_i · [h_{t−1}; w_t] + b_i),
o_t = σ(W_o · [h_{t−1}; w_t] + b_o),
c̃_t = tanh(W_c · [h_{t−1}; w_t] + b_c),
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t), (2)

where W_f, W_i, and W_o represent weight matrices and b_f, b_i, and b_o represent biases learned by the LSTM during the training process, σ represents the sigmoid activation function, the symbol · represents matrix multiplication, and ⊙ represents element-wise multiplication. The forward LSTM hidden state h→_t and the backward LSTM hidden state h←_t at time step t in the context part of the model are expressed as

h→_t = LSTM→(w^c_t, h→_{t−1}), (3)
h←_t = LSTM←(w^c_t, h←_{t+1}). (4)

Then, the hidden state of the BiLSTM at time step t is expressed as

b^c_t = h→_t ⊕ h←_t, (5)

where the operator ⊕ represents concatenation. After the above operations, we can obtain the contextual semantics representation b^c = [b^c_1, b^c_2, ..., b^c_n] and, analogously, the sentimental tendency representation b^s = [b^s_1, b^s_2, ..., b^s_m].
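As a concrete illustration, a single LSTM update following equation (2) might be sketched in NumPy as follows; the toy dimensions, the random parameters, and the candidate-cell weights W_c/b_c (implied by the standard LSTM formulation) are placeholder assumptions, not the trained model's values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 3  # toy embedding and hidden sizes (assumed values)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, params):
    """One LSTM update: gates f, i, o and the memory cell c."""
    Wf, bf, Wi, bi, Wo, bo, Wc, bc = params
    x = np.concatenate([h_prev, w_t])          # [h_{t-1}; w_t]
    f = sigmoid(Wf @ x + bf)                   # forget gate
    i = sigmoid(Wi @ x + bi)                   # input gate
    o = sigmoid(Wo @ x + bo)                   # output gate
    c_tilde = np.tanh(Wc @ x + bc)             # candidate cell state
    c_t = f * c_prev + i * c_tilde             # element-wise update
    h_t = o * np.tanh(c_t)                     # new hidden state
    return h_t, c_t

# Randomly initialized parameters: four (W, b) pairs of matching shapes.
params = tuple(rng.standard_normal(s) for s in [(h, h + d), (h,)] * 4)
w_t = rng.standard_normal(d)
h_t, c_t = lstm_step(w_t, np.zeros(h), np.zeros(h), params)
print(h_t.shape)  # (3,)
```

A BiLSTM would run this step forward and backward over the sequence and concatenate the two hidden states at each position, as in equation (5).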

Interaction Layer. After the BiLSTM step, the contextual semantics representation b^c = [b^c_1, ..., b^c_n] and the sentimental tendency representation b^s = [b^s_1, ..., b^s_m] are obtained. We further use the multihead attention mechanism to learn the interactive representation between the contextual semantics and the sentimental tendency information.
The multihead attention is computed by concatenating multiple scaled dot-product attention outputs and takes three input matrices: Query (Q), Key (K), and Value (V). In the field of NLP, the Key and Value are usually equal [22]; that is, K = V. The scaled dot-product attention structure is shown in Figure 2(a) and is calculated as follows:

Attention(Q, K, V) = softmax(QK^T / √d_k) V, (6)

where 1/√d_k is the scaling factor. Figure 2(b) shows the structure of multihead attention, which consists of H parallel scaled dot-product attention layers. The multihead attention (here denoted by MHA) can be obtained by the following equations:

head_i = Attention(QW^Q_i, KW^K_i, VW^V_i), (7)
MHA(Q, K, V) = Concat(head_1, ..., head_H) W^O, (8)

where W^Q_i, W^K_i, W^V_i, and W^O are learned projection matrices. In the interactive representation calculation of the context part, the multihead attention has three inputs denoted by Q, K, and V, where Q denotes the contextual semantics and K and V denote the sentimental tendency information. Figure 3 illustrates this interaction process.

ALGORITHM 1: Sentimental tendency information extraction.
Input: The dataset C; the sets S, I, and N.
Output: The sentimental tendency information set T.
(1) Initialize T ← ∅
(2) For each text c_i in C do
(3) Initialize t_i ← ∅
(4) For each word w_j in c_i do
(5) obtain element e_j using equation (1);
(6) If e_j ≠ ∅ then
(7) add e_j to t_i;
(8) End If
(9) End For
(10) add t_i to T;
(11) End For
(12) Return T
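The scaled dot-product and multihead attention computations described above can be sketched in NumPy as follows. For brevity, the per-head projections W^Q_i, W^K_i, W^V_i and the output projection W^O are omitted (the model dimension is simply split across heads), so this is a simplified illustration of the mechanism rather than the full parameterized layer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multihead_attention(Q, K, V, heads):
    """Split the model dimension into heads, attend, and concatenate."""
    outs = []
    for q, k, v in zip(np.array_split(Q, heads, axis=-1),
                       np.array_split(K, heads, axis=-1),
                       np.array_split(V, heads, axis=-1)):
        outs.append(scaled_dot_product_attention(q, k, v))
    return np.concatenate(outs, axis=-1)

# Context queries attend over sentimental tendency keys/values (K = V).
rng = np.random.default_rng(1)
Qc = rng.standard_normal((5, 8))   # n = 5 context positions
Ks = rng.standard_normal((3, 8))   # m = 3 tendency positions
h_c = multihead_attention(Qc, Ks, Ks, heads=4)
print(h_c.shape)  # (5, 8)
```

Swapping the roles of the two inputs (tendency queries over context keys/values) yields the interactive representation of the other branch.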

Global Attention Layer.
In this layer, we use the global attention mechanism to capture the important information of the input sequence and generate an attention representation. As shown in Figure 1, in the context part, the attention infers an alignment weight vector α_n based on the last output b^c_n of the BiLSTM and all output states b^c_1, ..., b^c_n. The alignment weight vector α_n is calculated as follows:

α_{n,i} = exp(γ(b^c_i, b^c_n)) / Σ_{j=1}^{n} exp(γ(b^c_j, b^c_n)), (9)

where γ is a score function that calculates the importance of b^c_i. The score function γ is defined as

γ(b^c_i, b^c_n) = b^{cT}_n W_a b^c_i, (10)

where W_a is a weight matrix and b^{cT}_n is the transpose of b^c_n. A global context vector c_n is then computed as follows:

c_n = Σ_{i=1}^{n} α_{n,i} b^c_i. (11)

Finally, the attention representation a^c in the context part of the model is calculated as follows:

a^c = tanh(W_c [c_n; b^c_n]), (12)

where tanh is a nonlinear activation function and W_c is a weight matrix. Similarly, we can obtain the attention representation a^s of the sentimental tendency information.
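A minimal NumPy sketch of this Luong-style global attention, under the assumption that W_a and W_c are learned weight matrices (randomly initialized here for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(H, W_a, W_c):
    """Global attention over BiLSTM outputs H = [b_1, ..., b_n].

    Scores each position against the last hidden state b_n, forms the
    context vector c_n, and mixes it with b_n through a tanh layer.
    """
    b_n = H[-1]
    scores = H @ W_a @ b_n          # gamma(b_i, b_n) = b_n^T W_a b_i
    alpha = softmax(scores)         # alignment weights over positions
    c_n = alpha @ H                 # weighted sum of all outputs
    a = np.tanh(W_c @ np.concatenate([c_n, b_n]))
    return a, alpha

rng = np.random.default_rng(2)
n, d = 6, 4                         # toy sequence length and hidden size
H = rng.standard_normal((n, d))
a, alpha = global_attention(H, rng.standard_normal((d, d)),
                            rng.standard_normal((d, 2 * d)))
print(round(alpha.sum(), 6))  # 1.0 (the weights form a distribution)
```

The vector alpha is exactly the attention weight distribution visualized in the case analysis: larger entries mark the positions the model considers important.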
Output Layer. After the attention representations of contextual semantics and sentimental tendency information are obtained, we concatenate these two vectors into a vector v and use it as the input of a linear layer, in which a softmax classifier is implemented for C sentiment polarity categories. The probability of sentiment polarity i (i ∈ [1, C]) is calculated by equations (13) and (14), and the prediction label is set to the category with the highest probability value:

v = a^c ⊕ a^s, (13)
y_i = softmax(W_v v + b_v)_i, (14)
where W_v and b_v are the weight matrix and bias, respectively, and y_i represents the probability that the input sample belongs to category i.
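The output layer can be sketched as follows; the dimensions and randomly initialized W_v, b_v are placeholder assumptions for illustration.

```python
import numpy as np

def output_layer(a_c, a_s, W_v, b_v):
    """Concatenate the two attention representations and apply a
    softmax classifier over C sentiment polarity categories."""
    v = np.concatenate([a_c, a_s])     # v = a^c (+) a^s
    z = W_v @ v + b_v                  # linear layer
    e = np.exp(z - z.max())
    y = e / e.sum()                    # softmax probabilities
    return int(np.argmax(y)), y        # predicted class, distribution

rng = np.random.default_rng(3)
a_c, a_s = rng.standard_normal(4), rng.standard_normal(4)
label, probs = output_layer(a_c, a_s, rng.standard_normal((2, 8)),
                            np.zeros(2))
print(round(probs.sum(), 6))  # 1.0
```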
In the model, we denote all network parameters by Φ. Since L2 regularization can prevent the model from overfitting, we use cross entropy with L2 regularization as the loss function and optimize Φ. The cross-entropy loss function with L2 regularization is defined as

L(Φ) = − Σ_{t∈T} Σ_{i=1}^{C} g^i_t log y^i_t + λ‖Φ‖², (15)

Computational Intelligence and Neuroscience
where T is the training set, C is the number of categories, and g_t is the category vector of sample t, denoted in one-hot form. y^i_t denotes the predicted distribution over sentiment categories, and λ is the regularization coefficient.
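A NumPy sketch of this loss for a small batch; the logits, labels, and the single parameter matrix are toy assumptions, and a numerically stable log-softmax is used.

```python
import numpy as np

def idan_loss(logits, onehot, params, lam=1e-5):
    """Cross-entropy with L2 regularization.

    logits: (T, C) raw classifier outputs; onehot: (T, C) labels g_t;
    params: iterable of weight arrays Phi; lam: coefficient lambda.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    log_y = z - np.log(np.exp(z).sum(axis=1, keepdims=True))  # log softmax
    ce = -(onehot * log_y).sum()             # -sum_t sum_i g_t^i log y_t^i
    l2 = lam * sum((w ** 2).sum() for w in params)
    return ce + l2

logits = np.array([[2.0, 0.0], [0.0, 3.0]])
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = idan_loss(logits, labels, params=[np.ones((2, 2))], lam=1e-5)
```

Confident predictions on the correct classes drive the cross-entropy term toward zero, while the L2 term penalizes large weights.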
In summary, our IDAN neural network shown in Figure 1 can be expressed as a series of equations. Concretely, given the context con and the sentimental tendency information sen (obtained by Algorithm 1), the embedding matrices w^c and w^s can be obtained as follows:

w^c, w^s = Embedding(con, sen), (16)

where Embedding represents the embedding layer transformation. Next, the hidden states b^c and b^s can be calculated as follows:

b^c, b^s = BiLSTM(w^c, w^s), (17)

where BiLSTM is the bidirectional LSTM layer transformation (implemented by equations (3)-(5)). Then we can get the interactive representations h^c and h^s of the hidden states with respect to the context and sentimental tendency information using the following equation:

h^c, h^s = Interaction(b^c, b^s), (18)

where Interaction represents the interaction layer transformation (implemented by equations (7) and (8)). The global attention representations a^c and a^s can be obtained as follows:

a^c, a^s = Global(h^c, h^s), (19)

where Global is the global attention layer transformation (implemented by equation (12)). Finally, the sentiment polarity y_i can be calculated as follows:

y_i = Output(a^c, a^s), (20)

where Output represents the output layer transformation (implemented by equations (13) and (14)).

Experiments
In this section, four benchmark datasets are introduced, and then the details of the linguistic resources used in the experiments, the evaluation metrics, and the hyperparameter settings are given. Next, eight comparable baseline methods are listed and explained briefly. Finally, the experimental results and analysis are presented, including performance comparison, ablation experiments, and case analysis.

Datasets.
The experiments were evaluated on two Chinese datasets and two English datasets, which are described as follows: (i) ChnSentiCorp (available at https://www.aitechclub.com/data-detail?data_id=29): a Chinese hotel review dataset collected by Professor Songbo Tan. In the experiment, we chose a balanced corpus containing 6000 reviews that involve positive/negative reviews, which were randomly divided into an 80% training set and a 20% test set. (ii) MR [37]: a movie review dataset for which the classification task is positive/negative review discrimination. We randomly divided 80% of the reviews into the training set and the remaining 20% into the test set.
The summary of these datasets is shown in Table 1, where l represents the average length of the review corpus, |V_train| is the training set size, and |V_test| is the test set size.

Linguistic Resources.
For English data, we utilized linguistic resources published by HowNet (available at http://www.keenage.com/html/c_index.html) to extract sentimental tendency information. For Chinese data, the linguistic resources used to extract sentimental tendency information came from Jianlin Su (available at https://kexue.fm/archives/3360). These resources are summarized in Table 2. It should be noted that each Chinese word is accompanied by its corresponding English explanation in the following examples.

Evaluation Metrics.
We used Accuracy and Macro-F1 as evaluation metrics to evaluate the performance of IDAN. Accuracy is one of the most commonly used evaluation metrics in classification tasks and is defined as follows:

Accuracy = T / (T + N), (21)

where T and N represent the numbers of samples that the classifier predicted correctly and incorrectly, respectively. In contrast to Accuracy, the Macro-F1 score first calculates the Precision and Recall of each category separately.
The averages of the per-category Precision and Recall values are Precision_macro and Recall_macro, respectively. Then, Precision_macro and Recall_macro are used to calculate the Macro-F1 score. The calculation formulas are as follows:

Precision_macro = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FP_i), (22)
Recall_macro = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FN_i), (23)
Macro-F1 = 2 · Precision_macro · Recall_macro / (Precision_macro + Recall_macro), (24)

where C is the number of categories, and TP_i, FP_i, and FN_i are the numbers of true positives, false positives, and false negatives for category i, respectively.
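Note that this definition averages the per-class precisions and recalls first and only then combines them into an F1 score, which differs from averaging per-class F1 scores. A small Python sketch of this macro-averaged metric:

```python
from collections import Counter

def macro_f1(y_true, y_pred, num_classes):
    """Macro-F1: average per-class precision and recall first,
    then combine them into a single F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # true positive for class t
        else:
            fp[p] += 1          # false positive for predicted class p
            fn[t] += 1          # false negative for true class t
    prec = sum(tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
               for c in range(num_classes)) / num_classes
    rec = sum(tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
              for c in range(num_classes)) / num_classes
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1], 2))  # ≈ 0.789 (i.e., 15/19)
```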

Hyperparameters Setting.
In our experiments, the word embeddings of the IDAN model were extracted from the BERT pretraining model (the English and Chinese pretrained BERT models can be obtained from https://github.com/google-research/bert and https://github.com/ymcui/Chinese-BERT-wwm, respectively) with a dimension of 768. The number of neurons in the BiLSTM layer was set to 256, and the number of attention heads of the multihead attention was set to 8. All weight matrices were initialized by Glorot uniform, and all biases were initialized to zero. During the training process, we used the Adam [38] optimization algorithm to train the models with a learning rate of 10^-4. The batch size was set to 64. To avoid overfitting, a dropout layer with a dropout rate of 0.1 was used before the output layer. The coefficient of L2 regularization was set to 10^-5. Besides, we repeated each experiment 10 times and report the average results.

Baseline Methods.
We compare IDAN with several baseline methods that are described as follows: (i) SVM: a commonly used method in traditional machine learning. In this experiment, the input feature was the average of the word embeddings of the text sequence. (ii) LSTM: a single-layer LSTM network is used to model the input sequence. Here, we used LSTM's final representation as the input of the softmax function for classification. (iii) BiLSTM: a single-layer BiLSTM network is used to model the input sequence. We used BiLSTM's final representation as the input of the softmax function for classification instead of pooling after obtaining the hidden state of each word. (iv) ATT-BiLSTM: an attention mechanism is attached on top of a single-layer BiLSTM network. After using BiLSTM to obtain the hidden state of each word, these hidden state representations were the input of the attention module. (v) H-RNN-CNN [9]: a multilayer network structure for Chinese text sentiment classification, in which the input text was divided into sentences as the input of a middle layer to address the information loss that may be caused by long text. In the model, LSTM was utilized to process context sequences, and CNN was used to capture the relationships among sentences. (vi) CRNN [29]: an architecture combining CNN and RNN (LSTM and GRU), which takes advantage of the coarse-grained local features generated by CNN and the long-distance dependencies learned via RNN for short texts. (vii) fastText: a simple and efficient text classification method that averages word features and feeds them to a linear classifier. (viii) LR-BiLSTM [16]: a linguistically regularized BiLSTM that incorporates sentiment, negation, and intensity words into sentence-level sentiment classification.

Results and Analysis.
Here, we first give the performance comparison with the baseline methods described above. Then, we conduct the ablation study, which aims to explore why the network architecture of IDAN works well. In the tables, the symbol "-" denotes results not reported, and the best performers are in bold. Finally, two visualization cases are presented to illustrate the relationship between the attention weight distribution and the sentiment polarity of words.

Performance Comparison with Baseline Models.
The performance comparison results are given in Table 3, where SVM performs the worst on the ChnSentiCorp, NLPCC-CN, and NLPCC-EN datasets but is better than LSTM on the MR dataset. This may be because the sequences in the first three datasets are longer and more complex for SVM. Compared with LSTM, the accuracy of BiLSTM on the ChnSentiCorp, NLPCC-CN, NLPCC-EN, and MR datasets is improved by 1.5%, 0.31%, 1.07%, and 0.97%, respectively. The possible reason is that BiLSTM can capture contextual information from two directions. Since the attention mechanism can assign a different attention weight to each word, the performance of ATT-BiLSTM is slightly better than BiLSTM on all datasets. Besides, although H-RNN-CNN uses two layers of LSTM to model sentences and uses CNN to capture cross-sentence information, its accuracy is higher than ATT-BiLSTM on the MR dataset but lower on the ChnSentiCorp and NLPCC-CN datasets. Compared with H-RNN-CNN, the performance of CRNN is improved by about 1%. This is because CRNN not only uses multiple CNNs of different sizes to extract the local features of the sequence but also uses LSTM or GRU to capture the long-term dependencies of the sequence.
As a simple method, fastText achieves results comparable to CRNN. Its accuracies on the ChnSentiCorp and NLPCC-EN datasets are even higher than CRNN's by about 0.95% and 0.91%, respectively. Although the LR-BiLSTM model incorporates linguistic resources and obtains good performance on the MR dataset, its accuracy is higher than that of fastText by about 0.29% but lower than that of CRNN by about 0.18%. This may be because LR-BiLSTM did not make full use of linguistic resources.
As can be seen, our IDAN model performs best on all datasets. Compared with the best baseline model, the accuracies of IDAN on the ChnSentiCorp, NLPCC-CN, NLPCC-EN, and MR datasets are improved by 0.94%, 2.99%, 5.11%, and 0.38%, respectively, demonstrating the effectiveness of our proposed method.

Ablation Experiments.
The ablation experiment results are shown in Table 4. Firstly, we compared the experimental performance using different pretrained word vectors. In IDAN-W2V, the BERT embedding was replaced by Word2Vec, which results in a significant decrease in performance compared to IDAN. However, it is noteworthy that the performance of IDAN-W2V is still comparable to CRNN. Similarly, when interactive learning between contextual semantics and sentimental tendency information is not implemented (IDAN-NIL), the experimental performance is slightly degraded on all datasets. Secondly, when we separately ablate the sentimental tendency information part (IDAN-NSTI) and the global attention layer (IDAN-NGA) of the full model, its performance degrades on the ChnSentiCorp, NLPCC-EN, and MR datasets. Notably, when the global attention layer is ablated, the best result of the ablation experiments on the NLPCC-CN dataset is achieved.
These ablation experiments show that the performance of IDAN-W2V is comparable to the baseline model CRNN. However, there is a relatively large gap compared with the performance of the full model. Overall, the situations of IDAN-NIL, IDAN-NSTI, and IDAN-NGA are relatively similar, and their performance is better than IDAN-W2V but slightly lower than the full model.
These results indicate that the BERT embedding brings a considerable performance improvement to our method. Moreover, extracting the sentimental tendency information, learning the interactive representation, and the global attention layer also help improve the classification performance.

Case Analysis.
In this section, an English review from the NLPCC-EN dataset and a Chinese review from the ChnSentiCorp dataset are used for case analysis. Figure 4 shows the visualization of the attention weights calculated by equation (9) for the two test cases. Here, the color intensity reflects the attention weight of the corresponding word, that is, the importance of the word. The sentiment polarity of Figure 4(a) is positive, and that of Figure 4(b) is negative. Both cases are predicted correctly by the IDAN model.
From the attention weight distribution in Figure 4(a), it can be seen that the model assigns greater weight to words or phrases with strong positive sentiment, such as "very very nice quality" and "very good price for what you get." In Figure 4(b), the model assigns greater weight to Chinese words and phrases with strong negative sentiment, such as the phrases meaning "poor sanitary conditions" and "will not stay at the hotel again." This attention weight distribution illustrates that our model can effectively focus on words or phrases that are important for sentiment polarity.

Conclusion and Future Work
In this paper, we propose a novel model called Interactive Dual Attention Network (IDAN), which can utilize the interaction between contextual semantics and sentimental tendency information for sentiment classification. We design an algorithm to obtain sentimental tendency information and extract the BERT embedding as the model's embedding layer. We also use BiLSTM networks to learn the dependencies of contextual semantics and sentimental tendency information, respectively. Finally, multihead attention is used to implement interaction, and global attention is utilized to focus on the important parts of the sequence and to generate the final representation for the classifier. Extensive experiments conducted on four benchmark datasets show that our method is effective and outperforms the competitive baseline methods. Furthermore, ablation experiments illustrate that the BERT embedding brings a considerable performance improvement. Meanwhile, extracting the sentimental tendency information for the interactive representation also contributes to the performance improvement. For future work, improving the algorithm for extracting sentimental tendency information and optimizing the interactive attention network may further improve the classification performance and provide more interpretability. Furthermore, we also plan to introduce more refined linguistic knowledge into the network to make the model more discriminative and robust.
Data Availability

The data and the authors' source code used to support the findings of this study will be available at https://github.com/zhuyl96/IDAN.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.