Multitask Learning for Aspect-Based Sentiment Classification

Aspect-level sentiment analysis identifies the sentiment polarity of aspect terms in complex sentences, which is useful in a wide range of applications. It is a highly challenging task and attracts the attention of many researchers in the natural language processing field. In order to obtain a better aspect representation, a wide range of existing methods design complex attention mechanisms to establish the connection between entity words and their context. Because data collections for aspect-level sentiment analysis are limited in size, mainly owing to the high annotation workload, the risk of overfitting is greatly increased. In this paper, we propose a Shared Multitask Learning Network (SMLN), which jointly trains auxiliary tasks that are highly related to aspect-level sentiment analysis. Specifically, we use opinion term extraction due to its high correlation with the main task. Through a custom-designed Cross Interaction Unit (CIU), effective information of the opinion term extraction task is passed to the main task, with performance improvement in both directions. Experimental results on SemEval-2014 and SemEval-2015 datasets demonstrate the competitive performance of SMLN in comparison to baseline methods.


Introduction
Sentiment analysis is one of the fundamental tasks in natural language processing and has received an increasing level of attention in recent years. Aspect-level sentiment classification focuses on fine-grained sentiment analysis and is widely applied to automatic processing tasks for online review text. The purpose of this task is to determine the emotional polarity of entities in each aspect of a review piece [1][2][3][4], with each entity consisting of one or multiple words. The number of aspect terms in a sentence is arbitrary [5][6][7][8][9], and each aspect may carry a different sentiment polarity. Within the sentence "I love this program, it is superior to windows movie maker" in Figure 1, "program" and "windows movie maker" are two separate aspect terms, but they carry positive and negative emotions, respectively. In the example above, "love" and "superior" are defined as opinion terms. It can be observed that the emotional polarity of aspect words comes from their corresponding opinion words. Existing algorithms for aspect-level sentiment analysis are mainly divided into feature engineering methods and deep learning models. For the methods based on feature engineering, the main idea is to design a series of handcrafted features, and a traditional classifier is trained to achieve high emotion classification accuracy [5,10,11]. This class of methods consumes a lot of manpower, and the vocabulary dependency on individual scenarios makes it difficult to generalize.
Models based on deep neural networks have better potential in solving these issues. Through multilayer neural networks, low-dimensional vectors representing term semantics can be effectively trained without a complex feature engineering process. These embedding representations become the input of downstream neural networks, in order to identify the emotional polarity of target words. Satisfactory results have been achieved through various target-dependent sentiment mechanisms [5][6][7][8][9]. These Long Short-Term Memory (LSTM)-based methods take static word vectors such as Word2Vec and GloVe as input and use the feature representation of entity words for sentiment classification. They simply fuse contextual information into the representation of the target word, without considering their semantic correlation. Recent research efforts apply the attention mechanism to consider interaction between the aspect words and their context. A variety of complex structures were designed to calculate attention weights between aspect words and their context [12][13][14][15]. As most aspect-level datasets do not come in large scale, the risk of overfitting greatly increases with the model size and complexity. As a result, methods based on the attention mechanism tend to make more mistakes when mining deeper features.
In recent years, multitask learning (MTL) has become an active research area in machine learning; it improves the generalization performance of a task by jointly training other related tasks [16]. Due to the success of MTL [17], several NLP models based on neural networks have adopted this mechanism [18][19][20]. By using shared representations to learn semantically related tasks in parallel, MTL captures the correlation between tasks, improving the generalization ability of the model under certain circumstances. The multitasking architecture of these models contains shared lower layers to train their common features, and the remaining layers are customized to handle different tasks. For aspect-level sentiment classification, He et al. [21,22] have made some attempts in MTL, and their studies show that joint training with document-level sentiment classification can significantly improve performance at the aspect level. Yu and Jiang [23] design an auxiliary task that is highly relevant to sentiment analysis, which predicts whether the input sentence contains positive or negative words.
In this study, we propose a Shared Multitask Learning Network (SMLN), which employs opinion word extraction as an auxiliary task and trains it together with aspect-level sentiment analysis. This task is an upstream step that extracts key opinion terms, and its performance also affects the accuracy of the main task. The pretrained BERT model is used as the underlying structure shared by the two tasks. SMLN introduces a new feature sharing mechanism, the Cross Interaction Unit (CIU), to facilitate the information exchange between the main task and the auxiliary task. Specifically, CIU consists of multiple groups of attention mechanisms, integrating the information of the two tasks from different viewing angles. Extensive experiments on the SemEval-2014 and SemEval-2015 datasets indicate that our SMLN model outperforms other baseline methods in terms of classification accuracy. For a fair comparison, some of the baseline methods are built on the same BERT-based representation, so the performance boost originates from the multitask setting and the information sharing mechanism.
Main contributions of the work include the following: (1) a multitask learning method customized for fine-grained sentiment analysis, which utilizes additional opinion word information to improve the learning performance of the current task and reduce the risk of overfitting; (2) a multitask sharing mechanism to accomplish multiview information transfer between tasks.

Related Work
In this section, relevant research on aspect-level sentiment analysis and multitask learning is reviewed, covering both traditional neural networks and more recent BERT-based models.

Conventional Neural Network.
In aspect-level sentiment analysis, it has been agreed among researchers that context words have different influence on the sentiment polarity of multiple targets in the opinion sentence. When a learning system is built for sentiment classification, the most important task is to integrate the relationship between each target and its context. Vo and Zhang [7] divide the original sentence into the target words, the left context, and the right context and use different networks to extract their features. Tang et al. [8] propose a target-dependent LSTM model. In order to represent the characteristics of the aspect word more accurately, they use two LSTMs to encode the previous context and the next context including the target itself.
Previous studies try to establish the connection between the target words and their context, but the interaction information is hard to capture with canonical methods. Wang et al. [12] propose an LSTM network based on the attention mechanism, which is used for aspect-level sentiment classification. When different aspects are involved, this model automatically directs its attention to different parts of the sentence. Li et al. [13] design an end-to-end structure, constantly focusing on an aspect term and its context. Ma et al. [14] argue that entities and their corresponding contexts should be modeled interactively, and calculate attention weights for target and context, respectively. Fan et al. [15] propose a multigrained attention network. In addition to the attention calculation of the aspect words and the overall context, they also introduce fine-grained attention. The purpose is to describe the influence of an aspect on its context, or of the context on aspect words in the reverse direction. Tang et al. [24], Chen et al. [25], and Zhu and Qian [26] propose a readable and writable external memory module, which shows the contribution of each word to the final sentiment classification.

Multitask Learning.
All methods mentioned above focus on a single task and obtain acceptable performance in aspect-level sentiment analysis. If related tasks are built on the same text representation and language modeling, it is possible to further improve the learning performance of the main task. In recent years, studies have adopted multitask learning methods to handle fine-grained sentiment analysis. Yu and Jiang [23] design an auxiliary task to determine if an input sentence contains positive or negative sentiment words. This auxiliary task is closely related to the main task of sentiment analysis. They propose to train the sentiment analysis task and the hidden feature representation task together, and the auxiliary task helps in generating more representative embeddings for sentiment analysis sentences. He et al. [21] propose a multitask learning method that jointly trains the document-level and aspect-level sentiment classification tasks. By leveraging the information at the document level, the limitation of current aspect-level models is alleviated with the introduction of larger datasets. He et al. [22] employ an information delivery mechanism so that identical implicit expressions can be shared among multiple tasks in an iterative manner. This multitask model can utilize more global knowledge to improve the accuracy of sentiment analysis.
[Figure 1: the sentence "I love this program, it is superior to windows movie maker", annotated with Aspect Term and Opinion Term labels.]

BERT-Based Networks.
Traditional neural networks include structures like LSTM or multilayer CNN as encoders and use word vectors generated by Word2Vec or GloVe. However, the performance of these models is limited by the static nature of word vectors, and the downstream networks cannot break the performance bottleneck imposed by the text representation. In order to solve this problem, recent research has focused on large-scale pretrained language models like ELMo, GPT, and BERT, and gains have been observed in many text applications with the richer representation. Sun et al. [27] propose four methods of constructing auxiliary sentences with the aspect term, feeding auxiliary sentences together with the original sentence as the input to the BERT model. This method transforms aspect-level sentiment analysis into a sentence pair classification task. Xu et al. [28] learn domain knowledge based on large-scale pretraining with BERT in the same domain as the original dataset. Gao et al. [29] design an intuitive method to use the feature expression of the target word on aspect-level sentiment classification, with slight modification of the BERT model. Zhou et al. [30] use a syntax- and knowledge-based graph convolutional network (SK-GCN) model to enhance the representation of sentences for given aspects. Song et al. [31] propose a semantics perception and refinement network (SPRN) for aspect-based sentiment analysis. Local semantic features are extracted by multichannel convolution operations. They use gated networks to enhance aspect and context connections while filtering noise.

The Approach
The SMLN architecture is shown in Figure 2. In this section, we will elaborate on the details of the SMLN structure for aspect-level sentiment classification. It starts with the definition of the main task and auxiliary tasks, together with the necessary notations. Then, the BERT-based representation is included as a shared layer. CIU, which is the unit that establishes interaction between the main and auxiliary tasks, is introduced after that.

Problem Definition and Notations.
$s_i$ denotes a sentence from the training dataset, which consists of a sequence of $n$ tokens: $s_i = \{w_1, w_2, \ldots, w_n\}$. Sentence $s_i$ includes target words $t$ that need to be annotated with their polarity of sentiment, and opinion words $o$ that carry the corresponding emotion information. A target contains $m$ words; the opinion terms include $k$ words; and $t$ and $o$ are both subsequences of $s_i$. For the main task, aspect-level sentiment classification (ASC), the goal is to determine the emotional polarity of the target word $t$ in the sentence $s_i$. Available tags include "positive," "negative," and "neutral." For the auxiliary task, opinion terms extraction (OTE), the aim is to extract all the opinion terms appearing in a sentence. For simplicity, we treat OTE as a sequence tagging problem with the BIO tagging scheme. Specifically, we use three categories of tags, $Y^{OTE} = \{B, I, O\}$, indicating the beginning of an opinion term, the interior of an opinion term, and other words, respectively. For example, for the sentence "The screen is large and crystal clear with amazing colors", its opinion extraction labels are shown in Table 1.
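The BIO scheme described above can be illustrated with a minimal Python sketch. The helper name `bio_tags` and the span format are hypothetical, and the opinion spans chosen for the example sentence ("large", "crystal clear", "amazing") are an assumption made for illustration, since Table 1 is not reproduced here.

```python
def bio_tags(tokens, opinion_spans):
    """Return one B/I/O tag per token.

    opinion_spans holds (start, end) token indices with end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in opinion_spans:
        tags[start] = "B"              # first word of an opinion term
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining words of the same term
    return tags


tokens = "The screen is large and crystal clear with amazing colors".split()
# Assumed opinion terms (illustration only): "large", "crystal clear", "amazing".
spans = [(3, 4), (5, 7), (8, 9)]
tags = bio_tags(tokens, spans)
for word, tag in zip(tokens, tags):
    print(f"{word}\t{tag}")
```

Under these assumed spans, "crystal clear" receives the tags B and I, while single-word opinion terms receive only B.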

Shared Layer.
The input embedding layer maps the original text representation into a high-dimensional vector space. The pretrained BERT model is employed to obtain the embedding representation of each word, with fine-tuning capability in the original Transformer network. BERT [32] is one of the leading language representation models, which uses a bidirectional Transformer [33] network to pretrain a language model on a large text corpus, and the pretrained representation can be fine-tuned on other tasks. The task-specific BERT design is able to represent either a single sentence or a pair of sentences as a consecutive array of tokens. For a given token, its input representation is constructed by adding up its corresponding token, segment, and position embeddings. For a typical classification task, the first word of the sequence is identified with a unique token [CLS], and a fully connected layer is attached at the [CLS] position of the last encoder layer. The last layer is usually a softmax layer, which completes the classification task.
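As a small illustration of the sentence "its input representation is constructed by adding up its corresponding token, segment, and position embeddings", the following Python sketch sums three toy vectors. The function name and the 4-dimensional values are hypothetical; real BERT-base embeddings are 768-dimensional, and any normalization applied after the sum is omitted here.

```python
def input_representation(token_emb, segment_emb, position_emb):
    """Element-wise sum of the three embeddings belonging to one token."""
    return [t + s + p for t, s, p in zip(token_emb, segment_emb, position_emb)]


# Toy 4-dimensional vectors, chosen only to make the arithmetic easy to follow.
tok = [0.5, -1.0, 0.0, 2.0]
seg = [0.25, 0.25, 0.25, 0.25]
pos = [0.0, 1.0, -0.5, 0.5]
x = input_representation(tok, seg, pos)
print(x)
```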
BERT has two parameter-intensive settings. BERT-base: the number of Transformer blocks is 12, the hidden layer size is 768, the number of self-attention heads is 12, and the total number of parameters for the pretrained model is 110M. BERT-large: the number of Transformer blocks is 24, the hidden layer size is 1024, the number of self-attention heads is 16, and the total number of parameters for the pretrained model is 340M. The BERT-large model requires considerably more memory than BERT-base. As a result, the maximal batch size for BERT-large is so small on a single GPU with limited memory that it actually hurts the model accuracy, regardless of the learning rate [32]. Therefore, we use BERT-base as our baseline model, with modifications that do not significantly increase the model size.

Scientific Programming
Following the notations in the previous section, a sentence of size $n$ contains a target/aspect that is composed of $m$ terms. BERT uses WordPiece [34] as its tokenizer. After the multilayer bidirectional Transformer network, the word vector matrix $S_r$ of the sentence is represented by the hidden states of the last layer: $S_r = \{x_0, x_1, \ldots, x_n, x_{n+1}\}$, with $S_r \in \mathbb{R}^{(n+2) \times d}$, where $d$ is the dimension of the hidden state. $x_0$ is the vector of the sentence classification mark [CLS], and $x_{n+1}$ is the word vector of the sentence separator [SEP]. Then, we use two Bi-LSTM networks to decompose $S_r$. The outputs of the two networks are denoted as $S_p$ and $S_o$, which focus on ASC and OTE, respectively.
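The $(n+2)$-row layout described above can be sketched in Python as follows. The helpers `wrap_sentence` and `target_positions` are hypothetical names, and the sketch assumes whole-word tokens for readability, whereas WordPiece may split a word into several pieces.

```python
def wrap_sentence(tokens):
    """Add the [CLS] and [SEP] marks, giving n + 2 positions for n tokens."""
    return ["[CLS]"] + list(tokens) + ["[SEP]"]


def target_positions(tokens, target):
    """Indices of the first occurrence of the target in the wrapped sequence."""
    wrapped = wrap_sentence(tokens)
    m = len(target)
    # Positions 1..n hold the sentence tokens; 0 is [CLS] and n+1 is [SEP].
    for start in range(1, len(wrapped) - m):
        if wrapped[start:start + m] == list(target):
            return list(range(start, start + m))
    return []


tokens = "I love this program , it is superior to windows movie maker".split()
print(len(wrap_sentence(tokens)))                                # n + 2 rows
print(target_positions(tokens, ["windows", "movie", "maker"]))   # rows of the target
```

The returned indices are the rows of the last-layer hidden states that belong to the target words, which is the information needed later when a target representation is pooled.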

Cross Interaction Unit.
When generating the representation with two independent Bi-LSTMs, the information of ASC and that of OTE, as two individual tasks, are separated from each other. However, the reality is that the two parts are closely related. For example, when "love" appears around an aspect term, its polarity is likely to be positive. The Cross Interaction Unit (CIU) is designed to exchange information between these tasks, mining opinion terms and identifying aspect sentiment polarity in a co-optimization manner. The CIU architecture is shown in Figure 3.
A basic CIU is composed of a pair of attention modules: polarity attention and opinion attention. We define the outputs $S_p = \{p_0, p_1, \ldots, p_n, p_{n+1}\}$ and $S_o = \{o_0, o_1, \ldots, o_n, o_{n+1}\}$ of the two Bi-LSTM networks to represent the distribution of the sentiment feature and the opinion feature, respectively, where $p_i \in \mathbb{R}^{2d}$ and $o_i \in \mathbb{R}^{2d}$ are representations of the $i$-th token $w_i$. $S_p$ and $S_o$ (also written $P$ and $O$) are input to the polarity attention module, in which we first calculate the composition vector $\alpha^p_{ij} \in \mathbb{R}^K$ between $p_i$ and $o_j$ through a tensor operation with $G^p \in \mathbb{R}^{K \times 2d \times 2d}$, a 3-dimensional tensor. A tensor operator can be viewed as multiple bilinear terms that are capable of modeling more complicated compositions between two vectors [35]. $K$ is a hyperparameter representing the number of channels of $G^p$. Each channel of $G^p$ is a bilinear term that can extract specific information. A larger $K$ can represent a more complicated intrinsic correlation between sentiment classification features and opinion word extraction features: as the value of $K$ increases, more information is extracted, at the cost of higher complexity.
After obtaining the composition vectors, the attention score $e^p_{ij}$ is calculated from $\alpha^p_{ij}$ and $v_p \in \mathbb{R}^K$, which can be seen as a weight vector that measures each value of the composition vector. $e^p_{ij}$ is a scalar value, and these values compose a matrix $E^p$. A higher score $e^p_{ij}$ indicates that the current sentiment feature of the $i$-th word captures more information from the opinion expression of the $j$-th word. Finally, we fuse the sentiment feature $P$ and the opinion feature $O$ generated by the original Bi-LSTMs, using the attention scores normalized by $\mathrm{softmax}_r$, a row-based softmax function. The result, $S_p'$, represents the final sentiment expression of the sentence. Similarly, we can get the final expression of the opinion feature, $S_o'$.
For the sentiment branch, the target word representation is denoted $T_r \in \mathbb{R}^{m \times 2d}$, where $m$ represents the length of the target word. A max-pooling operation is performed on the target word vectors, so that the most important feature at each position is selected across the different words, giving the vector $V$.
Finally, $V$ is fed into a fully connected layer for classification. For the opinion extraction task, we use a dense layer plus a softmax operation to generate the final opinion tags.
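To make the data flow of the polarity attention concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the display equations are not reproduced in this text, so the `tanh` nonlinearity on the bilinear terms and the residual form of the fusion (each sentiment vector plus its attention-weighted sum of opinion vectors) are assumptions, and all function names are hypothetical. Only the bilinear tensor channels, the weight vector $v_p$, the row-based softmax, and the max-pooling over target rows come from the description above.

```python
import math


def bilinear(p, G_k, o):
    """Scalar p^T G_k o for one channel G_k of the 3-dimensional tensor."""
    return sum(p[a] * G_k[a][b] * o[b]
               for a in range(len(p)) for b in range(len(o)))


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [x / total for x in exps]


def polarity_attention(P, O, G, v):
    """Fuse sentiment features P with opinion features O.

    P, O: lists of token vectors; G: list of K square matrices; v: K weights.
    Assumed form: e_ij = sum_k v[k] * tanh(p_i^T G_k o_j), then
    fused_i = p_i + sum_j softmax_row(e_i)[j] * o_j.
    """
    K = len(v)
    E = [[sum(v[k] * math.tanh(bilinear(p, G[k], o)) for k in range(K))
          for o in O] for p in P]
    A = [softmax(row) for row in E]          # row-based softmax
    fused = [[p[d] + sum(A[i][j] * O[j][d] for j in range(len(O)))
              for d in range(len(p))] for i, p in enumerate(P)]
    return fused, A


def max_pool(rows):
    """Column-wise maximum over the target-word rows."""
    return [max(col) for col in zip(*rows)]


# Toy input: 3 tokens, feature size 2, K = 2 channels (illustrative values only).
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
O = [[2.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
G = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
v = [1.0, 0.5]
fused, A = polarity_attention(P, O, G, v)
V = max_pool(fused[1:3])   # pretend tokens 1 and 2 form the target
print(V)
```

The opinion attention would follow the same pattern with the roles of `P` and `O` swapped and its own tensor and weight vector.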

Joint Learning.
Output from the previous step contains representations of the original text for two purposes: one is polarity labeling, and the other is opinion term labeling. These tasks require different forms of output, so it is necessary to apply gradient descent training that better fits their respective applications.
For the sentiment classification branch, V represents the polarity characteristics of the target word. It passes through the fully connected softmax layer to obtain probability values representing emotion polarities.
After that, we use the standard cross-entropy loss as the cost function: $J_{ASC}(\theta) = -\sum_{a \in D} \sum_{c=1}^{C} P^g_c(a) \log P_c(a)$, where $a$ represents an aspect term appearing in training data $D$. $C$ represents the number of categories of sentiment classification. $P_c(a)$ is the probability of predicting $a$ as class $c$ from the softmax layer, and $P^g_c(a)$ indicates whether class $c$ is the correct sentiment category, with value 1 or 0.
For the opinion term extraction branch, all possible outputs of the tag sequence are defined as the array $Y$, and $Y_{real}$ is the true label sequence. From the SMLN, the feature value of each location is converted into a probability value through softmax. The goal of model optimization is to increase the probability of the appearance of the true label and ultimately reduce the value of the loss function, and the objective loss function of the opinion word recognition model is defined on these probabilities. Losses of the main sentiment classification task and the auxiliary opinion word extraction task are aggregated to form the total loss $J(\theta)$ of the framework.
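The aggregation of the two losses can be sketched in Python as below. This is a hedged illustration rather than the paper's exact objective: it assumes that the opinion extraction loss is a token-level negative log-likelihood of the true tags and that the two losses are combined by an unweighted sum, neither of which is confirmed by the text. With a one-hot gold label, the cross-entropy over classes reduces to the negative log of the probability assigned to the gold class, which is what `cross_entropy` computes.

```python
import math


def cross_entropy(probs, gold):
    """Cross-entropy against a one-hot label: -log of the gold-class probability."""
    return -math.log(probs[gold])


def asc_loss(examples):
    """Sum over aspect terms; examples is a list of (class_probs, gold_class)."""
    return sum(cross_entropy(p, g) for p, g in examples)


def ote_loss(tag_probs, gold_tags):
    """Assumed token-level negative log-likelihood of the true B/I/O tags."""
    return sum(cross_entropy(p, g) for p, g in zip(tag_probs, gold_tags))


def joint_loss(asc_examples, tag_probs, gold_tags):
    """Assumed unweighted sum of the main and auxiliary losses."""
    return asc_loss(asc_examples) + ote_loss(tag_probs, gold_tags)


# One aspect term with gold class 0, and a two-token tag sequence (toy values).
asc_examples = [([0.5, 0.25, 0.25], 0)]
tag_probs = [[0.25, 0.25, 0.5], [0.5, 0.25, 0.25]]
gold_tags = [2, 0]
total = joint_loss(asc_examples, tag_probs, gold_tags)
print(total)
```

A weighted sum with a tunable coefficient on the auxiliary loss would be a natural variant if the two losses differ in scale.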

Datasets.
Our framework is evaluated on three benchmark datasets from SemEval-2014 [4] and SemEval-2015 [36]. Statistics of the datasets are shown in Table 2. For simplicity, we use 14Lap, 14Res, and 15Res to denote SemEval-2014 Laptops, SemEval-2014 Restaurants, and SemEval-2015 Restaurants, respectively. There are four emotional labels for the entities in the datasets: "positive," "negative," "neutral," and "conflict." "Conflict" means that more than one emotion is expressed toward the same entity. Labels on opinion terms are provided by Wang et al. [35,37].

Experiment Settings.
The pretrained uncased BERT-base model is used for fine-tuning. The number of Bi-LSTM hidden units is set to 300, and the output dimension of Bi-LSTM is 600. The hyperparameter K is set to 5. In the fine-tuning process, the same parameter settings as the BERT model are kept to ensure comparable results to other baseline models. To avoid overfitting, the dropout  Table 3.

Baseline Approaches.
Following the convention of related work, the average accuracy metric is used to measure the overall performance of sentiment classification models.
To show the effectiveness of our model, several mainstream models for aspect-based sentiment analysis are used for comparison, including the following:
TD-LSTM [8] uses two LSTM networks to model the correlation between the target word and its context, and it concatenates the last hidden states of the two parts to predict the sentiment polarity of the target.
ATAE-LSTM [12] applies a typical attention-based LSTM structure to capture the key part of the sentence in response to a given aspect.
MemNet [24] is a deep memory network that applies multiple attention layers to capture the importance of each context word and predicts sentiments based on the sentence representation at the top level.
RAM [25] has a multilayer architecture where each layer consists of an attention-based aggregation of word features and a GRU cell to strengthen the expressive power of MemNet.
IAN [14] contains two LSTMs to encode target words and context information independently and completes the interaction of the two parts of information through the attention mechanism.
TNET [38] proposes a transformation unit for target representation, so that word coding can fully capture the key information of the target. In addition, the authors use a context feature preservation mechanism to better obtain useful information from the context.
TG-SAN [39] includes two core units. One is the Structured Context extraction Unit (SCU), which undertakes the task of encoding semantic groups and extracts context fragments related to objects. The second is the Context Fusion Unit (CFU), with the purpose of learning the contribution of the extracted context to the object.
IMN [22] designs an end-to-end interactive multitask learning network for a variety of fine-grained sentiment analyses. General word vectors and domain-specific word vectors from [40] are concatenated as input. In the model, a special information transfer mechanism is implemented to help the model transfer information between the token level and the document level.
PRET + MULT [21] uses document-level knowledge to improve the performance of aspect-level sentiment analysis. PRET represents the use of documents to train the weights of the LSTM, and MULT implies the use of multitask learning methods to complete document-level and aspect-level sentiment analysis tasks.
BERT-FC is the vanilla model built on BERT. The base BERT model is fine-tuned on the target task, and information is extracted at the placeholder [CLS] token for sentiment analysis.
TD-BERT [29] is also based on the BERT fine-tuning model. Instead of the BERT default token [CLS], the vector corresponding to the position of the target term is fed into the downstream pipeline. The output is also processed by softmax to get the final emotion category.
AEN-BERT [9] proposes an Attentional Encoder Network (AEN), which does not use the traditional recurrent structure. It employs attention-based encoders for the modeling between a target and its context. The authors raise the label unreliability issue and introduce label smoothing regularization.
BERT-PT [28] explores a posttraining method on the BERT model with related datasets, with an expectation that the introduction of additional data will improve the fine-tuning performance of BERT for sentiment classification.
BERT-pair-QA-M [27] constructs an auxiliary question from the target and uses it together with the original sentence as input. It converts the sentiment classification task into a special QA problem. Since the original paper is applied to task 2 of SemEval-2014, which is not the same as ours, reproduced performance data from [29] is taken as the result.
SK-GCN-BERT [30] proposes a new syntax- and knowledge-based graph convolutional network (SK-GCN) model, which leverages the syntactic dependency tree and commonsense knowledge via GCN. In particular, to enhance the representation of the sentence towards the given aspect, it develops SK-GCN to combine the syntactic dependency tree and the commonsense knowledge graph.
SPRN-BERT [31] proposes a semantics perception and refinement network (SPRN) for aspect-based sentiment analysis. Local semantic features and global context information are extracted by multichannel convolution and SA, respectively. Then, a gated network (DRG) is used to enhance the connection between aspect and context while filtering noise.
Table 4 shows the performance of our model together with the previous methods described above. Models in the first part of the table use traditional neural network methods. These methods rely on well-designed attention and LSTM to process static word embeddings. These pretrained word embeddings are generated on large-scale generic corpora or domain-related datasets [40] through the Word2Vec [41] or GloVe [42] method. The second part includes previous multitask learning methods on aspect-level sentiment analysis. The third part shows other models based on the BERT representation. For better performance, they also include customized revisions for the fine-grained sentiment analysis task. Compared with the methods above, our SMLN model shows clear performance improvement over the baseline methods on three datasets from SemEval-2014 and SemEval-2015.

Results and Analysis.
This result benefits from the multitask learning setting as well as the information exchange mechanism in CIU.
For the ABSA task, BERT-based models have achieved significant accuracy improvement in comparison to the original static word embeddings. The multilayer Transformer stack structure of BERT has the clear advantage of representing the intrinsic semantics of terms in the context. BERT is a better choice as the shared representation layer in the model, in comparison to static embeddings, which lack flexibility in word semantics. When compared to other BERT-based methods, our model still shows significant improvements, about 5% over the baseline BERT-FC model. Our analysis shows that the improvement is mainly from the multitask learning framework. In this work, we introduce opinion terms extraction (OTE) as an auxiliary task. OTE and ABSA tasks are closely related, but there are also clear differences between them. Complementary information from similar but different applications can act as a regularization term between tasks, which effectively improves the generalization performance of the model. In order to further improve the transmission of complementary information between different tasks, we design the CIU module based on an improved self-attention mechanism.

Ablation Study.
In order to study the effects of different components, we gradually add the auxiliary task and the CIU module starting from the vanilla model. The vanilla model is the combination of TD-BERT and an LSTM network.
The experimental results are shown in Table 5, in which each variant adds a new module to the previous model. With only the auxiliary task added to form a multitask framework, the model achieves a small but noticeable improvement on both the 14Lap and 14Res datasets. This can be considered the generalization performance improvement brought by the multitask method. At this stage, the model cannot benefit from the emotional information provided by the auxiliary task. When the CIU module is added, the additional improvement is about twice that of the previous step. With the CIU, the emotional knowledge has been successfully transferred to the ABSA task.

Discussion
For the aspect-based sentiment classification task, we design an SMLN based on multitask learning and the attention mechanism. This network can better utilize the rich emotional information in the context and related information among similar tasks at the same time. It tries to solve the problems of sentiment classification and opinion word extraction in an end-to-end manner. In this model, text information is first converted into a vector representation by the pretrained BERT model. This representation is a common feature in the shared layer that applies to all downstream tasks, and the output of the shared layer enters two independent Bi-LSTM networks to learn the unique features of each task. In particular, this article designs an information interaction unit between the two independent representations. This module accomplishes the function of information transfer between the two parts based on the attention mechanism. On publicly available sentiment analysis datasets, its performance is compared to many existing ABSA methods, including some recent work that claims to be state-of-the-art. On all three datasets, the SMLN model achieves competitive results in aspect-based sentiment classification. Its classification accuracy reaches 80.09%, 85.67%, and 86.31% on the 14Lap, 14Res, and 15Res datasets, respectively. To verify the value of its two main components, the auxiliary task and the CIU module, an ablation test is carried out; it shows the step-by-step performance improvement when each individual component is added. The results demonstrate the effectiveness of the SMLN, together with detailed analysis for each component.
In the NLP literature, the attention mechanism has been widely used because it can better learn long-range sequence knowledge. However, the latest research shows that the pure self-attention network (SAN) without skip connections and multilayer perceptrons (MLP) loses certain expression ability. The loss of feature extraction ability is related to the network depth in double exponential order. Specifically, the researchers prove that the network output converges to a rank-1 matrix at the rate of cubic convergence [43].
Thus, we are currently focusing on the following extensions to the proposed method. First of all, we try to design multilayer attention units in the CIU module to obtain stronger feature fusion ability, which is helpful to understand and infer the implied semantics in sentences. Further research aims to explore the changes of the internal attention matrix in the process of model reverse updating. Second, we plan to introduce more subtasks into our multitask learning framework, such as entity extraction. The addition of related tasks helps to improve the generalization performance of each task. Finally, we are exploring the effectiveness of our method for other NLP tasks, such as relationship extraction. Overfitting is a common issue for NLP tasks, especially when the model complexity exceeds data size. Multitask learning is an effective way to improve the generalization ability of a complex model, but understanding the internal correlation between these tasks is more important than blindly stacking more tasks.

Conflicts of Interest
The authors declare no conflicts of interest.