An Improved BERT and Syntactic Dependency Representation Model for Sentiment Analysis

Text representation of social media is an important task for sentiment analysis of users. With a better representation, we can more accurately acquire the semantic information expressed by online users. However, existing works still leave room for improvement. In this paper, we construct and implement a sentiment analysis model based on an improved BERT and syntactic dependency. Firstly, by studying the word embeddings of BERT, we improve the embedding representation: an attention mechanism is added to the word embeddings, sentence embeddings, and position embeddings. Secondly, we perform dependency syntax analysis of the text to obtain the dependency relationships between different syntactic components. For the different syntactic components, a hierarchical attention mechanism is used to construct the phrase embeddings or block embeddings. Finally, we splice the syntactic blocks for sentiment analysis. Extensive experiments on two standard data sets show that the proposed model outperforms the baselines.


Introduction
In recent years, social media such as WeChat, Facebook, Twitter, and Fetion have become widespread and are changing people's lifestyles and habits. How to represent the text and understand its semantic information accurately is an important task. However, existing works still leave room for improvement. In general, the composition of a text can be subdivided into paragraph-level, sentence-level, and word-level. Words are the basic components, and the representation of a text can be divided into a series of word combinations. Therefore, research on the word-level representation is extremely important compared with the other two.
With the innovation of hardware technology, we can perform a large number of calculations and learn a large number of parameters. However, how to integrate more semantic information into the text representation is an important and difficult task for natural language processing. Harris put forward an important idea on text representation as early as the 1950s, the famous distributional hypothesis: words with similar contexts have similar semantics. Firth elaborated on Harris' thoughts a few years later. A more direct expression is that the semantic information about a word is mainly determined by its context. In the last ten years, computing capability has been greatly improved, especially through the wide application of GPUs and TPUs, which has made the analysis, calculation, and processing of big data easier. The contributions of our paper are as follows:
(1) We have improved BERT (iBERT) to obtain a better representation. The Token Embeddings (TEs), Segment Embeddings (SEs), and Position Embeddings (PEs) each have a different attention weight.
(2) We have constructed the syntax tree of block embeddings based on syntactic dependency.
(3) Combining with the attention mechanism, we have constructed the embeddings representation of the text.

Related Works
The expression of any language can be divided into several levels, such as paragraph-level, sentence-level, and word-level. The basic unit of meaningful representation is the word.
There are two methods for the vectorized representation of words: the One-Hot model and the Distributed Representation model. The idea of One-Hot representation is very simple. The dimension of the word embeddings equals the number of distinct words; that is, each word vector has as many dimensions as there are words in the vocabulary. Only the position that corresponds to the word is set to 1, and the remaining positions are set to 0. For instance, the word embeddings of "computer" and "PC" are [0, 0, 1, 0, 0, 0, 0] and [0, 0, 0, 0, 0, 1, 0], respectively. The two words have the same meaning, yet the similarity between their vectors is zero. Therefore, One-Hot representation cannot express the similarity of words. Moreover, as the amount of data increases, it is prone to the curse of dimensionality. Therefore, many applications have adopted the Distributed Representation model.
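The zero-similarity problem of One-Hot vectors can be checked directly. The following is a minimal sketch that reuses the two vectors quoted above; the helper names `one_hot` and `dot` are illustrative and not part of the original work.

```python
def one_hot(index, size):
    """Return a one-hot list with a 1 at `index` and 0 elsewhere."""
    return [1 if i == index else 0 for i in range(size)]


def dot(u, v):
    """Inner product, used here as an unnormalized similarity."""
    return sum(a * b for a, b in zip(u, v))


# The two example vectors from the text (vocabulary of seven words).
computer = one_hot(2, 7)
pc = one_hot(5, 7)

# Distinct one-hot vectors never share a non-zero position,
# so their similarity is zero regardless of word meaning.
similarity = dot(computer, pc)
```

Any two different words receive similarity 0 and every word has similarity 1 with itself, which is why a one-hot vocabulary carries no graded notion of semantic closeness.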

Distributed Representation.
To capture the semantic information of words and alleviate the problems above, there are two classic models, Word2Vec and BERT. In 2003, Bengio et al. [1] proposed the NNLM model, which obtained the word embeddings while training and constructing a language model. On this basis, Mikolov et al. [2] proposed Word2Vec in 2013, which contained two models (Continuous Bag-of-Words and Skip-gram). The CBOW model used the context to predict the current word, while the Skip-gram model used the current word to predict the context.
In 2018, Devlin et al. [3] proposed the BERT model (Bidirectional Encoder Representations from Transformers), which is another substantial achievement after Word2Vec. It achieved the best results at the time on 11 natural language processing tasks. This achievement also proved the importance of bidirectional, pretrained models for text representation. Many related models have appeared one after another, such as SpanBERT [4], RoBERTa, and XLNet [5]. To further improve text processing, a hybrid model of convolutional neural network (CNN) and Long Short-Term Memory (LSTM), based on the fusion of text features and language knowledge, was proposed [6]. Chen et al. [7] proposed a new representation learning method that combined a variational autoencoder (VAE) with density-based spatial clustering of applications with noise (DBSCAN).

Coarse-Grained Semantic Representation.
Combining textual semantics, we can construct text representations of larger granularity, such as grammatical blocks, sentence-level, and document-level. Paragraph Embeddings [8] and Skip-Thoughts were the influential models. Paragraph Embeddings consisted of two submodels. One predicted the central word from topic embeddings and context information. The other used the paragraph or sentence level to evaluate the probability of words. In contrast, Skip-Thoughts had an integrated encoder and decoder which modeled the context-related topics of physically adjacent sentences. Furthermore, to achieve accurate semantic information in multiple documents, Lin et al. [9] proposed a semantic search model for knowledge documents. Yan and Gao [10] studied the coupling of internal topics and topological structure, and they modeled large-grained semantics. Wu et al. [11] proposed a multigranularity and cross-text semantic matching method based on a deep neural network, which obtained better results in the text matching field.
In recent years, due to the wide application of deep learning in text processing, the combination of multiple models (such as RNN, CNN, LSTM, GRU, Transformer, and BERT) has been very widely used. Sun et al. [12] proposed a secure indoor crowdsourced localization system, BERT-ADLOC, which was based on BLE fingerprints. The system consisted of two main parts: the adversarial sample discriminator BERT-AD and the indoor localization model BERT-LOC. Jiang and He [13] presented an attention mechanism that differentiated the focus on the output of ResNet and of the long short-term memory for the features of the sequences. Alahmadi et al. [14] proposed a smartphone-based periocular recognition method which used a deep convolutional neural network and collaborative representation. Cross-modal convolution could enable the use of efficient CNN-style layers for multimodal sequential models.
In addition, other models which have obtained excellent performance in image fields have been gradually migrated and applied to some subtasks of text processing. The convolutional models, which combine words and phrases, have achieved better results in classification and sentiment analysis [15].

The Text Representation Model of Online Social Media
Any language has its corresponding language features and grammatical rules, which are the key requirements for meaningful expression. Therefore, by making use of the word embeddings and the grammatical structure, we can construct a better semantic representation of the text. Firstly, given a text in social media, it is necessary to preprocess the text (such as word segmentation and part-of-speech tagging). Secondly, we propose the improved BERT model to obtain better word embeddings and utilize the direct dependencies of words to build a dependency tree. Finally, the improved BERT model (iBERT) and the dependency trees are used to construct the semantic representation of the text. The framework is shown in Figure 1.

Word Embeddings Based on the iBERT.
BERT obtains the input embeddings by summing multiple embeddings: the Token Embeddings (TEs), Segment Embeddings (SEs), and Position Embeddings (PEs). In our improved BERT, the final inputs are instead represented by an attention-weighted sum of the three embeddings, as shown in the following equation:

$$E_{input} = \alpha \cdot TE + \beta \cdot SE + \gamma \cdot PE, \quad (1)$$

where $\alpha$, $\beta$, and $\gamma$ are the attention weights of TE, SE, and PE.
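The weighted combination of the three embeddings can be illustrated with a small sketch. It assumes that the three attention weights are plain scalars applied element-wise to one token's vectors; the function name `combine_embeddings` and all numeric values are hypothetical and chosen only for illustration.

```python
def combine_embeddings(te, se, pe, alpha, beta, gamma):
    """Weighted element-wise sum of token, segment, and position embeddings."""
    if not (len(te) == len(se) == len(pe)):
        raise ValueError("all three embeddings must share one dimension")
    return [alpha * t + beta * s + gamma * p for t, s, p in zip(te, se, pe)]


# Toy 4-dimensional embeddings for a single token (illustrative values only).
te = [1.0, 0.0, 2.0, -1.0]
se = [0.5, 0.5, 0.5, 0.5]
pe = [0.0, 1.0, 0.0, 1.0]

# With unit weights the formula reduces to the plain sum used by standard BERT.
plain = combine_embeddings(te, se, pe, 1.0, 1.0, 1.0)

# Hypothetical weights that sum to 1 and favor the token embedding.
weighted = combine_embeddings(te, se, pe, 0.5, 0.25, 0.25)
```

Setting all three weights to 1 recovers the unweighted BERT input, so the weighted form is a strict generalization of the original summation.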

Computational Intelligence and Neuroscience
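As shown in Figure 2, $N$ is the length of the input sequence, and $d_{model}$ is the dimension of the word embeddings. The position embeddings $E_1, \ldots, E_n$ are obtained by equations (2) and (3). To facilitate comparison with the standard BERT, our paper adopts the same formulas as the official ones:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad (2)$$

$$PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad (3)$$

where $pos$ denotes the position number of the word in the input sequence. The even dimensions (index $2i$) of the position embedding are calculated by equation (2) and the odd dimensions (index $2i+1$) by equation (3). The overall framework of the BERT model follows the officially released structure. The Transformer, which belongs to the encoder-decoder architecture, uses a bidirectional self-attention mechanism. The main operations of the encoder in the Transformer module are given by the following equations:

$$e_{mid} = \mathrm{LayerNorm}(e_{in} + \mathrm{MultiHeadAttention}(e_{in})),$$

$$e_{out} = \mathrm{LayerNorm}(e_{mid} + \mathrm{FFN}(e_{mid})),$$

where $e_{in} \in \mathbb{R}^{N \times d}$ denotes the input of the encoder, $e_{mid}$ is the intermediate output of the attention sublayer, and $e_{out} \in \mathbb{R}^{N \times d}$ denotes the output of the encoder. MultiHeadAttention(·) represents a multihead attention mechanism. FFN(·) is a feedforward neural network. LayerNorm(·) represents layer normalization.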
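In the Transformer module of iBERT, the main operations of the decoder are as follows:

$$d^{(1)} = \mathrm{LayerNorm}(d_{in} + \mathrm{MaskedMultiHeadAttention}(d_{in})),$$

$$d^{(2)} = \mathrm{LayerNorm}(d^{(1)} + \mathrm{MultiHeadAttention}(d^{(1)}, e_{out})),$$

$$d_{out} = \mathrm{LayerNorm}(d^{(2)} + \mathrm{FFN}(d^{(2)})),$$

where $d_{in} \in \mathbb{R}^{M \times d}$ denotes the input of the decoder, $d^{(1)}$ and $d^{(2)}$ are the intermediate outputs of the two attention sublayers, $e_{out}$ is the output of the encoder, and $d_{out} \in \mathbb{R}^{M \times d}$ denotes the output of the decoder. MultiHeadAttention(·), LayerNorm(·), and FFN(·) represent the same functions as those of the encoder. MaskedMultiHeadAttention(·) is a masked multihead attention mechanism.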
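The position embeddings referenced as equations (2) and (3) can be sketched as follows. This assumes they are the standard sinusoidal formulas of the original Transformer (sine on even embedding dimensions, cosine on odd ones, with base 10000); the function name `positional_encoding` is illustrative.

```python
import math


def positional_encoding(num_positions, d_model):
    """Sinusoidal position embeddings: sin on even dimensions, cos on odd ones."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# A small table: 4 positions, 6 embedding dimensions.
pe = positional_encoding(num_positions=4, d_model=6)
```

Position 0 always maps to alternating 0 and 1 values, and each later position is a rotation of those pairs at a frequency that decreases with the dimension index.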

Syntax Tree Construction Based on Syntactic Dependency.
The syntactic tree of a sentence is an interdependence graph of its words, in which the importance of each word is determined by its distance from the central word. Andor et al. [16] proposed a transition-based dependency syntax analysis method and developed the SyntaxNet system (http://github.com/tensorflow/models/tree/master/syntaxnet), which was the most popular construction method for the syntax tree. By studying this system and making corresponding improvements, we have adopted a generation scheme for the syntax tree based on arc transitions.
This method uses a stack (STACK), a buffer (BUFFER), and a set (ARC_SET) [17]. Let $s_1 s_2 \ldots s_j \ldots s_n$ be a given text, where $s_j$ is the $j$th word. At the beginning, the STACK contains only the root node, the ARC_SET is empty, and the BUFFER holds the input word sequence. There are three operations: LEFT_ARC, RIGHT_ARC, and SHIFT (see Algorithm 1). The LEFT_ARC operation adds a left arc between the current word in the buffer and the word on top of the stack, the RIGHT_ARC operation adds a right arc in the same way, and the SHIFT operation moves the current word from the buffer onto the stack. When all words in the buffer have been processed and the state of the STACK is consistent with its initial state, the construction of the syntax tree is complete. Figure 3 shows the dependency tree constructed by this method for the sentence "a woman washed the dishes."
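The stack, buffer, and arc set mechanics can be illustrated with a small sketch. Algorithm 1 is not reproduced in this text, so the code below assumes the common arc-standard variant, in which LEFT_ARC and RIGHT_ARC both operate on the top two stack items. The transition sequence is supplied by hand rather than predicted by a model, and the arcs it produces are a conventional analysis of the example sentence, not necessarily the tree of Figure 3.

```python
def parse(words, transitions):
    """Apply arc-standard transitions; return the set of (head, dependent) arcs.

    Words are used as node labels, so this toy version expects unique words.
    """
    stack = ["ROOT"]
    buffer = list(words)
    arcs = set()
    for action in transitions:
        if action == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT with an empty buffer")
            stack.append(buffer.pop(0))
        elif action == "LEFT_ARC":
            # The top of the stack becomes head of the item below it.
            if len(stack) < 3:
                raise ValueError("LEFT_ARC needs two words on the stack")
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT_ARC":
            # The item below the top becomes head of the top item.
            if len(stack) < 2:
                raise ValueError("RIGHT_ARC needs a head below the top")
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    if buffer or stack != ["ROOT"]:
        raise ValueError("transition sequence did not finish the parse")
    return arcs


sentence = ["a", "woman", "washed", "the", "dishes"]
# Hand-written transition sequence (an assumption for illustration).
sequence = [
    "SHIFT", "SHIFT", "LEFT_ARC",       # woman -> a
    "SHIFT", "LEFT_ARC",                # washed -> woman
    "SHIFT", "SHIFT", "LEFT_ARC",       # dishes -> the
    "RIGHT_ARC",                        # washed -> dishes
    "RIGHT_ARC",                        # ROOT -> washed
]
arcs = parse(sentence, sequence)
```

The final check mirrors the termination condition in the text: the buffer is empty and the stack holds only the root node again.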

Text Representation Based on iBERT and Syntactic Dependence.
The dependency tree is constructed by Algorithm 1. According to the different attention weights at different syntactic positions, we combine and splice the words into the corresponding text semantic representation, where the attention weights of the words are normalized so that $\sum_{i} \mathrm{Attention}_{s_i} = 1$.
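Combining the attention mechanism and word embeddings, we construct sentence embeddings, as shown in equation (8):

$$\mathrm{Attentioned\_Embeddings}_{s_i} = \mathrm{Attention}_{s_i} \cdot \mathrm{Embeddings}_{s_i}. \quad (8)$$

According to the constructed syntactic tree, which contains the dependency relationships between words, we can construct the phrase embeddings of the syntax blocks (equations (9) and (10)). The phrase embeddings are the attention-weighted combination of their words, as shown in Figure 4:

$$\mathrm{Represent}_{sentence} = \mathrm{phrase\_vector}_1 \oplus \mathrm{phrase\_vector}_2 \oplus \cdots \oplus \mathrm{phrase\_vector}_k, \quad (9)$$

$$\mathrm{phrase\_vector}_i = \sum_{s_j \in \mathrm{block}_i} \mathrm{Attentioned\_Embeddings}_{s_j}, \quad (10)$$

where $\mathrm{Represent}_{sentence}$ denotes the semantic embeddings of the sentence, $\oplus$ denotes splicing, $k$ is the number of syntax blocks, and $\mathrm{phrase\_vector}_i$ represents the syntactic elements in the sentence, which mainly involve the subject, predicate, object, and other syntactic elements.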
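The construction of phrase embeddings and their splicing into a sentence representation can be sketched as below. The sketch assumes that a phrase embedding is the attention-weighted sum of its word embeddings and that splicing means concatenation; the block segmentation, the attention values, and the 2-dimensional embeddings are all hypothetical.

```python
def phrase_vector(block, embeddings, attention):
    """Attention-weighted sum of the word embeddings in one syntax block."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for word in block:
        for d in range(dim):
            vec[d] += attention[word] * embeddings[word][d]
    return vec


def sentence_representation(blocks, embeddings, attention):
    """Splice (concatenate) the phrase vectors of all syntax blocks."""
    representation = []
    for block in blocks:
        representation.extend(phrase_vector(block, embeddings, attention))
    return representation


# Toy 2-dimensional word embeddings (illustrative values only).
embeddings = {
    "a": [1.0, 0.0],
    "woman": [0.0, 2.0],
    "washed": [4.0, 4.0],
    "the": [2.0, 0.0],
    "dishes": [0.0, 4.0],
}
# Hypothetical attention weights, normalized to sum to 1 over the sentence.
attention = {"a": 0.125, "woman": 0.25, "washed": 0.25, "the": 0.125, "dishes": 0.25}
# Assumed syntax blocks: subject, predicate, object.
blocks = [["a", "woman"], ["washed"], ["the", "dishes"]]

representation = sentence_representation(blocks, embeddings, attention)
```

With three blocks of 2-dimensional vectors, the spliced representation has six dimensions, one segment per syntactic element.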

The Sentiment Analysis Model Based on iBERT and Syntactic Dependency.
To verify the effectiveness of the proposed model, we construct a text sentiment model in this section, as shown in Figure 5.
We denote the text embeddings as $\mathrm{Text}_1, \mathrm{Text}_2, \ldots, \mathrm{Text}_n$. The stage from the vectorized representation to the sentiment categories is a fully connected network. The parameter weight $W$ is obtained after training, and this matrix is locked (or fixed) during the test. The sentiment categories are $C = \{c_1, c_2, \ldots, c_k\}$, where $k$ is the total number of categories in the sentiment classification. The probability $P_c$ that a text belongs to a certain category is obtained by the following formula:

$$P = W \cdot \mathrm{Text}_i,$$

where the component $P_c$ of $P$ corresponds to category $c$. We use the softmax function to normalize and obtain the category with the highest probability, as shown in the following equation:

$$\hat{c} = \arg\max_{c \in C} \frac{\exp(P_c)}{\sum_{c' \in C} \exp(P_{c'})}.$$

The cross-entropy $\mathrm{Loss}_i$ is used to train the model, and the formula is shown in the following equation:

$$\mathrm{Loss}_i = -\sum_{m \in C} y_{im} \log P_{c_{im}},$$

where $y_{im}$ indicates whether the $i$-th sample belongs to the $m$-th class ($m \in C$): if it does, $y_{im}$ is 1; otherwise, it is 0. $P_{c_{im}}$ represents the predicted probability that the $i$-th sample belongs to the $m$-th category. To obtain a more robust model, we utilize a dropout strategy. The dropout is applied in the fully connected network from the vectorized representation to the sentiment categories $C$, and the dropout value is set to 0.5.
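The classification head described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the fully connected layer is taken to be a single weight matrix without bias, dropout is omitted, and the weight values and the 2-dimensional text vector are hypothetical.

```python
import math


def linear(weights, text_vec):
    """Fully connected layer without bias: one score per sentiment category."""
    return [sum(w * x for w, x in zip(row, text_vec)) for row in weights]


def softmax(scores):
    """Numerically stable softmax over the category scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Cross-entropy for a one-hot label: only the true class term survives."""
    return -math.log(probs[true_index])


# Hypothetical trained weights for 3 categories and a 2-dimensional text vector.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_vec = [2.0, 1.0]

scores = linear(W, text_vec)
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, true_index=2)
```

Because the label vector is one-hot, the sum over categories in the loss collapses to the negative log-probability of the true class, which is what `cross_entropy` computes.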

Data Set.
The first data set is Task 4 of SemEval 2014, which contains two sub-data sets, Laptop and Restaurant. Their format is XML. In Laptop, there are 3045 training sentences and 800 test sentences. In Restaurant, there are 3041 training sentences and 800 test sentences.
The other data set is Subtask A [18] of SemEval 2017, which is mainly used for SDQC support and rumor classification. The class distribution of the training and test sets is shown in Table 1.
S, D, Q, and C represent the four categories: the support category (Support), the objection category (Deny), the doubt category (Query), and the irrelevant comment category (Comment). Category S denotes support for the related content. Category D represents opposition to the related content. Category Q raises questions about the related content, and category C expresses comments that have nothing to do with the related content or themes.

Evaluation.
We use the accuracy (AC) to evaluate the experiments, as shown in the following equation:

$$AC = \frac{TP + TN}{TP + TN + FP + FN},$$

where TP (True Positive) indicates that the prediction (positive) is consistent with the actual label (positive), FP (False Positive) denotes that the prediction (positive) is inconsistent with the actual label (negative), TN (True Negative) represents that the prediction (negative) is consistent with the actual label (negative), and FN (False Negative) indicates that the prediction (negative) is inconsistent with the actual label (positive).
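As a worked example of the accuracy measure, the sketch below assumes the usual definition, correct predictions divided by all predictions; the counts are hypothetical and do not come from the experiments.

```python
def accuracy(tp, tn, fp, fn):
    """Share of predictions that agree with the actual labels."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical confusion counts for 100 test sentences.
ac = accuracy(tp=40, tn=45, fp=5, fn=10)
```

Here 85 of the 100 predictions match the labels, so the accuracy is 0.85.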

Parameter Settings.
In the learning stage of word embeddings, the number of layers is 12 (num_hidden_layers = 12), the number of neurons in the hidden layer of the neural network is 768 (hidden_size = 768), the length of the input text is uniformly set to 512 characters (max_position_embeddings = 512), the dropout is set to 0.1 (attention_probs_dropout_prob = 0.1), the activation function is the GELU function (hidden_act = "gelu"), and the number of parameters is about 110 M.
During the construction of the syntactic tree, we use the default parameters of the SyntaxNet system: the mini-batch size of the syntax analyzer is 32 (parser_batch_size = 32), the learning rate is 0.08 (learning_rate = 0.08), and the momentum is 0.85 (momentum = 0.85).
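For reference, the settings listed above can be collected in one place. The sketch below uses plain dictionaries rather than any library's configuration class. The key `max_position_embeddings` for the 512-character input length is an assumption borrowed from the public BERT configuration format; the remaining keys and values are the ones quoted in the text.

```python
# Word embedding (iBERT) hyperparameters quoted in the text.
IBERT_CONFIG = {
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "max_position_embeddings": 512,  # assumed key name for the input length
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
}

# Syntax analyzer hyperparameters quoted in the text.
PARSER_CONFIG = {
    "parser_batch_size": 32,
    "learning_rate": 0.08,
    "momentum": 0.85,
}


def missing_keys(config, required):
    """Return the required keys that a configuration does not define."""
    return [key for key in required if key not in config]


# An empty list means every expected setting is present.
missing = missing_keys(IBERT_CONFIG, ["num_hidden_layers", "hidden_size", "hidden_act"])
```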

Baselines.
The comparison models are as follows:
(1) TLSTM [19] divides words into two subsequences, one from left to right and the other from right to left, so two different embeddings are obtained. The two embeddings are merged into the final embeddings.
(2) Att-LSTM [20] utilizes an attention mechanism in which the words have different attention weights. The text representation is constructed from the weighted words.
(3) CABSA uses a memory network to acquire the representation through different directions in the sentence.
(4) AGCN [22] uses two gated convolutional neural networks. They can obtain different representations, and the gating mechanism can learn the relational information of words.
(5) BERT [3] utilizes the Transformer as a submodule and obtains word embeddings by a bidirectional mechanism.
(6) GCNDA [23] obtains the weights of words by combining a graph attention mechanism, and it has two kinds of attention, global and local.
Since there are fewer available comparison models for Subtask A, this paper uses the eight released systems for the comparison experiments.

Parameters α, β, and γ.
α, β, and γ, which denote the parameters of the word embeddings, sentence embeddings, and position embeddings, respectively, take the same value in the iBERT model. To better verify their effects, we fix one parameter and adjust the other two. Task_1 in BERT is used to measure the new embeddings against the standard word embeddings. During the experiments, the parameters are normalized by α + β + γ = 1 (see Table 2).
As shown in Figures 6(a)-6(c), the parameters alpha, beta, and gamma refer to α, β, and γ, respectively. In-depth analysis of the composition of the embeddings shows that the weight of the word embeddings is relatively high, followed by the sentence embeddings and the position embeddings. With fixed position embeddings, the final effect is gradually improved, mainly because part of the word information is contained in the sentence embeddings. The sentence embeddings can be regarded as embeddings with larger granularity, and all words in the same sentence share the same sentence embeddings. To a certain extent, this weakens the representation of word embeddings within the same sentence. However, from another perspective, the sentence embeddings added to the words provide a degree of distinction between sentences. Therefore, the sentence embeddings are meaningful in the sentence representation. Simultaneously, the position embeddings are calculated by equations (2) and (3), which are the empirical formulas of the BERT team. The main reason is that different positions have different weights in the composed embeddings.
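The search procedure, fixing one weight and adjusting the other two under the normalization constraint, can be sketched as follows. The step size of 0.05 and the function name `weight_grid` are assumptions made for illustration; the text does not state the actual search granularity.

```python
def weight_grid(fixed_gamma, step=0.05):
    """Enumerate (alpha, beta, gamma) with gamma fixed and alpha + beta + gamma = 1."""
    remaining = 1.0 - fixed_gamma
    count = round(remaining / step)
    grid = []
    for k in range(count + 1):
        # Rounding removes floating-point noise from the repeated steps.
        alpha = round(k * step, 10)
        beta = round(remaining - alpha, 10)
        grid.append((alpha, beta, fixed_gamma))
    return grid


# Candidate settings when the position weight is held at 0.15.
candidates = weight_grid(fixed_gamma=0.15)
```

Each candidate triple would then be evaluated on the measurement task, and the best-scoring one retained.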
After a number of experimental analyses, we find that better word embeddings are obtained when α, β, and γ are 0.65, 0.20, and 0.15, respectively.

Experimental Results on SemEval 2014.
From the results in Table 3 and Figure 7, we can draw the following conclusions.
The TLSTM model has the lowest accuracy among all baselines. The main reason is that this model only considers part of the content and ignores the representation of deep features. The Att-LSTM model can capture the deep features through the long short-term memory network. Simultaneously, it uses the attention mechanism to obtain the relationships of words in different locations. Hence, this model is more accurate than the TLSTM model. The CABSA model uses a memory network, and the effect of the memory-based network is better than that of seq2seq-related models. The CABSA model can memorize the preceding or subsequent text features through the memory network, so it achieves a certain improvement over the previous two models.
Because the AGCN model uses two gated convolutional networks, the relationships of words can be obtained to a certain extent, but the syntactic structure cannot be captured well. BERT is an excellent pretrained model of recent years and improves the expressive ability of word embeddings, but word embeddings constructed by simple addition of embeddings can weaken characteristic information such as syntactic structure. The GCNDA model is the best model among the baselines. The main reason is that this model uses graph convolution combined with an attention mechanism, so that it can obtain part of the structured information.
Our proposed model is BDPT, which combines the improved BERT and the syntactic structure. It uses the attention mechanism and combines syntactic blocks to construct a combined text representation. Therefore, our method can obtain a deep-level representation of semantic information, and it achieves higher accuracy in the classification task of sentiment analysis. Compared with the best model among the baselines, our model improves by 2.1% on Restaurant and 1.9% on Laptop. The time (seconds per ten sentences) consumed by BDPT is also the lowest (Figure 8). Through in-depth analysis, the time complexity of TLSTM is O(n * m + n * n + n), where n denotes Hidden_size and m represents input_size. Att-LSTM adds a weight matrix, and its time complexity is O(n * m + n * n + n + a * a), where the a * a term comes from the added weight matrix.

The results on SemEval 2017 are shown in Table 4 and Figure 9. The DFKI-DKT model only uses sparse word embeddings as input and achieves the worst result among all comparison models. The IITP model uses pairs of the original text and its response as the input. The IKM model uses a convolutional neural network to obtain the text representation, and it uses a softmax classifier to assign a probability to each category. IITP and NileTMRG are implemented by linear and polynomial kernel classifiers, respectively, but they are less effective. The MamaEdha model mixes a variety of neural networks as classifiers. The ECNU system solves the problem of information imbalance by decomposing it into a two-step classification task. DFKI-DKT, MamaEdha, ECNU, and UWaterloo use integrated classifiers, and the classification results are obtained through a voting mechanism. The three models DFKI-DKT, ECNU, and MamaEdha use a mixture of deep learning, machine learning, and manual rules to assign different labels with different weights. All these compared models use carefully designed feature engineering.
IITP, NileTMRG, ECNU, and UWaterloo utilize keywords and key sentences, as well as features of the tweet (such as metadata, tags, and keywords for specific events). IKM and MamaEdha use fewer features and exploit the word embeddings obtained from the CNN network.
The Turing model uses the LSTM network to implement sequence-to-sequence classification. This model comprehensively considers the word embeddings, punctuation embeddings, and the similarity between words, and it incorporates more feature information. Consequently, it obtains the best result among the baselines. Compared with all the baselines, our proposed method incorporates more in-depth features (such as the improved BERT and syntactic dependency trees), and it achieves a better result (1.5% higher than the best baseline). Further, our model has a better representation than all of them because syntactic structure also plays a very important role in the text representation. At the same time, our model takes the least amount of processing time.

Conclusions and Future Work
How to represent the text better is an important task in data mining and data analysis. This paper combines the existing research results and conducts a further study. We have proposed a novel model which combines the improved BERT and the grammatical dependency structure. By incorporating deep semantic features into the text representation, we can obtain a better sentiment analysis model. First of all, we construct a better text representation by studying the grammatical structure and iBERT. Then, we construct the syntactic dependency graph of words. Finally, extensive experiments have been performed on SemEval 2014 and SemEval 2017, where our model achieves higher accuracy than all compared baselines. Experiments show that syntactic structure plays an essential role in the text representation. The next step is to combine more deep-level features (such as the syntactic structure combined with graph convolutional neural networks) for research on text and image sentiment analysis.

Data Availability
The data and the authors' source code used to support the findings of this study will be available at https://alt.qcri.org/semeval2017 (semeval2014) and https://gitee.com/hzxylwf/model.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.