Multistream BertGCN for Sentiment Classification Based on Cross-Document Learning

Very recently, the BERT graph convolutional network (BertGCN) model has attracted much attention from researchers due to its strong text classification performance. However, using only the original documents in the corpus to construct the graph topology for GCN-based models may lose some effective information. In this paper, we focus on sentiment classification, an important branch of text classification, and propose the multistream BERT graph convolutional network (MS-BertGCN) for sentiment classification based on cross-document learning. In the proposed method, we first combine the documents in the training set based on within-class similarity. Then, each heterogeneous graph is constructed from a group of combined documents for a single-stream BertGCN model. Finally, we construct the multistream BertGCN (MS-BertGCN) from multiple heterogeneous graphs built from different groups of combined documents. The experimental results show that our MS-BertGCN model outperforms state-of-the-art methods on sentiment classification tasks.


Introduction
Sentiment analysis, also known as opinion mining, is a basic task in natural language processing (NLP) that analyzes, processes, and extracts opinion elements from subjective, emotionally colored texts using NLP and text classification technology. Sentiment analysis differs from traditional text information processing: the latter focuses only on the literal description, while sentiment analysis focuses on the emotional information embodied in the text and extracts the relevant opinion elements. With the rapid development of network platforms, people increasingly express their opinions and feelings about events and products on the Internet, so a large amount of text accumulates online. Processing this text efficiently and accurately and analyzing users' emotional tendencies is therefore very important. This task is sentiment classification, one of the core tasks of sentiment analysis, which generally classifies texts carrying subjective sentiment. Sentiment classification is widely used in opinion mining [1], public opinion polls [2], product analysis [3], movie recommendation [4], opinion retrieval, and other fields [5]; it can extract hierarchical features of text and mine users' sentimental tendencies. In recent years, sentiment classification has become a popular research topic in NLP, attracting wide attention from scholars, with important academic and commercial value. Traditional sentiment classification methods are mainly based on machine learning or sentiment dictionaries. However, with the rapid development of the Internet and the rapid turnover of online vocabulary, dictionary-based methods require substantial manual effort to keep the sentiment dictionary up to date, which limits them. Machine-learning-based methods rely on manual text annotation and find it difficult to learn deep semantic information. Therefore, deep learning, with its ability to process large amounts of data and its strong generalization, has rapidly become the mainstream approach to sentiment classification.
Research on deep learning for sentiment classification began in 2006, when Hinton [6] proposed a fast learning algorithm that introduced a hierarchical structure into neural networks; this method performs feature learning well and eases the training of deep neural networks. For convolutional neural networks, Kim [7] ran experiments on English texts using a convolutional neural network (CNN) model, and the results showed higher classification accuracy, which started the exploration of CNNs for text classification. Zhang and Byron [8] also proposed using a CNN model for text classification; unlike Kim's model, it keeps sentences arranged as a sentence matrix rather than converting them into vectors. Based on the combined-storage concept and the distributed parallel processing theory of the Hopfield network, Jordan and Michael [9] proposed the recurrent neural network (RNN). Recently, large-scale pretraining has demonstrated its effectiveness on a variety of NLP tasks [10]. A large-scale pretrained model is trained on a large unlabeled corpus in an unsupervised manner and can learn the implicit but rich textual semantics of a language at scale. By combining the advantages of large-scale pretraining and transductive learning, Lin et al. [11] proposed a text classification model named BertGCN, which constructs a heterogeneous graph with word and document nodes for the corpus, initializes node embeddings with trained BERT representations, and uses a graph convolutional network (GCN) for classification.
Many GCN-based methods have achieved high performance in text classification. However, these models directly use the documents and words of the original data as the nodes of the graph, which may ignore useful information in the dataset. For example, if the documents within a certain class differ greatly from each other, that class will not be strongly discriminative compared with other classes, which may make classification between classes difficult. To address this problem, we propose the multistream BertGCN (MS-BertGCN) classification model based on cross-document learning and apply it to sentiment classification. Specifically, we first combine documents in the training set using within-class similarity: in each class, we calculate the similarities between documents and combine the documents with the lowest similarity to obtain groups of combined documents. Then, we use each group of combined documents to train a single-stream BertGCN. Finally, we construct our multistream BertGCN (MS-BertGCN), which achieves higher classification accuracy by fusing the classification scores of all single-stream BertGCNs.

Related Work
With the development of deep learning technology, deep learning models have become the mainstream solution for text classification. In deep-learning-based sentiment analysis, researchers have proposed the convolutional neural network (CNN) [8], the recurrent neural network (RNN) [12], the long short-term memory network (LSTM) [13], and other neural networks for better classification. As one of the important models in deep learning, the CNN was first proposed by Fukushima in 1980. It is widely used not only in computer vision but also in natural language processing (NLP). Zhang and Byron [8] successfully applied the CNN to the sentiment analysis task for the first time. Tang et al. [13] proposed using a long short-term memory network (LSTM) to model the emotional relationships between sentences, which mitigates the vanishing- and exploding-gradient problems. Huang et al. [14] showed that syntactic knowledge can be encoded in neural networks (RNN and LSTM), and experiments showed that it effectively improves the accuracy of sentiment text classification. Zhang et al. [15] proposed a three-way enhanced convolutional neural network model named 3W-CNN and achieved better text classification performance than the CNN. Xing et al. [16] proposed a novel parameterized convolutional neural network for aspect-level sentiment classification, and experiments demonstrated that CNN-based models achieve excellent results on sentiment datasets. The HEAT model further used a hierarchical attention mechanism to capture aspect information for aspect-specific sentiment analysis of sentences, improving the accuracy of fine-grained sentiment analysis. Mohan and Yang [17] also contributed to this line of work. The GCN [20] improves on spectral graph convolution, using the degree matrix and adjacency matrix of the graph instead of a costly eigen-decomposition. The authors also propose a layered linear model, which restores the expressiveness of the convolution filter by stacking multiple convolutional layers and removes the explicit parameterization imposed by the Chebyshev polynomial approximation. The GCN can alleviate overfitting to the local neighborhood structure of nodes in classification problems by using a wider node distribution, and, from the perspective of computational cost, layered linear operations allow building deeper network models. TextGCN [21], which applies graph convolution to text classification, constructs the whole corpus as one large topological graph, with the words and documents of the corpus as nodes and the relationships between them as edges. Document-word edges are weighted by the term frequency and document frequency of words; word-word edges are based on global word cooccurrence, counted by sliding a fixed-size window over the corpus, with the weight of the connection between two word nodes given by pointwise mutual information (PMI). TextGCN can capture the relationships between documents and words and the global cooccurrence of words, and the label information of document nodes can be passed to other words and documents through neighboring nodes. Classification experiments show that the model outperforms several existing text representation models, such as the CNN, LSTM, and FastText [22], as well as some graph-network-based models, including SK-GCN [23], AGCN [24], Graph-CNN-C [25], Graph-CNN-S [26], and Graph-CNN-F [27]. By improving GAT, a BiGAT model [28] was proposed to describe the contextual information of sentences; experiments show that BiGAT effectively improves the speed of text classification while preserving its accuracy.

Pretraining Model.
Pretrained language models (PLMs) are currently the most powerful models for natural language tasks. Such a model is pretrained in an unsupervised manner on a large-scale unlabeled text corpus, and the trained network can then be fine-tuned directly on various downstream NLP tasks to obtain better results. BERT is such a pretraining model, which uses the encoder layers of the transformer architecture as the main framework of the algorithm. By jointly training on the masked language model (MLM) and next sentence prediction (NSP) tasks, it realizes bidirectional modeling of the data flow and effectively solves the problems of one-way language models. In addition, by using more powerful machines to train on larger amounts of data, the model generates higher-quality text representations for downstream tasks and takes results to a whole new level. The model was first proposed in 2018 [10] and set new state-of-the-art results on 11 NLP tasks.
RoBERTa [29] introduced a dynamic masking mechanism and removed the next sentence prediction task in the pretraining stage. The model also increased the training data and some training parameters, such as the sequence length and the number of texts per training batch. The SpanBERT model [30] improved BERT's MLM pretraining task: first, instead of randomly masking single words as BERT does, it masks contiguous spans of words; second, it incorporates the span boundary objective (SBO) [31]. Wang et al. [32] proposed a novel structural pretraining that extends BERT by combining word-structure and sentence-structure objectives to exploit linguistic structure in contextual representations; the model explicitly models language structure by being forced to reconstruct the correct word and sentence order for prediction. To reduce BERT's running time, a new knowledge-distillation method was proposed to compress the model, which not only saves running time and memory but also preserves strong performance [33]. The MobileBERT model [34] keeps a number of layers similar to BERT-large but adds a bottleneck mechanism to the transformer in each layer; although the mechanism makes each transformer layer narrower, the model maintains the balance between the self-attention and feedforward layers. A layer-wise adaptive large-batch optimization technique [35] was then proposed, which can reduce BERT's training time from 3 days to 76 minutes. ALBERT [36] is a lightweight BERT model that improves the traditional BERT in two respects: the first is mainly to reduce the parameters and running time of the traditional model; the second is mainly to improve the accuracy of the model on downstream tasks.
During pretraining, ALBERT replaced the traditional next sentence prediction task with a sentence order prediction (SOP) task. This method not only makes it easier to generate pretraining samples but also improves the accuracy of the pretrained model on downstream NLP tasks. More recently, a more efficient pretraining task and framework [37] was proposed, which effectively combines BERT with a GAN-like structure. Unlike BERT, this pretraining task lets the model learn from all the words in the input sentence rather than only the masked ones, so the model learns more detailed semantic information more efficiently. Different from the above methods, our model fuses multiple BertGCNs for sentiment classification based on cross-document learning, which retains richer information from the corpus.

Method
In this section, we describe the proposed model in detail. The multistream BertGCN (MS-BertGCN) model is obtained by fusing multiple single-stream BertGCN models. Each single-stream BertGCN model is constructed from a BERT module and a GCN module following the method of Lin et al. [11].
The overall algorithm is shown in Algorithm 1; lines 2-5 describe the construction of the single-stream BertGCN model.

Construction of a Graph
3.1.1. Combination of Documents. We first build a corpus from the documents of one class in the training set and process it with the TF-IDF model. Then, we calculate the similarities between the documents within the class and combine the documents with the lowest similarity. Finally, we repeat these steps to obtain groups of combined documents.
Since sentiment classification is based on the semantics of the text, the similarity measure between documents should be semantic, and we use cosine similarity to compare documents. Common distances for semantic similarity include the Euclidean distance and the cosine distance: the cosine distance measures relative differences in direction, while the Euclidean distance measures numerical differences. For sentiment classification, cosine similarity is more suitable.
Based on the TF-IDF vectors $v_i$ and $v_j$ of two documents $d_i$ and $d_j$, the cosine similarity is

$$\mathrm{sim}(d_i, d_j) = \frac{v_i \cdot v_j}{\|v_i\|\,\|v_j\|}, \quad 1 \le i < j \le m,$$

where $m$ is the number of documents in each class and $t = \binom{m}{2}$ is the number of document pairs whose similarity is calculated. Based on the documents (combined documents for training and the original documents in the test set), we then construct a heterogeneous graph for the proposed model.
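The combination step above can be sketched in plain Python. This is a minimal illustration, not the paper's exact procedure: the function names, the toy documents, and the greedy lowest-similarity pairing strategy are our assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf(docs):
    """Plain TF-IDF vectors over a shared vocabulary (no smoothing)."""
    toks = [d.lower().split() for d in docs]
    vocab = sorted({w for t in toks for w in t})
    df = Counter(w for t in toks for w in set(t))   # document frequency
    n = len(docs)
    return [[t.count(w) * math.log(n / df[w]) for w in vocab] for t in toks]

def cosine(u, v):
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def combine_least_similar(docs):
    """Greedily pair the least-similar documents within one class."""
    vecs = tfidf(docs)
    pairs = sorted(combinations(range(len(docs)), 2),
                   key=lambda p: cosine(vecs[p[0]], vecs[p[1]]))
    used, combined = set(), []
    for i, j in pairs:  # take the currently least-similar unused pair
        if i not in used and j not in used:
            used.update((i, j))
            combined.append(docs[i] + " " + docs[j])
    return combined

docs = ["the movie was great", "a wonderful film",
        "terrible acting throughout", "great great movie"]
print(combine_least_similar(docs))
```

With $m = 4$ documents, $t = \binom{4}{2} = 6$ pair similarities are evaluated and two combined documents are produced.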

Heterogeneous Graph Construction.
We need to build a heterogeneous graph composed of nodes and edges. There are two types of nodes in the graph: documents (combined documents for training and the original documents in the test set) and words, where the word nodes are the distinct words of the documents. The weights between words and documents are defined by TF-IDF (term frequency-inverse document frequency), and the weights between words are defined by positive pointwise mutual information (PPMI) [38].

Node Feature Initialization.
The input feature matrix of our model is defined as

$$X \in \mathbb{R}^{(n_{doc} + n_{word}) \times d},$$

where $n_{doc}$ is the number of document nodes in the graph, $n_{word}$ is the number of word nodes, and $d$ is the dimension of the node feature vectors.
In order to take advantage of BERT's ability to pretrain on large-scale unlabeled corpora, we initialize all document nodes $X_{doc}$ of our GCN with BERT. For a fair comparison, we initialize all word nodes of the GCN to zero instead of using a random initialization strategy, as in [11].
After obtaining the node feature vectors, $X$ is fed into our GCN model over the built heterogeneous graph to train the GCN. The output of the $i$-th GCN layer is calculated as

$$L^{(i)} = \rho\left(\tilde{A} L^{(i-1)} W^{(i)}\right),$$

where $\rho$ is the activation function, $\tilde{A}$ is the normalized adjacency matrix, $W^{(i)} \in \mathbb{R}^{d_{i-1} \times d_i}$ is a weight matrix, and $L^{(0)} = X$ is the input feature matrix of the model.
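One GCN propagation step of this form can be sketched in plain Python. The two-node toy graph, features, and weight matrix below are invented for illustration; ReLU stands in for the activation function $\rho$.

```python
import math

def matmul(A, B):
    """Naive matrix product for small dense matrices (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def normalize(A):
    """Symmetrically normalized adjacency with self-loops:
    D^{-1/2} (A + I) D^{-1/2}."""
    n = len(A)
    A_hat = [[A[i][j] + (i == j) for j in range(n)] for i in range(n)]
    d = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)]
            for i in range(n)]

def gcn_layer(A_norm, L_prev, W):
    """L(i) = relu(A_norm @ L(i-1) @ W(i))."""
    H = matmul(matmul(A_norm, L_prev), W)
    return [[max(0.0, x) for x in row] for row in H]

A = [[0, 1], [1, 0]]           # two connected nodes
X = [[1.0, 0.0], [0.0, 1.0]]   # L(0) = X, the input features
W = [[0.5, -0.5], [0.5, 0.5]]  # toy weight matrix
print(gcn_layer(normalize(A), X, W))  # -> [[0.5, 0.0], [0.5, 0.0]]
```

Each node's new feature is an activated, weighted average of its own and its neighbors' features, which is exactly how label information spreads through the document-word graph.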
After graph propagation, the output of the last GCN layer is fed into a softmax:

$$Z_{GCN} = \mathrm{softmax}\left(g(X, \tilde{A})\right),$$

where $g(\cdot)$ denotes the graph model. The model is trained with the standard cross-entropy loss function.
The GCN layer operation in our proposed MS-BertGCN models has complexity $O(|E|FC)$, where $|E|$ is the number of edges of the graph, $F$ is the number of convolutional kernel parameters, and $C$ is the dimension of each feature vector.
The general idea of the BertGCN model is to use BERT-style models (such as BERT and RoBERTa) to initialize the features of the document nodes in the text graph. These features are used as inputs to the GCN, which then iteratively updates the document features according to the graph structure; its output is taken as the final feature of each document node and sent to the softmax classifier for prediction. Our model takes full advantage of the complementary strengths of pretraining and graph models. The single-stream BertGCN based on cross-document learning is shown in the red box of Figure 2.

Prediction by Interpolation.
Since BERT and GCN process data differently and have different model sizes, directly combining them does not lead to model convergence. In addition, because BERT is so large, the model cannot load all the nodes of the entire graph at once, which hinders the training of BertGCN.
Following the method of Lin et al. [11], we linearly interpolate the two predictions obtained from the GCN and BERT acting separately on the text to get the fused classification:

$$Z = \lambda Z_{GCN} + (1 - \lambda) Z_{BERT},$$

where $\lambda$ controls the trade-off between the two prediction objectives. When $\lambda = 1$, the BERT module is not updated; when $\lambda = 0$, the GCN module is not updated; when $\lambda \in (0, 1)$, both modules are updated, and the overall BertGCN module can be tuned by adjusting $\lambda$ to achieve rapid convergence.
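The interpolation controlled by λ can be sketched in a few lines; the two prediction vectors below are made-up toy values, not model outputs.

```python
def interpolate(z_gcn, z_bert, lam):
    """Z = lam * Z_GCN + (1 - lam) * Z_BERT, elementwise."""
    assert 0.0 <= lam <= 1.0
    return [lam * g + (1.0 - lam) * b for g, b in zip(z_gcn, z_bert)]

z_gcn = [0.8, 0.2]   # GCN softmax output (toy)
z_bert = [0.6, 0.4]  # BERT softmax output (toy)
print(interpolate(z_gcn, z_bert, 0.7))  # approximately [0.74, 0.26]
```

With lam = 1.0 only the GCN prediction survives and BERT receives no gradient; with lam = 0.0 the reverse holds, matching the two limiting cases described above.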

Memory Storage and Small Learning Rate.
Due to the existence of BERT, BertGCN can only load one batch rather than the entire graph at a time during training, and the memory limitation prevents the full-batch method from being applied to BERT. To this end, BertGCN uses a memory bank to solve this problem. The memory bank M stores the features of all document nodes, decoupling the graph nodes from each batch during training, so each batch only needs to fetch a small subset of the node features. Specifically, the memory storage mechanism works as follows:
Step 1: At the beginning of each epoch, store the document-node features computed by the current BERT module in the memory bank M.
Step 2: At each iteration, for the document index set B = {b_0, ..., b_n} selected by the batch, use the current BERT module to compute their document features M_B and update them in M.
Step 3: Use the updated M as the input of the GCN module to compute the loss and train the model.
Step 4: During backpropagation, only the document nodes in B are updated; the other entries of M remain unchanged.
In other words, the memory storage mechanism dynamically updates a small set of document nodes at each iteration and uses this set of nodes to train the model. This avoids reading all features into BERT at once, greatly reducing memory overhead. However, since the document nodes are updated in batches, the features input to the model become inconsistent across iteration steps within an epoch. To mitigate this, BertGCN uses a smaller learning rate when updating the BERT module to reduce the inconsistency between features. To speed up training, BertGCN also initializes its BERT module with a BERT model fine-tuned on the downstream dataset before training.
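The four steps of the memory-bank mechanism can be sketched with a toy class. The class name, shapes, and stand-in encoder below are illustrative only; in BertGCN the encoder is the BERT forward pass.

```python
class MemoryBank:
    """Caches one feature vector per document; batches refresh only their own rows."""

    def __init__(self, num_docs, dim):
        self.M = [[0.0] * dim for _ in range(num_docs)]

    def refresh_epoch(self, encode):
        # Step 1: re-encode every document at the start of an epoch.
        self.M = [encode(i) for i in range(len(self.M))]

    def update_batch(self, batch_ids, encode):
        # Step 2: re-encode only the documents of the current batch;
        # all other rows (Step 4) stay frozen.
        for i in batch_ids:
            self.M[i] = encode(i)

def fake_encoder(i):
    """Stand-in for the BERT document encoder."""
    return [float(i), float(i) + 1.0]

bank = MemoryBank(num_docs=4, dim=2)
bank.refresh_epoch(fake_encoder)
bank.update_batch([1, 3], lambda i: [9.0, 9.0])  # only docs 1 and 3 change
print(bank.M)  # rows 0 and 2 keep their epoch-start features
```

Step 3 would feed the whole (partially refreshed) matrix `bank.M` to the GCN, so the GCN always sees features for every document even though BERT encoded only one batch.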

Multistream BERT Graph Convolutional Network.
The proposed multistream BERT graph convolutional network (MS-BertGCN) model combines multiple independent BertGCNs based on cross-document learning and fuses the softmax scores of each graph convolutional network to obtain the final prediction. For each group of combined documents, a softmax score $R_i$ ($i = 1, \ldots, n$) over the test set is obtained, and the MS-BertGCN prediction is

$$F = \sum_{i=1}^{n} \alpha_i R_i,$$

where $\alpha_i$ is the weight of $R_i$ and $F$ is the fused score. Finally, we obtain the prediction for the original documents in the test set according to the value of $F$. The schematic of the MS-BertGCN model is shown in Figure 2.
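The multistream fusion can be sketched directly; the per-stream scores and weights below are illustrative values, not results from the paper.

```python
def fuse(streams, alphas):
    """F = sum_i alpha_i * R_i over per-stream softmax score vectors."""
    assert len(streams) == len(alphas)
    num_classes = len(streams[0])
    return [sum(a * r[k] for a, r in zip(alphas, streams))
            for k in range(num_classes)]

R = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3]]  # softmax scores of 3 streams (toy)
alpha = [0.5, 0.25, 0.25]                 # stream weights (toy)
F = fuse(R, alpha)
print(F, "-> predicted class", F.index(max(F)))
```

The predicted class is simply the argmax of the fused score vector F, so streams with larger weights pull the final decision toward their own prediction.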

Experiments
4.1. Datasets. Two widely used sentiment analysis datasets are applied in our experiments: the movie review (MR) dataset and the Stanford Sentiment Treebank (SST-2). Table 1 shows the summary statistics of these datasets.
(i) MR. The movie review dataset is a binary-classification English movie review corpus with 10662 samples in total. Each sentence is labeled positive or negative (denoted as 0 and 1); by sentiment class, there are 5331 positive and 5331 negative sentences. (ii) SST-2. The Stanford Sentiment Treebank (SST) is an English sentiment classification dataset of movie reviews published by Stanford University. Reviews are divided into five levels: very negative, negative, neutral, positive, and very positive. After reorganization (neutral reviews were deleted, very positive and positive reviews were marked as positive, and negative and very negative reviews were marked as negative), the binary classification dataset SST-2 was obtained, with 9613 samples in total.
In our experiments, we use the same data preprocessing procedure and training/test splits as in the paper of Lin et al. [11].For each dataset, we randomly sample 90% of the training set samples as the real training set, and the remaining 10% is used as the validation set.

Baseline.
In our experiments, we use the baseline results of Lin et al. [11] and Yao et al. [21]. To demonstrate the effectiveness of the proposed MS-BertGCN model, we compare it with the conventional CNN, LSTM, and Bi-LSTM models as well as several strong pretraining and GCN models: TextGCN, SGC, BERT_base, and RoBERTa. The details of each model are as follows: 4.2.1. CNN. The convolutional network model [7] convolves the corpus with different convolution kernels to extract features and finally feeds a pooling layer for classification; it includes the standard CNN for sentence classification, CNN-rand with randomly initialized word embeddings, and CNN-non-static with pretrained word embeddings.

LSTM.
LSTM is a unidirectional LSTM network model [39, 40], which can only extract features from the corpus sequentially from front to back and uses the last hidden-layer vector to update the parameters.

Bi-LSTM.
This model is a bidirectional LSTM [41], an improvement over the traditional LSTM. It includes a forward layer and a backward layer, connecting two LSTM networks with opposite time directions to the same output. The forward layer captures the historical information of the input sequence, and the backward layer captures the subsequent information. For a fair comparison of model performance, we designed each stream of the model according to the settings of Lin et al. [11]. For BERT and RoBERTa, we use the output features of [CLS] as document embeddings and then use a feedforward layer to derive the final prediction, and we implement BertGCN with BERT_base and a two-layer GCN. For the learning rates, we initialized the GCN to 1e-3 and the fine-tuned BERT to 1e-5. We also implemented our model with RoBERTa and GAT. Table 2 shows the test accuracy of each model on the datasets. As shown in Table 2, our proposed MS-BertGCN and MS-RoBERTaGCN achieve better classification results than the baseline models on these datasets, which indicates the effectiveness of the proposed method for sentiment classification. The accuracy of using only BERT or RoBERTa is higher than that of TextGCN and SGC on MR because of the huge advantage of large-scale pretraining. On these datasets, the accuracy of BertGAT is lower than that of BertGCN: in GAT, edge weights are computed by attention instead of TF-IDF and PPMI, which loses edge-weight information, so its effect is not as good as BertGCN's.

The Effect of λ.
The trade-off between the GCN and BERT is controlled by λ during training, and the best value of λ may differ across datasets. Figure 3 shows the accuracy of RoBERTaGCN under different λ. On SST-2, the accuracy increases as λ grows, owing to the strong performance of the graph-based method, and the model reaches its best performance at λ = 0.7.

Conclusion and Future Work
In this work, we propose the MS-BertGCN sentiment classification model based on cross-document learning. First, we combine the documents in the training set based on within-class similarity. Then, we use each group of combined documents to train a BertGCN model. Finally, we fuse these BertGCN models to construct the multistream BertGCN (MS-BertGCN) based on cross-document learning. The experimental results show that our proposed model achieves state-of-the-art performance on sentiment classification tasks. Since our MS-BertGCN models adopt only a mini-batch gradient descent approach for training, loading larger corpora into them may lower training efficiency. To deal with the memory restrictions of our models, it would be interesting to study how to simplify the model parameters and optimize the combination of the BERT and GCN models for this task in future work.

Figure 1: The schematic of our MS-BertGCN based on cross-document learning.

Figure 2: The schematic of our MS-BertGCN based on cross-document learning.

Figure 3: Accuracy of RoBERTaGCN when varying λ on the SST-2 development set.
BERT [10] is a large-scale pretrained NLP model, including BERT_base and BERT_large.

Table 1: Summary statistics of the datasets.

Table 2: Test accuracy of different models. We run all models 10 times and report the mean test accuracy. The bold values in Table 2 indicate the best test accuracy for each dataset.