Hierarchical Self-Attention Hybrid Sparse Networks for Document Classification

Document classification is a fundamental problem in natural language processing, and deep learning has demonstrated great success in this task. However, most existing models do not incorporate sentence structure as a semantic feature in the architecture and pay little attention to the contextual importance of words and sentences. In this paper, we present a new model based on a sparse recurrent neural network and a self-attention mechanism for document classification. We then analyze three variant models of GRU and LSTM to evaluate the sparse model on different datasets. Extensive experiments demonstrate that our model obtains competitive performance and outperforms previous models.


Introduction
Text classification is one of the most important subtasks in natural language processing and can be divided into short text classification and document classification according to the text length. Traditional machine learning methods were often used for document classification in the past. However, they cannot fully express the semantic information of the text. With the development of deep learning, many newer methods learn word vector representations that capture semantic relations in the vector space. CNNs, RNNs, and attention mechanisms have proven to have strong capabilities in modeling sequence information. Thus, these methods are widely applied to document classification.
More recently, Liu [1] extracted both the forward and backward n-gram features of the text via bidirectional convolutional operations. Pappagari [2] extended the fine-tuning procedure of BERT to address one of its major limitations for document classification: applicability to inputs longer than a few hundred words. Yi [3] proposed a local and global context attention (LGCA) model and a multicontext attention (MCA) model to extract text features. However, most of these recent methods do not consider the differing importance of sentences and words. Specifically, it has been indicated that some critical sentences and words in a document have a clear relation to the classification result. Moreover, these methods did not address sentences and words in a document separately, nor did they effectively select informative sentences and words.
Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) have been introduced in many hybrid models with improved results. Although these models achieve outstanding results in several tasks, they increase the number of parameters and the runtime because of the added gate units. Thus, attempts have been made to prune the parameters of the gates to compress RNN models. Many researchers have designed structured sparse strategies for recurrent neural network models [4][5][6], but analysis on large-scale datasets, especially in text classification, is lacking. In order to reduce computational expense while preserving accuracy, we integrate the sparse strategy into the hybrid RNN model for document classification.
In this paper, we present a new model for document classification that improves model ability by selecting better representations through self-attention, inspired by [7,8]. At the same time, we adopt three different sparse strategies to restructure GRU and LSTM, preserving effectiveness while reducing redundant parameters. We name our model the Hierarchical Self-Attention Hybrid Sparse Network (HSAHSN); it is based on sparse bidirectional RNNs and self-attention. Our contributions can be summarized as follows: (1) We adopt self-attention layers to capture a deeper relationship for the contextual importance of words and sentences through hierarchical representation. (2) We propose three sparse RNN models with hierarchical representation on datasets of different scales, which decrease the parameters and runtime. (3) We evaluate our models on two document classification datasets, demonstrating that our model obtains competitive performance and outperforms previous models.

Deep Neural Networks for Document Classification.
CNNs can capture n-gram-like information: as the number of convolutional layers increases, the receptive field of the convolution expands, and wider semantic information can be obtained. Kim [9] adopted multiple filters with different window sizes to extract multiscale convolutional features for text classification. Johnson and Zhang [10] proposed a low-complexity word-level deep convolutional neural network (CNN) architecture for text categorization that can efficiently represent long-range associations in text. Kim and Yang [11] proposed the sequence-to-convolution neural networks (Seq2CNN) and the Gradual Weight Shift (GWS) method to stabilize training. RNNs have a superior ability to model sequence information, which suits mining semantic information from data with sequential characteristics. Wang and Tian [12] incorporated residual networks [13] into the RNN, which allows the model to handle longer sequences. Xu [14] proposed a novel LSTM with a cache mechanism to capture long-range sentiment information. Mikhail and Nikunj [15] proposed the à la carte embedding based on byte-level recurrent language models [16], which achieves impressive results.
The attention mechanism can identify the important information in a sentence regardless of the distance between words at different positions. Giannis and Antoine [17] represented documents as word cooccurrence networks and proposed an application of the message passing framework for document understanding. Manzil and Guru [18] proposed BigBird, whose sparse attention mechanism reduces the quadratic dependency on sequence length to linear, and showed that BigBird is a universal approximator of sequence functions.
Although recent deep neural networks have achieved great success in document classification, most existing models do not select task-friendly features at the sentence and word levels by splitting the document to learn sentence representations. In addition, the traditional attention mechanism depends heavily on external information, so we adopt self-attention [19] to capture the internal relations among words and sentences, which can replace sequence-aligned recurrence entirely.

Sparse Recurrent Neural Networks.
A number of researchers have reported on the influence of sparse strategies on RNN units. In the first category of methods [4,20], filter pruning is used for network compression. In particular, Xiong and Ling [5] used pruning strategies to preserve important connections during the training phase. Wen [6] decreased the memory requirements of LSTMs by altering their structure. Dey and Salem [21] evaluated three variants of the GRU in recurrent neural networks by reducing parameters in the update and reset gates. The most recent sparse methods for RNNs have been applied to many tasks with superior results. We can integrate the sparse strategy into the hybrid RNN model to decrease parameters and runtime while preserving accuracy in document classification.

Hierarchical Self-Attention Hybrid Sparse Network (HSAHSN).
The architecture of our model has two main components: a sparse word encoder and a sparse sentence encoder, each consisting of a sparse bidirectional recurrent neural network and self-attention. The following two subsections describe the overall framework of our model (Section 2.1) and how we apply sparse methods in RNN cells (Section 2.2).

Framework of HSAHSN.
The HSAHSN model has three parts: word embedding, sparse word encoder, and sparse sentence encoder. Figure 1 shows the main structure of the model, which includes a CNN, a sparse bidirectional RNN, and a self-attention mechanism.
In the word embedding layer, we adopt Fasttext [22] as the word embedding initialization. To extract different n-gram word representations in a sentence, we apply filters of different sizes in the CNN layer after word embedding, which extracts more features and improves generalization ability. Given a sentence with words $o_{it}$, $t \in [1, n]$, and an embedding matrix $W_e$ trained by Fasttext, words are embedded to vectors through $W_e$. Subsequently, the embedded words are convolved with filters $W_f$ of various sizes, and the outputs are concatenated into a sentence feature matrix $c^*_i$. Each filter output is computed by $f(\cdot)$, a composite function of two cascaded operations: a convolution and a rectified linear unit (ReLU). The feature matrix is then passed to the sparse bidirectional RNN, and concatenating the forward hidden state $\overrightarrow{h}^w_{it}$ and the backward hidden state $\overleftarrow{h}^w_{it}$ gives the bidirectional representation $h^w_{it} = [\overrightarrow{h}^w_{it}, \overleftarrow{h}^w_{it}]$ of word $w_{it}$, which summarizes the information for the whole sentence. The importance of words in a sentence differs, and attention mechanisms have been introduced to address this. However, the traditional attention mechanism depends heavily on external information, and its effect depends on the initialization and training of parameters.
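The embedding and multi-width convolution step can be illustrated with a minimal NumPy sketch. It is not the authors' Keras implementation: the window sizes (2, 3, 4), the number of filters, the zero "same" padding that keeps every filter output aligned with the n words, and the function names `conv_relu` and `sentence_features` are all assumptions made for illustration.

```python
import numpy as np

def conv_relu(x, w):
    """Valid 1-D convolution over the word axis followed by ReLU.

    x: (n_words, emb_dim) embedded sentence
    w: (width, emb_dim, n_filters) filter bank for one window size
    returns: (n_words - width + 1, n_filters)
    """
    width = w.shape[0]
    n_out = x.shape[0] - width + 1
    out = np.stack([
        np.tensordot(x[t:t + width], w, axes=([0, 1], [0, 1]))
        for t in range(n_out)
    ])
    return np.maximum(out, 0.0)

def sentence_features(token_ids, W_e, filters):
    """Embed a sentence and concatenate the outputs of several filter widths.

    Assumption: zero padding so that every width yields n_words rows,
    which lets the outputs be concatenated along the feature axis.
    """
    x = W_e[token_ids]                       # (n_words, emb_dim)
    feats = []
    for w in filters:
        width = w.shape[0]
        left = (width - 1) // 2
        padded = np.pad(x, ((left, width - 1 - left), (0, 0)))
        feats.append(conv_relu(padded, w))   # (n_words, n_filters)
    return np.concatenate(feats, axis=1)     # (n_words, total_filters)
```

With three filter banks of five filters each, a 50-word sentence gives a 50 x 15 feature matrix, one row per word, which is the shape a recurrent layer over words would consume.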
To extract the internal relations of words and sentences in a document, we use the self-attention mechanism to evaluate the importance of words in a sentence. The hidden states $h^w_{it}$ are packed together into matrices $Q_{it}$, $K_{it}$, and $V_{it}$, from which self-attention produces $v$, the sentence vector that captures all the information of the words in a sentence. In this way, we can calculate the importance of words in a sentence and obtain the relevance of the words in the sentence directly. Given the sentence vectors $s_i$, we can obtain a document vector in the same way. First, each $s_i$ is fed into the bidirectional sparse RNNs, and the model obtains the bidirectional semantic information of each sentence in the document by concatenating the forward and backward hidden states into $h^s_i$. $h^s_i$ is a vector that incorporates nearby sentences but still concentrates on sentence $i$. However, we need to attend to the sentences that are decisive for classifying the document. We therefore apply the self-attention mechanism again to obtain the significant information from the sentence-level vectors $h_i$. Then, the $h_i$ are packed together into matrices $Q_i$, $K_i$, and $V_i$, from which self-attention produces $s$, the document vector that captures all information of the sentence vectors in a document. Finally, we feed $s$ into a fully connected layer with softmax to obtain a probability distribution over classes.
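The attention step can be sketched as follows, assuming the scaled dot-product form of self-attention from [19]. The projection matrices `W_q`, `W_k`, `W_v`, the mean pooling used to reduce the attended states to a single vector, and the function names are assumptions for illustration; the exact formulation in the original equations may differ.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_pool(h, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of hidden states.

    h: (n, d_in) hidden states (words of a sentence, or sentences of a document)
    returns: a pooled vector of size d_out and the (n, n) attention weights
    """
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # pairwise relevance
    weights = softmax(scores, axis=-1)        # each row sums to 1
    attended = weights @ v                    # (n, d_out)
    # Assumption: average over positions to obtain one vector.
    return attended.mean(axis=0), weights
```

In a hierarchical setting the same function would be applied twice: once over the word-level hidden states of each sentence to obtain sentence vectors, and once over the sentence-level hidden states to obtain the document vector.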

Sparse RNN Module.
In order to decrease training time and the number of parameters while preserving effectiveness, we adopt three sparse strategies for the RNN model. GRU and LSTM capture the state of sequences through a gating structure, which alleviates the vanishing or exploding gradients of traditional RNNs on long sequences. Gating units control the sequential information flow, and each gate parameter is updated using information from the whole network. Thus, the current state can absorb the information of the preceding state and the current input. However, the gating signals, which are key to the state of the network, probably carry redundant information that may impair the comprehension of the model. In this study, we adopt three different variants of the gating strategy for GRU and LSTM, where $\sigma$ denotes the sigmoid function:
Variant 1. Each gate unit is computed only from the previous hidden state $h_{t-1}$ with weight $U$ and bias $b$. We call variant 1 GRU1 and LSTM1. For GRU1:
$$z_t = \sigma(U_z h_{t-1} + b_z), \quad r_t = \sigma(U_r h_{t-1} + b_r)$$
Variant 2. Each gate unit is computed only from the previous hidden state $h_{t-1}$ with weight $U$, without the bias $b$. We call variant 2 GRU2 and LSTM2. For GRU2:
$$z_t = \sigma(U_z h_{t-1}), \quad r_t = \sigma(U_r h_{t-1})$$
Variant 3. Each gate unit is computed only from the bias $b$, without the previous hidden state $h_{t-1}$. We call variant 3 GRU3 and LSTM3. For GRU3:
$$z_t = \sigma(b_z), \quad r_t = \sigma(b_r)$$
In the GRU or LSTM architecture, the recurrent hidden state can be expressed as follows:
GRU:
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
LSTM:
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
where $x_t$ is a $k$-dimensional vector at step $t$, and $h_t$ is an $n$-dimensional vector, so $n$ can be treated as the output size. $W$ and $U$ are the parameters applied to the input $x_t$ and the hidden state, respectively, and the bias $b$ is added at the end. Accordingly, $W$, $U$, and $b$ are an $n \times k$, $n \times n$, and $n \times 1$ matrix, respectively. The total number of parameters of the recurrent hidden state is $n^2 + n \times k + n$. In the GRU cell, there are two gates, the update gate $z_t$ and the reset gate $r_t$. In the standard cell they have the same parameter structure as the recurrent hidden state. Thus, together with the recurrent hidden state, the total number of parameters in GRU is $3 \times (n^2 + n \times k + n)$ when the input and output dimensions are $k$ and $n$, respectively.
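The gate variants can be sketched as a single GRU step in NumPy. This is an illustrative sketch rather than the authors' implementation. Variant 2 is taken here to use only $U h_{t-1}$ without a bias, which is an assumption based on the GRU2 naming and the later discussion of the bias; the state update $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ is one common convention; and `init_params` is a hypothetical helper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gate(variant, x_t, h_prev, W, U, b):
    """Gate signal for the standard cell (0) and the three sparse variants."""
    if variant == 0:                      # standard: input, state, and bias
        return sigmoid(W @ x_t + U @ h_prev + b)
    if variant == 1:                      # previous state and bias
        return sigmoid(U @ h_prev + b)
    if variant == 2:                      # previous state only (assumed)
        return sigmoid(U @ h_prev)
    if variant == 3:                      # bias only
        return sigmoid(b)
    raise ValueError("variant must be 0, 1, 2, or 3")

def gru_step(variant, x_t, h_prev, p):
    """One GRU time step; only the gates change between variants."""
    z = gate(variant, x_t, h_prev, p["W_z"], p["U_z"], p["b_z"])
    r = gate(variant, x_t, h_prev, p["W_r"], p["U_r"], p["b_r"])
    h_tilde = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * h_tilde

def init_params(n, k, rng):
    """Hypothetical helper: small random weights, zero biases."""
    p = {}
    for name in ("z", "r", "h"):
        p["W_" + name] = rng.normal(scale=0.1, size=(n, k))
        p["U_" + name] = rng.normal(scale=0.1, size=(n, n))
        p["b_" + name] = np.zeros(n)
    return p
```

For simplicity the sketch allocates the full parameter set for every variant; a sparse implementation would simply not create the unused gate matrices, which is where the parameter savings come from.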
In the LSTM cell unit, there are three gates named forget gate $f_t$, input gate $i_t$, and output gate $o_t$. The total number of parameters in LSTM is $4 \times (n^2 + n \times k + n)$. The parameter counts for the above three strategies are given in Table 1.
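The parameter arithmetic can be checked with a short script. It assumes that each sparse variant changes only the gates (two in GRU, three in LSTM) and leaves the recurrent hidden state block untouched, and that variant 2 keeps only the $n \times n$ matrix $U$; the function names are hypothetical, and the values should be compared against Table 1 rather than taken as the reported figures.

```python
def block_params(n, k):
    """Parameters of one full block: U (n x n), W (n x k), b (n)."""
    return n * n + n * k + n

def gate_params(n, k, variant):
    """Parameters of a single gate under each gating strategy."""
    return {
        0: n * n + n * k + n,  # standard gate: W, U, and b
        1: n * n + n,          # variant 1: U and b
        2: n * n,              # variant 2: U only (assumed)
        3: n,                  # variant 3: b only
    }[variant]

def rnn_params(n, k, variant, cell="gru"):
    """Total parameters of one GRU or LSTM cell for a given variant."""
    n_gates = {"gru": 2, "lstm": 3}[cell]
    return block_params(n, k) + n_gates * gate_params(n, k, variant)
```

For the standard cells this reproduces $3 \times (n^2 + n \times k + n)$ for GRU and $4 \times (n^2 + n \times k + n)$ for LSTM, and it makes visible that variant 3 removes almost all gate parameters.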

Results and Discussion
In this section, we give the properties of the datasets and experimental settings in Section 3.1 and Section 3.2, respectively. Subsequently, we show our evaluation results on two datasets in Section 3.3. During the course of training, we plot and analyze the effect of variant models on convergence in Section 3.4. Moreover, the number of parameters for sparse bidirectional RNN and runtime in variant models are recorded in Section 3.5 and Section 3.6.

Datasets.
We evaluated our model on two document classification datasets: IMDb and Yelp 2018.
The IMDb dataset is composed of 25 K movie reviews each for training and test data, labeled as positive or negative. Yelp 2018 includes 5 M full-text reviews with users' ratings from 0 to 5 stars for stores and services. To further explore the effects of the variants below, we run them on datasets of different scales. We split the samples into 90% training data and 10% test data.

Input.
Word embeddings are trained with Fasttext with an embedding dimension of 200. Each input document is split into sentences and padded to a fixed length of 15 sentences, and each sentence is padded to a fixed length of 50 words.
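The fixed-size input can be sketched as below, reading the setting as 15 sentences per document and 50 words per sentence. The padding index 0, padding at the end, and truncation of longer inputs are assumptions, since only the target lengths are stated; `pad_document` is a hypothetical name.

```python
def pad_document(sentences, max_sents=15, max_words=50, pad_id=0):
    """Pad or truncate a document to a max_sents x max_words grid of token ids.

    sentences: list of sentences, each a list of integer token ids
    """
    doc = []
    for sent in sentences[:max_sents]:
        sent = sent[:max_words]                               # truncate long sentences
        doc.append(sent + [pad_id] * (max_words - len(sent))) # pad short ones
    # Add all-padding sentences until the document has max_sents rows.
    doc += [[pad_id] * max_words for _ in range(max_sents - len(doc))]
    return doc
```

Every document then has the same 15 x 50 shape, which allows batching at both the word and the sentence level of the hierarchy.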

Architecture Configuration.
The model is implemented with Keras. We adopt 3 different window sizes of filters for the convolution layer. The detailed configuration of the sparse word encoder and sparse sentence encoder is given in Tables 2 and 3. The classification layer is a fully connected MLP with a ReLU activation function and softmax output.

Training Settings.
We use the Adam [23] optimizer with a batch size of 64. The learning rate is initially set to 0.001. Training runs for 30 and 40 epochs on the IMDb and Yelp 2018 datasets, respectively.

Result Comparison and Analysis.
In this section, we evaluate the HSAHSN model on two document classification datasets, IMDb and Yelp 2018, for three different variants of GRU and LSTM. Baseline models with the original RNN units are called HSAHN. In the subsequent description, we refer to the three variants of GRU as GRU1, GRU2, and GRU3 and the three variants of LSTM as LSTM1, LSTM2, and LSTM3, respectively. The results are listed in Table 4, which shows the accuracy of the three variants of GRU and LSTM with the same configuration. The data suggest that GRUs achieve better accuracy than LSTMs, by about 1.89% to 4.44% on IMDb and 0.22% to 0.5% on Yelp. Moreover, the accuracy of the regular models improves after pruning on the IMDb dataset, reaching 95.69% with HSAHSN + GRU3. Although the baseline model HSAHN + GRU gives the best result of 73.48% on Yelp, HSAHSN + GRU2 achieves a close result of 73.28%, a decrease of only about 0.2%, while substantially reducing the number of parameters and the runtime.
Compared with the sparse models in Table 4, HSAHN + GRU shows the best performance on the Yelp dataset. However, the sparse models show very similar accuracy to HSAHN + GRU on Yelp. Moreover, the sparse models improve upon the baseline models by 2% to 4% on the IMDb dataset. This shows that our sparse methods can effectively drop redundant information in the gate units of the RNN, which improves the comprehension of sequence information while reducing model complexity. Table 5 compares our experimental results with several types of popular models, wherein models marked with " * " contain an attention mechanism. Our model outperforms the other models on Yelp 2018 and achieves state-of-the-art results, with an accuracy of 73.48%. Moreover, our model improved slightly after pruning on IMDb, gaining about 0.68% over HAHNN while reducing the parameters.
The experimental results in Table 5 show that our model with self-attention outperforms HAHNN by 0.2%, and our models also give superior results to other models with attention mechanisms, which shows that our proposed method can capture the contextual importance of words and sentences in documents.

Convergence Analysis.
To experimentally verify the convergence of the RNN variant models, we plot the loss and accuracy over epochs when models are trained and tested in Figures 2 and 3, respectively. Figure 2 summarizes the loss and accuracy results, which show comparable performance among the three variants of GRU and LSTM. Compared with the other models, HSAHSN + GRU3 achieves a particularly high accuracy of 95.85%. During training, GRU2 shows similar performance to LSTM2. Specifically, the training loss of LSTM2 and GRU2 is higher by about 0.1 and 0.4, respectively, than that of the other models, and the accuracy is lower by 0.15 and 0.03. The variant 2 models thus show clearly inferior convergence and error compared with the other models.
From Figure 3, all GRU variants exhibit comparable accuracy on Yelp. The three variants of GRU and LSTM perform lower than the original GRU and LSTM models by 0.2% to 0.26% and 0.28% to 0.53%, respectively.
In Figures 2 and 3, we find that the bias plays a distinctive role in the models. When the bias is removed from the gate units, as in GRU2, the comprehension ability of the model is suppressed. The bias can provide offset compensation when the input distribution is not centered at zero, and its update by stochastic gradient descent implicitly carries information about the network state. These may explain the relative success of using the bias alone in the gate signals.
Runtime Comparison.
In order to assess the runtime per epoch of the various models, we record the runtime on datasets of different scales in Table 7, which reports the runtime for one epoch of the GRU and LSTM variants on IMDb and Yelp. On IMDb, the variants decrease the relative runtime considerably, by 24.03% to 34.42% for GRU and 32.94% to 50.59% for LSTM. The results show the same effect on Yelp, where the runtime decreases by 29.22% to 38.19% for GRU and by 15.17% to 27.36% for LSTM. Our findings suggest that the RNN accounts for a large share of the overall runtime. Specifically, parallel computation is limited by the sequential structure of RNNs, which is why the RNN takes up much of the computing time in the hybrid model.
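The relative runtime reductions quoted above are presumably computed against the per-epoch runtime of the corresponding unpruned model. A minimal sketch of that arithmetic follows; the function name and the example timings are hypothetical and are not values from Table 7.

```python
def relative_reduction(baseline_seconds, variant_seconds):
    """Percentage decrease in per-epoch runtime relative to the baseline."""
    return 100.0 * (baseline_seconds - variant_seconds) / baseline_seconds

# Hypothetical timings: a baseline epoch of 200 s and a variant epoch of 150 s.
print(relative_reduction(200.0, 150.0))  # prints 25.0
```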

Conclusions
In this paper, we propose the HSAHSN for document classification. The method is based on sparse RNNs and self-attention mechanisms at the word and sentence levels. We evaluate our models on the Yelp 2018 and IMDb datasets for classification and adopt three sparse variants of GRU and LSTM to assess the effectiveness of the models. The proposed model improves text comprehension over previous models on Yelp 2018 and IMDb. We also analyzed the number of parameters, the runtime, and the loss on the two datasets for the different sparse models.

Conflicts of Interest
The authors declare that they have no conflicts of interest.