A Hierarchical Neural-Network-Based Document Representation Approach for Text Classification

Document representation is widely used in practical applications, for example, sentiment classification, text retrieval, and text classification. Previous work is mainly based on statistics or on neural networks, which suffer from data sparsity and limited model interpretability, respectively. In this paper, we propose a general framework for document representation with a hierarchical architecture. In particular, we incorporate the hierarchical architecture into three traditional neural-network models for document representation, resulting in three hierarchical neural representation models for document classification, that is, TextHFT, TextHRNN, and TextHCNN. Our comprehensive experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), show that our proposals with hierarchical architecture outperform the corresponding neural-network models for document classification, resulting in a significant improvement ranging from 4.65% to 35.08% in terms of accuracy with a comparable (or substantially less) expense of time consumption. In addition, we find that long documents benefit more from the hierarchical architecture than short ones, as the improvement in terms of accuracy on long documents is greater than that on short documents.


Introduction
Text representation, a challenging task in Natural Language Processing (NLP), converts text spans into real-valued vectors or matrices, which is crucial for machines to understand the semantics of text. Following the text generation frame (words form phrases or sentences, and sentences form a document [1]), text representation can be divided into the following levels: word-level representation (e.g., word2vec [2], GloVe [3]), phrase-level representation [4], sentence-level representation [5], and document-level representation [6]. In this paper, we focus on document-level representation, which has broad applications such as sentiment classification [7], text retrieval [8], and text ranking [9].
The most common and simplest approaches for text representation are bag-of-words (BoW) [10] and n-grams with TF-IDF [11]. However, such statistical methods suffer from data sparsity and high dimensionality when they are applied to a large-scale corpus. Recently, plenty of approaches based on different neural-network architectures or on their combinations have been proposed for text representation, for example, FastText (based on a single hidden layer) [12], TextCNN (based on convolutional neural networks) [13], TextRNN (based on recurrent neural networks) [14], and TextRCNN (based on recurrent convolutional neural networks) [15]. Such neural-network-based models can generate low-dimensional vectors to represent text, overcoming the problem of data sparsity. In addition, compared to BoW and n-gram based approaches, neural-network-based models can better capture the semantic relationships between words [12]. However, the existing neural-network-based text representation models are individually trained for one or multiple specific tasks, for example, sentiment analysis [16] or text classification [14], ignoring the internal structural features of the text itself, for example, the word-sentence and sentence-text relationships, which we argue can be regarded as prior knowledge to help generate a better text representation.
Hence, in this paper, we propose a general structure for document representation by injecting a hierarchical architecture into neural networks. Our proposal mainly consists of a sentence representation at the word level and then a document representation at the sentence level. At the word level, each sentence is represented by utilizing specific neural networks to aggregate the embeddings of the words in the sentence. Similarly, at the sentence level, the document is represented by aggregating all sentence representations generated in the former step. We implement our proposal on public large-scale text datasets for document classification. Our experimental results indicate that the hierarchical architecture does help to improve the performance after being incorporated into the existing neural-network-based baselines, for example, FastText [12], TextCNN [13], and TextRNN [14].
Our major contributions are summarized as follows:
(i) We tackle the challenge of document representation for text classification by incorporating the hierarchical architecture into neural-network models.
(ii) We theoretically analyze the computational complexity of our new neural models after injecting the hierarchical architecture into existing neural-network models.
(iii) We conduct comprehensive experiments for document classification on large-scale public datasets. We find that our proposals significantly outperform the corresponding state-of-the-art baselines, achieving an improvement of around 8% in terms of accuracy with comparable or substantially less computation expense.
The remainder of this paper is organized as follows: we describe the related works in Section 2. Our proposals are described in Section 3. Section 4 presents our experimental setup. In Section 5, we report and discuss our experimental results. Finally, we conclude in Section 6.

Related Work
In this section, we briefly summarize the approaches for document classification based on various text representation schemes, that is, the traditional statistical representation (see Section 2.1) and the neural-network-based representation (see Section 2.2). In addition, we present the major differences between our proposal and previous works.

Statistical Representation Based Document Classification.
As a word is the most basic unit of semantics, the traditional one-hot representation model converts each word in the vocabulary into a sparse vector with a single high value (e.g., 1) in its position and low values (e.g., 0) elsewhere, which is employed in the bag-of-words (BoW) model [10] to reflect word frequency information. However, the BoW model can only symbolize the words and cannot reflect the semantic relationships between words. In view of that, the bag-of-means model [11] is proposed to cluster the word embeddings learned by the word2vec model [2] for text representation. Furthermore, the bag-of-n-grams model [11] is developed to take n-gram word order into account for text representation; it selects the most frequent n-grams (up to 5-grams) as the vocabulary in the BoW model. In addition, with some extra statistical information added to the BoW model, for example, TF-IDF [17], a better text representation is achieved. Besides, incorporating text features into the representation learning process, for example, noun phrases [18] and tree kernels [19], can further improve the accuracy of document classification.
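As a concrete illustration of the statistical representations discussed above, the following minimal Python sketch builds bag-of-words count vectors and a TF-IDF weighting for a toy corpus. The function names (`build_vocab`, `bow_vector`, `tfidf_vector`), the toy documents, and the particular IDF variant, log(N/df), are illustrative assumptions; the cited works may use different weighting schemes.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}


def bow_vector(doc, vocab):
    """Bag-of-words: one count per vocabulary entry (mostly zeros)."""
    counts = Counter(doc)
    return [counts.get(tok, 0) for tok in vocab]


def tfidf_vector(doc, docs, vocab):
    """TF-IDF with tf = raw count and idf = log(N / df) (assumed variant)."""
    n_docs = len(docs)
    counts = Counter(doc)
    vec = []
    for tok in vocab:
        df = sum(1 for d in docs if tok in d)  # document frequency, never 0 here
        vec.append(counts.get(tok, 0) * math.log(n_docs / df))
    return vec


# Toy corpus (hypothetical); each document is a list of tokens.
docs = [
    ["good", "food", "good", "service"],
    ["bad", "food"],
    ["good", "price"],
]
vocab = build_vocab(docs)
bow = bow_vector(docs[0], vocab)
tfidf = tfidf_vector(docs[0], docs, vocab)
```

Each vector has one entry per vocabulary word, so its length grows with the vocabulary while most entries stay zero, which is the sparsity and dimensionality issue raised in this section.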
Clearly, text classification based on statistical representation has made steady progress. However, such traditional statistical representation approaches inevitably suffer from data sparsity and high dimensionality, which prevents their application to large-scale corpora. In addition, such approaches are simply built on shallow statistics, and the deeper semantic information of text is not well exploited. Instead, our proposal, which is based on deep learning with neural networks, is able to learn low-dimensional vectors that overcome such problems.

Neural Representation Based Document Classification.
Since Bengio et al. [20] first employed a neural-network architecture to train a language model, great attention has been paid to proposing neural-network-related models for document classification. For instance, the FastText model proposed by Joulin et al. [12] uses one hidden layer to integrate all input information and achieves considerable results. However, this model only considers the mean of word vectors and discards the signal of word order. In order to overcome the problem of insufficient training data, which often appears in single-task supervised learning, Liu et al. [14] use a multitask learning framework with an RNN structure to jointly learn across multiple related tasks. Compared to RNN, CNN is easier to train and captures the global text information. For instance, Kim [13] employs a CNN to classify documents, while Zhang et al. [11] also employ a CNN, at the character level, to represent documents. Furthermore, a combination of these neural-network models can integrate the advantages of each single neural network. For instance, RCNN is proposed to adopt a recurrent structure to grasp the context information and can identify the key components in text by employing a max-pooling layer [15].
Although the neural-network models above utilize complex neural-network structures for deep learning and are able to uncover hidden text features, such models are built on one or multiple tasks and lack interoperability. In addition, such approaches directly employ the neural-network architecture to get the document representation vectors without considering the structural features of texts, making the models less interpretable. Instead, our proposal pays special attention to the process of generating texts, which is based on a hierarchical architecture and can improve the interpretability of neural-network-based models.

Approach
In this section, we first formally describe our proposal, that is, the hierarchical neural representation, in Section 3.1. Then, in Section 3.2, we detail three newly proposed models based on our hierarchical neural architecture, that is, TextHFT, TextHRNN, and TextHCNN, which combine the hierarchical architecture with the corresponding models, that is, FastText, RNN, and CNN, respectively. In addition, a comprehensive computational complexity analysis is conducted on all discussed models.

General Framework.
First of all, we propose a general framework for document representation with a hierarchical neural architecture. Figure 1 illustrates the major workflow of our proposal for document representation. Given a document $D$ that consists of $L_2$ sentences, with each sentence $s_i$ having $L_3$ words, we denote the document as $D = \{s_1, s_2, \ldots, s_{L_2}\}$ and the sentence as $s_i = \{w_1^{s_i}, w_2^{s_i}, \ldots, w_{L_3}^{s_i}\}$. Next, as indicated in Figure 1, our proposed hierarchical neural architecture for document representation mainly consists of six processes.
Word Representation. We use the word embedding to convert each word $w_j^{s_i}$ into a real-valued vector $\mathbf{w}_j^{s_i}$ ($\mathbf{w}_j^{s_i} \in \mathbb{R}^{m}$), where $m$ is the dimension of word embeddings.
Word Combination. To integrate all words in a sentence $s_i$, the representation $\{\mathbf{w}_1^{s_i}, \mathbf{w}_2^{s_i}, \ldots, \mathbf{w}_{L_3}^{s_i}\}$ obtained from the previous step is used as input to the neural network at the word level.
Sentence Representation. By word combination, we can get the output of the neural networks and regard it as the representation $\mathbf{s}_i$ of sentence $s_i$.
Sentence Combination. Similar to word combination, we use the representations of all sentences $\mathbf{s}_i$ ($\mathbf{s}_i \in \mathbb{R}^{m}$) with $i = \{1, 2, \ldots, L_2\}$ as the input of the neural networks at the sentence level.
Document Representation. Likewise, we can get the output of the neural networks at the sentence level as the representation $\mathbf{d}$ of document $D$.
Document Classification. The document representation $\mathbf{d}$ can be used as input for document classification. After the document representation module, we transform $\mathbf{d}$ into $\mathbf{d}^* = (d_1^*, d_2^*, \ldots, d_C^*)$ by a fully connected layer, where $C$ indicates the total number of document categories. Then, we employ a softmax function to produce the predictive distribution $\mathbf{q}$ over all document categories, where its element $q_c$ indicates the probability of the document belonging to a specific category $c$:
$$q_c = \frac{\exp(d_c^*)}{\sum_{k=1}^{C} \exp(d_k^*)}.$$
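The six processes of the framework can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the names `represent_document`, `classify`, and `mean_pool` are hypothetical, the word-level and sentence-level networks are passed in as plain functions, and a simple mean is used as a placeholder for both.

```python
import math


def mean_pool(vectors):
    """Placeholder combination network: element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def represent_document(doc, embed, word_net=mean_pool, sent_net=mean_pool):
    """doc: list of sentences, each a list of tokens; returns the document vector d."""
    sentence_vecs = []
    for sentence in doc:
        word_vecs = [embed[tok] for tok in sentence]  # word representation
        sentence_vecs.append(word_net(word_vecs))     # word combination -> s_i
    return sent_net(sentence_vecs)                    # sentence combination -> d


def softmax(scores):
    """Numerically stable softmax over the class scores d*."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(d, weights, bias):
    """Fully connected layer d* = W d + b followed by softmax; returns q."""
    scores = [sum(w * x for w, x in zip(row, d)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


# Toy two-dimensional embeddings and a two-class layer (all values hypothetical).
embed = {"good": [1.0, 0.0], "bad": [0.0, 1.0], "food": [0.5, 0.5]}
doc = [["good", "food"], ["bad"]]
d = represent_document(doc, embed)
q = classify(d, weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Swapping `word_net` and `sent_net` for a recurrent or convolutional encoder gives the other hierarchical variants without changing the surrounding pipeline, which is the sense in which the framework is general.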

Hierarchical Neural Representation.
In this section, we present our proposed document representation models with hierarchical architecture (i.e., TextHFT, TextHRNN, and TextHCNN) and make a detailed analysis of their complexity. For simplicity, suppose that we have $L_1$ documents in the corpus, each document has the same length of $L_2$ sentences, and each sentence has the same length of $L_3$ words.

TextHFT.
As shown in Figure 2, the key component of FastText integrates all word representations on a hidden layer. Different from the traditional FastText that directly averages all word embeddings of the document, TextHFT first averages all word embeddings of a sentence to get the sentence representation and then averages all sentence representations to get the document representation. Thus, we can get
$$\mathbf{s}_i = \frac{1}{L_3} \sum_{j=1}^{L_3} \mathbf{w}_j^{s_i}$$
at the word level, where $L_3$ is the number of words in sentence $s_i$. At the sentence level, we can then represent a document as
$$\mathbf{d} = \frac{1}{L_2} \sum_{i=1}^{L_2} \mathbf{s}_i,$$
where $L_2$ is the number of sentences in the document. By doing so, each document can be represented by a vector.
From the above analysis, the complexity of this hierarchical architecture is mainly related to the sequence length. In particular, with $L_1$ documents, $L_2$ sentences per document, and $L_3$ words per sentence, the complexity for FastText is $O(n)$, where $n = L_1 \times L_2 \times L_3$. However, in the TextHFT model, the complexity at the word level is $O(L_1 \times L_2 \times L_3)$, that is, $O(n)$, and that at the sentence level is $O(L_1 \times L_2)$. In total, the complexity for TextHFT is $O(n(1 + 1/L_3))$.
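The difference between the flat average of FastText and the two-step average of TextHFT can be made concrete with a small sketch. The function names and the toy one-dimensional embeddings below are illustrative assumptions, not part of the original models' code.

```python
def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fasttext_doc_vector(doc):
    """Flat average over every word embedding in the document."""
    return mean([w for sentence in doc for w in sentence])


def texthft_doc_vector(doc):
    """Average the words of each sentence, then average the sentence vectors."""
    return mean([mean(sentence) for sentence in doc])


# Toy document: one sentence with three words and one sentence with one word,
# each word being a one-dimensional "embedding".
doc = [[[1.0], [1.0], [1.0]], [[5.0]]]
flat = fasttext_doc_vector(doc)          # (1 + 1 + 1 + 5) / 4
hierarchical = texthft_doc_vector(doc)   # (1 + 5) / 2
```

When all sentences have the same length the two averages coincide; they differ when sentence lengths vary, because the hierarchical average gives every sentence the same weight regardless of how many words it contains.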

TextHRNN.
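To overcome the problems of vanishing gradients and of context scarcity, we apply the bidirectional Long Short-Term Memory RNN (Bi-LSTM) model to the text sequence (shown in Figure 3). At the word level, for each sentence $s_i$ we obtain
$$\overrightarrow{h}_j^{s_i} = \overrightarrow{\mathrm{LSTM}}(\mathbf{w}_j^{s_i}), \qquad \overleftarrow{h}_j^{s_i} = \overleftarrow{\mathrm{LSTM}}(\mathbf{w}_j^{s_i}),$$
where $j = \{1, 2, \ldots, L_3\}$, and $\overrightarrow{h}_j^{s_i}$ and $\overleftarrow{h}_j^{s_i}$ are the outputs of forward LSTM and backward LSTM, respectively. We then concatenate $\overrightarrow{h}_j^{s_i}$ and $\overleftarrow{h}_j^{s_i}$ to obtain a hidden output $\mathbf{h}_j^{s_i}$ as
$$\mathbf{h}_j^{s_i} = [\overrightarrow{h}_j^{s_i}; \overleftarrow{h}_j^{s_i}].$$
After that, to encode the hidden output of sentence $s_i$, denoted by $\mathbf{h}^{s_i}$, into the sentence representation $\mathbf{s}_i$ with a fixed length, we add a fully connected layer as
$$\mathbf{s}_i = W_{s}^{\top} \mathbf{h}^{s_i} + \mathbf{b}_{s},$$
where $W_{s} \in \mathbb{R}^{2u \times m}$ is a weight matrix, $\mathbf{b}_{s} \in \mathbb{R}^{m}$ is a bias term, and $u$ is the output dimension of each LSTM.
Similarly, at the sentence level, by applying Bi-LSTM to the sentence sequence $\{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{L_2}\}$, we will produce
$$\overrightarrow{h}_i^{D} = \overrightarrow{\mathrm{LSTM}}(\mathbf{s}_i), \qquad \overleftarrow{h}_i^{D} = \overleftarrow{\mathrm{LSTM}}(\mathbf{s}_i),$$
where $i = \{1, 2, \ldots, L_2\}$. Again, we concatenate each $\overrightarrow{h}_i^{D}$ and $\overleftarrow{h}_i^{D}$ to obtain a hidden output $\mathbf{h}_i^{D}$ as
$$\mathbf{h}_i^{D} = [\overrightarrow{h}_i^{D}; \overleftarrow{h}_i^{D}].$$
Then, the document representation $\mathbf{d}$ can be generated from the hidden output of the document, denoted by $\mathbf{h}^{D}$, by
$$\mathbf{d} = W_{d}^{\top} \mathbf{h}^{D} + \mathbf{b}_{d},$$
where $W_{d} \in \mathbb{R}^{2u \times m}$ is a weight matrix and $\mathbf{b}_{d} \in \mathbb{R}^{m}$ is a bias term.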
From the above model description, in TextHRNN, we can find that the major computation consumption is focused on the Bi-LSTM layer and the fully connected layer. In particular, the Bi-LSTM layer is a process of the cross product of input matrices, so the complexity is proportional to the square of the sequence length, that is, $O(L_3^2)$ at the word level and $O(L_2^2)$ at the sentence level, while, for the fully connected layer, we mainly focus on reshaping the input matrix, resulting in a complexity proportional to the sequence length, that is, $O(L_3)$ at the word level and $O(L_2)$ at the sentence level. Clearly, as $O(L_3^2) \gg O(L_3)$ and $O(L_2^2) \gg O(L_2)$, the consumption of the fully connected layer can be ignored. Therefore, the complexity of TextRNN and TextHRNN is $O(L_1 (L_2 L_3)^2)$ and $O(L_1 (L_2 L_3)^2 (1/L_2 + 1/L_3^2))$, respectively.
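The factor in the TextHRNN expression can be checked by writing out the two levels separately. The derivation below is a sketch that takes the stated assumption at face value, namely that one Bi-LSTM pass costs the square of its input length, and writes $L_1$, $L_2$, and $L_3$ for the number of documents, sentences per document, and words per sentence (symbol names chosen here for illustration).

```latex
\begin{align*}
\text{TextRNN:}\quad  & L_1 \,(L_2 L_3)^2
  && \text{one pass over all } L_2 L_3 \text{ words of each document} \\
\text{TextHRNN:}\quad & L_1 \left( L_2 \cdot L_3^2 + L_2^2 \right)
  && L_2 \text{ word-level passes plus one sentence-level pass} \\
  & = L_1 (L_2 L_3)^2 \left( \frac{1}{L_2} + \frac{1}{L_3^2} \right)
\end{align*}
```

Since both $1/L_2$ and $1/L_3^2$ are well below 1 for documents with many sentences and words, the hierarchical variant is cheaper by roughly a factor of $L_2$ under this cost model.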

TextHCNN.
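Similar to [13] (shown in Figure 4), at the word level, we first convolute the word sequence $\{\mathbf{w}_1^{s_i}, \mathbf{w}_2^{s_i}, \ldots, \mathbf{w}_{L_3}^{s_i}\}$ with $i = \{1, 2, \ldots, L_2\}$ using different filter operators $W_h \in \mathbb{R}^{h \times m}$ with ($h = 1, \ldots, H$) to get the feature maps
$$\mathbf{c}_h^{s_i} = [c_{h,1}^{s_i}, c_{h,2}^{s_i}, \ldots, c_{h,L_3-h+1}^{s_i}],$$
where $H$ is the number of filter operators and $m$ is the dimension of word embeddings. In detail, the filter operator $W_h \in \mathbb{R}^{h \times m}$ in the convolution layer is applied to a window of $h$ words to produce a new feature $c_{h,j}^{s_i}$ at position $j$ of $\mathbf{c}_h^{s_i}$. That is actually done by convoluting a window of $h$ word embeddings $\mathbf{w}_{j:j+h-1}^{s_i}$ as
$$c_{h,j}^{s_i} = f(W_h \cdot \mathbf{w}_{j:j+h-1}^{s_i} + b_h),$$
where the notation $(\cdot)$ means the dot product, $f$ is a nonlinear function, and $b_h \in \mathbb{R}$ is a bias term. After that, we employ the max-over-time pooling operation [21] on the different feature maps $\mathbf{c}_h^{s_i}$ to capture the most important feature $\hat{c}_h^{s_i}$:
$$\hat{c}_h^{s_i} = \max\{\mathbf{c}_h^{s_i}\}.$$
Then, after concatenating all $\hat{c}_h^{s_i}$ with $h = \{1, 2, \ldots, H\}$ as $\mathbf{c}^{s_i} = (\hat{c}_1^{s_i}, \hat{c}_2^{s_i}, \ldots, \hat{c}_H^{s_i})$, the sentence representation $\mathbf{s}_i$ can be generated by
$$\mathbf{s}_i = W_{s}^{\top} \mathbf{c}^{s_i} + \mathbf{b}_{s},$$
where $W_{s} \in \mathbb{R}^{H \times m}$ is a weight matrix and $\mathbf{b}_{s} \in \mathbb{R}^{m}$ is a bias term.
Similarly, at the sentence level, we convolute the sentence sequence $(\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{L_2})$ in document $D$ using different filter operators $W_h$ with ($h = 1, \ldots, H$) to get the feature maps $\mathbf{c}_h^{D}$ as
$$\mathbf{c}_h^{D} = [c_{h,1}^{D}, c_{h,2}^{D}, \ldots, c_{h,L_2-h+1}^{D}], \qquad c_{h,i}^{D} = f(W_h \cdot \mathbf{s}_{i:i+h-1} + b_h).$$
Again, we employ the max-over-time pooling operation on the different feature maps $\mathbf{c}_h^{D}$ to get the corresponding important feature $\hat{c}_h^{D}$:
$$\hat{c}_h^{D} = \max\{\mathbf{c}_h^{D}\}.$$
Finally, the document representation $\mathbf{d}$ can be produced as
$$\mathbf{d} = W_{d}^{\top} \mathbf{c}^{D} + \mathbf{b}_{d}$$
by concatenating all $\hat{c}_h^{D}$ as $\mathbf{c}^{D} = (\hat{c}_1^{D}, \hat{c}_2^{D}, \ldots, \hat{c}_H^{D})$, where $W_{d} \in \mathbb{R}^{H \times m}$ is a weight matrix and $\mathbf{b}_{d} \in \mathbb{R}^{m}$ is a bias term.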
From the above description, we can find that the main computation consumption is attributed to the convolution layer, the max-pooling layer, and the fully connected layer, which is only related to the sequence length, that is, $n$. Thus, similar to FastText, the complexity of TextCNN and TextHCNN is $O(n)$ and $O(n(1 + 1/L_3))$, respectively.
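The convolution and max-over-time pooling steps of TextHCNN can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, tanh stands in for the unspecified nonlinear function, each window size has a single filter, no padding is applied (so the input must be at least as long as the filter), and the final fully connected layer is omitted.

```python
import math


def conv_feature_map(seq, filt, bias, f=math.tanh):
    """Slide a filter of h rows over seq and return the feature map c_h.

    seq:  list of L vectors of dimension m
    filt: list of h vectors of dimension m (the filter W_h)
    Assumes len(seq) >= len(filt); shorter inputs would need padding.
    """
    h = len(filt)
    feature_map = []
    for j in range(len(seq) - h + 1):
        window = seq[j:j + h]
        dot = sum(w * x
                  for f_row, x_row in zip(filt, window)
                  for w, x in zip(f_row, x_row))
        feature_map.append(f(dot + bias))
    return feature_map


def encode(seq, filters, biases):
    """Max-over-time pooling of each feature map, concatenated into one vector."""
    return [max(conv_feature_map(seq, filt, b))
            for filt, b in zip(filters, biases)]


# Three two-dimensional word vectors and two filters with h = 1 and h = 2
# (all values hypothetical).
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
biases = [0.0, 0.0]
c = encode(words, filters, biases)
```

The same `encode` function can be applied a second time to the sequence of sentence vectors, which is how the sentence level mirrors the word level in the hierarchical model.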
For a clear description, we compare the complexity of the neural-network-based models discussed in this paper with and without the hierarchical architecture in Table 1. Typically, as $L_1$, $L_2$, and $L_3 \gg 1$, Table 1 shows that adding the hierarchical architecture to FastText and TextCNN makes only a slight change to the complexity. However, compared to TextRNN, TextHRNN shows a significant decrease in complexity, that is, injecting the hierarchical architecture into TextRNN substantially reduces its cost.
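To get a feeling for the size of these effects, the leading-order expressions from the complexity analysis can be evaluated numerically. The sketch below drops all constant factors, and the corpus sizes passed in are hypothetical values chosen for illustration, not the statistics of the datasets used in the experiments.

```python
def complexity_counts(l1, l2, l3):
    """Leading-order operation counts from the complexity analysis.

    l1: number of documents, l2: sentences per document, l3: words per sentence.
    """
    n = l1 * l2 * l3
    return {
        "FastText/TextCNN": n,
        "TextHFT/TextHCNN": n + l1 * l2,             # n * (1 + 1/l3)
        "TextRNN": l1 * (l2 * l3) ** 2,
        "TextHRNN": l1 * (l2 * l3 ** 2 + l2 ** 2),   # word level + sentence level
    }


counts = complexity_counts(l1=1000, l2=10, l3=20)
overhead = counts["TextHFT/TextHCNN"] / counts["FastText/TextCNN"]
speedup = counts["TextRNN"] / counts["TextHRNN"]
```

With these illustrative sizes the hierarchical FastText and CNN variants cost about 5% more than their flat counterparts, while the hierarchical RNN needs roughly a tenth of the operations of the flat one.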

Experiments
In this section, we first describe the datasets used in our experiments in Section 4.1. We then present the research questions in Section 4.2 that guide our experiments. Next, we provide the details about our evaluation metrics and baselines in Section 4.3 and detail our experimental settings and parameters in Section 4.4.

Datasets.
We implement our experiments on two large-scale public datasets that can be used for document representation and classification, that is, Yelp 2016 and Amazon Reviews (Electronics). The statistics of the datasets are summarized in Table 2. For each dataset, we randomly sample 80% of the data for training, 10% for validation, and the remaining 10% for test.
(i) Yelp 2016 is obtained from the Yelp Dataset Challenge in 2016 (https://www.yelp.com/dataset/challenge), which has five levels of ratings from 1 to 5. In other words, we can classify the documents into five classes.
(ii) Amazon Reviews (Electronics) are obtained from Amazon products data (http://jmcauley.ucsd.edu/data/amazon/). This dataset contains the product reviews and the metadata from Amazon from May 1996 to July 2014. Similarly, five levels of ratings from 1 to 5 are given to product reviews.
As shown in Table 2, the most notable differences between Yelp 2016 and Amazon Reviews (Electronics) lie in the number of documents and the size of vocabulary, which could have an impact on the performance of text classification.
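The 80%/10%/10% random split described for each dataset can be sketched as below. The function name `split_dataset`, the fixed seed, and the use of integer stand-ins for documents are assumptions made for illustration; the original sampling procedure is not specified beyond the proportions.

```python
import random


def split_dataset(examples, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/validation/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]   # whatever remains goes to the test set
    return train, valid, test


# 1000 integer ids standing in for documents.
train, valid, test = split_dataset(range(1000))
```

Fixing the seed keeps the three subsets identical across runs, so that every model is trained and evaluated on the same documents.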

Research Questions.
The research questions guiding our experiments are listed as follows.
(RQ1) Compared to the traditional text representation models, does the hierarchical architecture help to better represent documents? That is, do the neural-network models achieve a higher document classification accuracy after the hierarchical architecture is injected into them?
(RQ2) How does the number of sentences affect the classification performance of the proposed models with hierarchical architecture in terms of accuracy?
(RQ3) How does the document length affect the classification performance of the proposed models with hierarchical architecture in terms of accuracy?
Answers to these three questions would provide valuable insights into the utility of the hierarchical architecture in neural-network-based models for document representation and classification.

Models and Metrics.
The typical neural-network-based models for document classification, for example, FastText [12], TextRNN [14], and TextCNN [13], are taken into account as baselines in this paper. Correspondingly, we inject the hierarchical architecture into these baselines, leading to TextHFT, TextHRNN, and TextHCNN, respectively.
For evaluation, we use accuracy and time consumption as the metrics, where accuracy is a standard metric to measure the overall document classification performance and time consumption reflects the relative time needed for model training. In detail, the metric accuracy can be computed as
$$\mathrm{accuracy} = \frac{\sum_{i=1}^{N} \mathrm{sgn}(\mathrm{predict}(i), \mathrm{ground\ truth}(i))}{N}, \quad (19)$$
where $N$ is the total number of test documents, $\mathrm{sgn}(a, b)$ is a sign function ($\mathrm{sgn}(a, b) = 1$ when $a$ equals $b$; otherwise, $\mathrm{sgn}(a, b) = 0$), $\mathrm{ground\ truth}(i)$ indicates the ground truth of the class label for document $i$, and $\mathrm{predict}(i)$ returns the predicted class label for document $i$ by
$$\mathrm{predict}(i) = \mathrm{argmax}(\mathbf{q}),$$
where $\mathrm{argmax}(\mathbf{q})$ returns the class label of the maximal element in $\mathbf{q}$, which is the predictive probability distribution of a document over all classes (see Section 3.1).
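The accuracy metric can be written directly from its definition. In the following sketch, the function names and the toy distributions are illustrative assumptions, and class labels are taken to be zero-based indices into each predictive distribution.

```python
def predict(q):
    """Return the index of the largest probability in q (argmax)."""
    return max(range(len(q)), key=lambda c: q[c])


def accuracy(distributions, labels):
    """Fraction of documents whose predicted class equals the ground truth."""
    hits = sum(1 for q, y in zip(distributions, labels) if predict(q) == y)
    return hits / len(labels)


# Four toy predictive distributions over three classes and their true labels.
qs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6], [0.5, 0.4, 0.1]]
ys = [1, 0, 1, 0]
acc = accuracy(qs, ys)
```

Three of the four toy documents are classified correctly here, so `acc` evaluates to 0.75.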

Experimental Setup.
For data processing, in order to produce the hierarchical architecture, we split the documents into sentences and tokenize each sentence using Stanford CoreNLP [22]. Besides, we discard single-character words and punctuation. We randomly initialize the word embedding matrix, which is updated by stochastic gradient descent, and set the embedding dimension to 200 [23]. For initializing the neural networks, we adopt the Xavier initialization approach to keep the scale of the gradients roughly the same in all layers [24]. We use the cross entropy function as the loss function and set the batch size to 30 (i.e., 30 documents) [1]. Gradient clipping is adopted by scaling gradients when the norm exceeds a threshold of 5 [25]. In addition, we use the stochastic gradient descent approach to train all models with a learning rate of 0.001 [23]. In order to overcome the problem of overfitting, we set the number of training batches to 10000.
In addition, for TextHFT, in the hidden layer, we employ the mean layer to average all word embeddings. For TextHRNN, the number of neural cells is set to 80 (80 LSTM cells in one layer) and 3 layers are deployed. In order to accelerate the training of deep networks, we adopt batch normalization [26] in the model training process. For TextHCNN, the window size of $h$ words in filter $W_h$ is set to $h = \{1, 2, \ldots, 7\}$ so as to fully take the word order into consideration. Besides, we set the dropout rate in the dropout layer of our TextHRNN and TextHCNN models to 0.5 [27].
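Three of the training settings above (Xavier initialization, gradient clipping at a norm of 5, and stochastic gradient descent with learning rate 0.001) can be sketched without any deep learning library. The sketch assumes the uniform variant of Xavier initialization and clipping by the L2 norm of a single gradient vector; the original setup may differ in both respects, and the function names and the 200-by-80 layer shape are illustrative only.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-a, a) for _ in range(fan_out)] for _ in range(fan_in)]


def clip_by_norm(grad, threshold=5.0):
    """Rescale the gradient so that its L2 norm does not exceed the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


def sgd_step(params, grad, lr=0.001):
    """One stochastic gradient descent update using the clipped gradient."""
    clipped = clip_by_norm(grad)
    return [p - lr * g for p, g in zip(params, clipped)]


rng = random.Random(0)
weights = xavier_uniform(fan_in=200, fan_out=80, rng=rng)  # e.g. embedding -> cells
clipped = clip_by_norm([30.0, 40.0])   # norm 50, rescaled to norm 5
updated = sgd_step([1.0, 1.0], [30.0, 40.0])
```

Clipping preserves the direction of the gradient and only shortens it, so a single unusually large batch gradient cannot move the parameters by more than the learning rate times the threshold.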

Results and Discussions
In Section 5.1, we examine the performance of our proposed models with hierarchical architecture on public datasets. Section 5.2 then examines the impact of the number of sentences, and Section 5.3 zooms in on the effect of varying the document length on document classification.

Performance Comparison.
To answer RQ1, in Table 3, we present the experimental results of all neural-network-based models discussed in this paper for document classification on Yelp 2016 and Amazon Reviews (Electronics), respectively.
Clearly, as shown in Table 3, on the Yelp 2016 dataset, our models with hierarchical architecture, that is, TextHFT, TextHRNN, and TextHCNN, clearly outperform the corresponding neural-network-based models, that is, FastText, TextRNN, and TextCNN, in terms of accuracy. In particular, TextHFT presents an improvement of 9.13% against FastText. TextHRNN shows a significant improvement of 35.08% against TextRNN. TextHCNN shows an improvement of 4.65% against TextCNN. This means that, compared to FastText and TextCNN, TextRNN receives the greatest benefit from the hierarchical architecture, as a substantial improvement in terms of accuracy is observed by comparing TextHRNN against TextRNN. For complexity, compared to FastText, TextHFT presents a competitive time consumption. Similar findings can be observed by comparing TextHCNN against TextCNN in terms of time consumption. One particularly interesting point is that TextHRNN shows a marked decrease in time consumption compared to TextRNN, requiring about one-third of the relative time consumption of TextRNN. This could be explained by the fact that a sequential network, for example, RNN, favors a sequential input, which is optimized by the hierarchical architecture. These findings are consistent with the theoretical analysis on complexity in Section 3.2.
Similar findings can be observed on the Amazon Reviews (Electronics) dataset. In terms of accuracy, our proposals with hierarchical architecture, that is, TextHFT, TextHRNN, and TextHCNN, result in an improvement of 8.15%, 6.07%, and 7.35% against the corresponding FastText, TextRNN, and TextCNN, respectively. In terms of time consumption, again, no obvious differences are observed when comparing TextHFT against FastText and TextHCNN against TextCNN. A slightly different finding is that TextHRNN requires one-fourth of the time consumption of TextRNN. Furthermore, we compare the results on different datasets produced by the same model. No obvious difference in terms of accuracy can be found. However, in terms of time consumption, a dramatic drop is observed when comparing the result on the Amazon Reviews (Electronics) dataset against that on the Yelp 2016 dataset. This can be attributed to the fact that the Amazon Reviews (Electronics) dataset has a larger vocabulary than the Yelp 2016 dataset, and a larger vocabulary leads to a higher complexity.
The outcomes of the main comparisons of our proposals against the baselines on the Yelp 2016 dataset and the Amazon Reviews (Electronics) dataset demonstrate that the hierarchical architecture does help to represent the document when being injected into neural-network-based models, which results in a better performance in terms of accuracy for document classification at a comparable (or substantially less) expense in terms of time consumption.

Impact of the Number of Sentences.
To answer RQ2, we manually group the documents according to the number of sentences, for example, (0, 5], (5, 10], . . ., (25, 30], and (30, +∞), and then examine the performance of our proposals as well as the baselines on groups of documents with various numbers of sentences. We plot the results in Figures 5(a) and 5(b) for Yelp 2016 and Amazon Reviews (Electronics), respectively. As shown in Figure 5(a), as the number of sentences increases, the performance of all discussed models decreases. This indicates that the number of sentences is an important factor that influences the classification performance. However, when the number of sentences increases, the margins between the original models and their corresponding hierarchical proposals are enlarged. Similar findings can be observed on Amazon Reviews (Electronics), where the gaps between our hierarchical proposals and their corresponding original models also widen as the number of sentences increases. The aforementioned findings may be explained by the fact that the hierarchical architecture can alleviate the impact on the classification accuracy brought by the number of sentences. In other words, compared to short documents, long documents may benefit more from the hierarchical architecture. This leads us to investigate RQ3.
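The grouping used in this analysis can be sketched as follows. The helper names `bucket` and `accuracy_by_bucket` and the toy predictions are hypothetical; sizes are assumed to be positive integers, and bins are half-open intervals (lo, hi] with a final open-ended bin.

```python
from collections import defaultdict


def bucket(value, width, last_edge):
    """Label the bin (lo, hi] containing value; values above last_edge share one bin."""
    if value > last_edge:
        return f"({last_edge}, +inf)"
    lo = ((value - 1) // width) * width
    return f"({lo}, {lo + width}]"


def accuracy_by_bucket(sizes, predictions, labels, width=5, last_edge=30):
    """Group documents by size and compute the accuracy inside each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for size, pred, gold in zip(sizes, predictions, labels):
        key = bucket(size, width, last_edge)
        totals[key] += 1
        hits[key] += int(pred == gold)
    return {key: hits[key] / totals[key] for key in totals}


sizes = [3, 5, 6, 10, 31, 40]   # number of sentences per document (toy values)
preds = [1, 2, 3, 3, 5, 1]
golds = [1, 2, 3, 4, 5, 2]
result = accuracy_by_bucket(sizes, preds, golds)
```

The same helper can be reused for a grouping by document length by passing, for example, `width=100` and `last_edge=1000`.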

Impact of the Document Length.
To answer RQ3, we manually group the documents according to their length, for example, (0, 100], (100, 200], . . ., (900, 1000], and (1000, +∞), and then examine the performance of our proposals as well as the baselines on groups of documents with various lengths. We plot the results in Figures 6(a) and 6(b) for Yelp 2016 and Amazon Reviews (Electronics), respectively.
Clearly, on the Yelp 2016 dataset, the hierarchical neural models, that is, TextHFT, TextHRNN, and TextHCNN, consistently outperform the corresponding models, that is, FastText, TextRNN, and TextCNN, at all lengths. In particular, for FastText and TextHFT, as the document length increases, the accuracy slightly declines until (600, 700]. After that, a fluctuation is observed. Generally, from Figure 6(a), the relative improvement of TextHFT against FastText stays stable when the document length increases. For TextRNN and TextHRNN, a similar phenomenon can be found: the accuracy declines when the document length goes up. However, the relative improvement of TextHRNN against TextRNN is enlarged, as the accuracy gap between TextHRNN and TextRNN becomes larger when the document length goes up. In contrast, the accuracy of TextCNN and TextHCNN first increases until (200, 300] and then goes down monotonically. The relative improvement of TextHCNN against TextCNN similarly goes up when the document length increases.
On the Amazon Reviews (Electronics) dataset, interestingly, all discussed models reach their peak performance at the document length (100, 200] and then show a decrease in terms of accuracy. However, the relative improvements of our proposals with hierarchical architecture against their corresponding neural-network models, that is, TextHFT versus FastText, TextHRNN versus TextRNN, and TextHCNN versus TextCNN, follow the same trend as that observed on the Yelp 2016 dataset when the document length increases; that is, the relative improvement of TextHFT against FastText stays stable and that of TextHRNN against TextRNN (TextHCNN against TextCNN) goes up. This indicates that, compared to short documents, long documents benefit more from the hierarchical architecture, which is consistent with the analysis in Section 5.2.
Interestingly, when the document length increases, the accuracy of the models on Yelp 2016 does not vary as much as that on Amazon Reviews (Electronics). This difference may originate from the difference in the average number of words per sentence (see Table 2). Since the average number of words per sentence of Yelp 2016 is nearly 3 times that of Amazon Reviews (Electronics), if the document length increases by the same interval, the number of sentences added on Yelp 2016 is far smaller than that on Amazon Reviews (Electronics). In addition, according to the findings in Section 5.2, when the number of sentences increases, the accuracy of all discussed models declines. Thus, it may lead to the consequence that the accuracy does not vary too much as the document length increases on the Yelp 2016 dataset, while the benefit diminishes a lot on the Amazon Reviews (Electronics) dataset.

Conclusion
In this paper, we propose a general framework for document representation with a hierarchical neural architecture, which takes the text generation frame into consideration to better improve the interoperability for different tasks. In detail, we incorporate the hierarchical neural architecture into three traditional neural-network methods, that is, FastText, TextRNN, and TextCNN, leading to the new proposals, that is, TextHFT, TextHRNN, and TextHCNN, respectively.
Our experimental results on two public datasets, that is, Yelp 2016 and Amazon Reviews (Electronics), demonstrate that our new proposals significantly outperform the corresponding neural-network-based models without hierarchical architecture for document classification.In detail, our newly proposed models present a significant improvement ranging from 4.65% to 35.08% in terms of accuracy at a comparable (or substantially less) expense in terms of time consumption.In addition, we conclude that the long documents benefit more from the hierarchical architecture than the short ones as the improvement in terms of accuracy on long documents is higher than that on short documents.

Figure 1: The framework of the hierarchical neural architecture for document classification.

Figure 4: The structure of CNN.

Figure 5: Classification accuracy on documents with various sentence numbers.

Figure 6: Classification accuracy on documents with various lengths.

Table 2: Statistics of datasets (the vocabulary in datasets has gone through data cleaning, excluding single characters and punctuations as well as retaining only the lemmatized words).

Table 3: Performance comparison of all models.