Deep Learning-and Word Embedding-Based Heterogeneous Classifier Ensembles for Text Classification

. The use of ensemble learning, deep learning, and e ﬀ ective document representation methods is currently some of the most common trends to improve the overall accuracy of a text classi ﬁ cation/categorization system. Ensemble learning is an approach to raise the overall accuracy of a classi ﬁ cation system by utilizing multiple classi ﬁ ers. Deep learning-based methods provide better results in many applications when compared with the other conventional machine learning algorithms. Word embeddings enable representation of words learned from a corpus as vectors that provide a mapping of words with similar meaning to have similar representation. In this study, we use di ﬀ erent document representations with the bene ﬁ t of word embeddings and an ensemble of base classi ﬁ ers for text classi ﬁ cation. The ensemble of base classi ﬁ ers includes traditional machine learning algorithms such as naïve Bayes, support vector machine, and random forest and a deep learning-based conventional network classi ﬁ er. We analysed the classi ﬁ cation accuracy of di ﬀ erent document representations by employing an ensemble of classi ﬁ ers on eight di ﬀ erent datasets. Experimental results demonstrate that the usage of heterogeneous ensembles together with deep learning methods and word embeddings enhances the classi ﬁ cation performance of texts.


Introduction
Recently, text classification/categorization has gained remarkable attention of many researchers due to the huge number of documents and text available on the different digital platforms.Given a text or document, the main objective of text classification is to classify the given text into a set of predefined categories by using supervised learning algorithms.The supervised learning algorithms can be trained to generate a model of relationship between features and categories from samples of a dataset.Using the trained model, the learning algorithm can predict the category of a given document.
A text classification task consists of parsing of documents, tokenization, stemming, stop-word removal, and representation of documents in document-term matrix with different weighting methods, feature selection, and selection of the best classifiers by training and testing [1].Different methods are used for each of the subtasks given above.Each document is usually represented in vector space model (also called bag of words model).In vector space model, each document is represented as a vector that includes a set of terms (words) that appears in the document.The set of documents with their vector representation forms the document-term matrix.The significance of each term in a document is computed by utilizing different term weighting methods.Common term weighting methods include Boolean, term frequency, and TF-IDF weighting schemes.In addition, there has been a recent interest to employ word embeddings to represent documents.There are many types of supervised classifiers used in text categorization.A review and comparison of supervised algorithms are presented in papers [1,2].Some of the commonly used supervised algorithms include naïve Bayes (NB), k-nearest neighbours (k-NN), decision trees (DT), artificial neural networks (ANN), and support vector machines (SVM).Deep learning networks have also been attracting attention of researchers in text classification due to their high performance with less need of engineered features.
Another trend in machine learning is to increase the classification performance by using an ensemble of classifiers.In an ensemble system, a group of base classifiers is employed.If different types of classifiers are used as base learners, then such a system is called heterogeneous ensemble, otherwise homogenous ensemble.In this study, we focus on heterogeneous ensembles.An ensemble system is composed of two parts: ensemble generation and ensemble integration [3][4][5][6][7].In ensemble generation part, a diverse set of models is generated using different base classifiers.Naïve Bayes, support vector machine, random forest, and convolutional neural network learning algorithms are used as base classifiers in this study.There are many integration methods that combine decisions of base classifiers to obtain a final decision [8][9][10][11].For ensemble integration, we used majority voting and stacking methods.
In a previous paper [12], we presented an ensemble of heterogeneous classifiers to enhance the text classification performance.In this study, we try to expand our work by using deep learning methods, which have produced state-ofthe-art results in many domains including natural language processing (NLP) and text classification [13].One of the well-known deep learning-related methods is word embeddings.Word embeddings provide a vector representation of words learned from a corpus.It maps words into a vector of real numbers.While words with similar meaning are mapped into similar vectors, a more efficient representation of words with a much lower dimensional space is obtained when compared with simple bag-of-words approach.Convolutional neural network model (CNN) is another deep learning method employed in this study.CNN is added into our set of base classifiers in order to improve accuracy of the ensemble of heterogeneous classifiers.In addition, we use different document representation methods including TF-IDF weighted document-term matrix, mean of word embeddings, and TF-IDF weighted document matrix enhanced with addition of mean vectors of word embeddings as features.We have performed experiments on eight different datasets in which four of them are in Turkish.In summary, this paper utilizes and analyses an ensemble of classifiers including CNN model with word embeddings and different document representation methods to enhance the performance of text classification.
We have evaluated and discussed the performance of an ensemble of five heterogeneous base learners and two integration methods using three different document representation methods on eight different datasets on this paper.This paper is organized as follows.Section 2 gives related research on text categorization using ensemble systems, word embeddings, and CNN.Section 3 presents base learners and ensemble fusion methods used in experimental studies.Experimental setup and results are given in Section 4. Section 5 summarizes and discusses results and outlines future research directions.

Related Work
This section gives a brief summary of ensemble systems, pretrained word embeddings and deep neural networks related to text classification/categorization problem.High dimensionality of input feature space, sparsity of document vectors, and the presence of few irrelevant features are the main characteristics of text categorization problem that differs from other classification problems [14].
The study of Larkey and Croft [15] is one of the early works of applying ensemble systems to text categorization.They used an ensemble of three classifiers, k-nearest neighbour, relevance feedback, and Bayes classifiers to categorize medical documents.Dong and Han [16] used three different variants of naive Bayes and SVM classifiers.They compared the performance of six different homogenous ensembles and a heterogeneous ensemble classifier.Fung et al. [17] use a heterogeneous ensemble classifier that uses a dynamic weighting function to combine decisions.A pairwise ensemble approach is presented in [18] that achieves better performance than popular ensemble approaches bagging and ECOC.Keretna et al. [19] have worked on named entity recognition problem using an ensemble system.In another study done by Gangeh et al. [20], random subspace method is applied to text categorization problem.The paper emphases the estimation of ensemble parameters of size and the dimensionality of each random subspace submitted to the base classifiers.Boroš et al. [21] applied ensemble methods to multiclass text documents where each document can belong to more than one category.They evaluated the performance of ensemble techniques by using multilabel learning algorithms.Elghazel et al. [22] propose a novel ensemble multilabel text categorization algorithm, called multilabel rotation forest (MLRF), based on a combination of rotation forest and latent semantic indexing.Sentiment analysis with ensembles is currently a popular topic among researchers.For Twitter sentiment analysis, an ensemble classifier is proposed in [23] where the dataset includes very short texts.A combination of several polarity classifiers provides an improvement of the base classifiers.In a recent study, the predictive performance of ensemble learning methods on Twitter text documents that are represented by keywords is evaluated by Onan et al. [24] empirically.The five different ensemble methods that use four different base classifiers are applied on the documents represented by keywords.
Representation of text documents with the benefit of word embeddings is another current trend in text classification and natural language processing (NLP).Text representation with word embeddings has been successful in many of NLP applications [13].Word embeddings are lowdimensional and dense vector representation of words.Word embeddings can be learned from a corpus and can be reused among different applications.Although it is possible to generate your own word embeddings from a dataset, many investigators prefer to use pretrained word embeddings generated from a large corpus.The generation of word embeddings requires a large amount of computational power, preprocessing, and training time [25].The most commonly used pretrained word embeddings include Word2Vec 2 Complexity [26,27], GloVe [28], and fastText [29].In this study, we use pretrained Word2vec embeddings in English and Turkish.Supervised deep learning networks can model high-level abstractions and provide better classification accuracies compared with other supervised traditional machine algorithms.For this reason, supervised deep learning architectures have received significant attention in NLP and text classification recently.Kalchbrenner et al. [30] described a dynamic convolutional network that uses dynamic k-max pooling, a global pooling operation over linear sequences for sentiment classification of movie reviews and Twitter.Kim [31] reports a series of experiments with four different convolutional neural networks for sentence-level sentiment classification tasks using pretrained word vectors.Kim achieves good performance results on different datasets and suggests that the pretrained word vectors are universal and can be used for various classification tasks.In [32], a new deep convolutional neural network is proposed to utilize from character-level to sentence-level information to implement sentiment classification of short texts.In [33,34], text classification is realized with a convolutional neural network that accepts a sequence of encoded characters as input rather than words.It is shown that character-level coding is an effective method.Johnson and Zhang [35] propose a semisupervised convolutional network for text classification that learns embeddings of small text regions.Joulin et al. [29] proposes a simple and efficient baseline classifier that performs as well as deep learning classifiers in terms of accuracy and runs faster.Conneau et al. [36] present a new architecture called very deep (VDCNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations.Kowsari et al. [37] introduce a new approach to hierarchical document classification, called HDLTex that employs multiple deep learning approaches to produce hierarchical classifications.

Ensemble Learning, Word Embeddings, and Representations
This section gives a summary of learning algorithms, ensemble integration techniques, word embeddings, and document representation methods used in this study.
The probability of term w t in class c j is as follows [38]: 3.1.2.SVM.Support vector machine (SVM) is binary classifier that divides an n-dimensional space with n features into two regions related with two classes [39].The n-dimensional hyperplane separates two regions in a way that the hyperplane has the largest distance from training vectors of two classes called support vectors.SVM can also be used for a nonlinear classification using a method called the kernel trick that implicitly maps input instances into highdimensional feature spaces that can be separated linearly.
In SVM, the use of different kernel functions enables the construction of a set of diverse classifiers with different decision boundaries.Max pooling is the most common type of pooling function that takes the largest value in a specified neighbourhood window.CNN architectures consist of a series of convolutional layers interleaved with pooling layers, followed by a number of fully connected layers.
CNN is originally applied on image processing and recognition tasks where input data is in a two-dimensional (2D) structure.On image processing, CNN exploits spatial local correlation of 2D images, learns, and responds to small regions of 2D images.In natural language processing and text classification, words in a text or document are converted into a vector of numbers that has one-dimensional (1D) structure.On text processing applications, CNN can exploit the 1D structure of data (word order) by learning small text regions of data in addition to learning from a bag of independent word vectors that represents an entire document [44].

Ensemble Integration Strategies.
To combine decisions of individual base leaners MVNB, MNB, SVM, RF, and CNN, majority voting and stacking methods are used in this study.In majority voting method, an unlabelled instance is classified according to the class that obtains the highest number of votes from collection of base classifiers.In stacking method, also called stacked generalization, a metalevel classifier is used to combine the decision of base-level classifiers [45].The stacking method consists of two steps.In the first step, a set of base-level classifiers C 1 , C 2 , … , C n is generated from a sample training set S that consists of feature examples s i = x i , y i where x i is feature vector and y i is prediction (class label).A meta-dataset is constructed from the decisions of base-level classifiers.The meta-dataset contains an instance for predictions of classifiers in the original training dataset.The meta-dataset is in the form of m i = d i , y i where d i is the prediction of individual n base classifiers.The metadataset can also include both original training examples and decisions of base-level classifiers in the form of m i = x i , d i , y i to improve performance.After the generation of metadataset, a metalevel classifier is trained with meta-dataset and used to make predictions.In our study, the metadataset includes both original training examples and decisions of base-level classifiers.

Word Embeddings with Word2vec
. Word embeddings are an active research area that tries to discover better word representations of words in a document collection (corpus).The idea behind all of the word embeddings is to capture as much contextual, semantical, and syntactical information as possible from documents from a corpus.Word embeddings are a distributed representation of words where each word is represented as real-valued vectors in a predefined vector space.Distributed representation is based on the notion of distributional hypothesis in which words with similar meaning occur in similar contexts or textual vicinity.Distributed vector representation has proven to be useful in many natural language processing applications such as named entity recognition, word sense disambiguation, machine translation, and parsing [13].
Currently, Word2vec is the most popular word embedding technique proposed by Mikolov et al. [26,27].The Word2vec method learns vector representations of words from a training corpus by using neural networks.It maps the words that have similar meaning into vectors that will be close to each other in the embedded vector space.Word2vec offers a combination of two methods: CBOW (continuous bag of words) and skip-gram model.While the CBOW model predicts a word in a given context, the Skip-gram model predicts the context of a given word.Word2vec extracts continuous vector representations of words from usually very large datasets.While it is possible to generate your own vector representation model from a given corpus, many studies prefers to use pretrained models because of high computational power and training time required for large corpuses.The pretrained models have been found useful in many NLP applications.
3.4.Document Representation Methods.The effective representations of documents in a document collection have a significant role in the success performance of text processing applications.In many text classification applications, each document in a corpus is represented as a vector of real numbers.Elements of a vector usually correspond to terms (words) appearing in a document.The set of documents with their vector representation forms document-term matrix.The significance of each term in a document is computed by using different term weighting methods.Traditional term weighting methods include Boolean, term frequency, and TF-IDF weighting schemes.TF-IDF is the common weighting method used for text processing.In this representation, the term frequency for each word is multiplied by the inverse document frequency (IDF).This reduces the importance of common terms in the collection and also increases the influence of rare words which have relatively low frequencies.As explained above, word embedding is also another effective method to represent documents as a vector of numbers.
In this study, we use and compare the following document representation methods: (i) TF-IDF.A document vector consists of words appearing in a document weighted with TF-IDF scheme (ii) Avg-Word2vec.A document vector is obtained by taking average of all vectors of word embeddings appearing in a document by using pretrained models (iii) TF-IDF + Avg-Word2vec.A document vector includes both TF-IDF and Avg-Word2vec vectors

Experiments
In this study, we have evaluated the performance of an ensemble of eight heterogeneous base learners with two integration methods using three different document representation methods on eight different datasets.and English.Turkish is an agglutinative language in which words are composed of a sequence of morphemes (meaningful word elements).A single Turkish word can correspond to a sentence that can be expressed with several words in other languages.Turkish datasets include news articles from Milliyet, Hurriyet, 1150haber, and Aahaber datasets.English datasets consist of 20News-19997, 20News-18828, Mininews, and WebKB4.The properties of each dataset are explained below.
Milliyet dataset includes text from columns of Turkish newspaper Milliyet from years 2002 to 2011.It contains 9 categories and 1000 documents for each category.The categories of this dataset are café (cafe), dünya (world), ege (region), ekonomi (economy), güncel (current), siyaset (politics), spor (sports), Türkiye (Turkey), and yaşam (life).Hurriyet dataset includes news from 2010 to 2011 on Turkish newspaper Hurriyet.It contains six categories and 1000 documents for each category.Categories in this dataset are dünya (world), ekonomi (economy), güncel (current), spor (sports), siyaset (politics), and yaşam (life).1150haber dataset is obtained from a study done by Amasyalı and Beken [46].It consists of 1150 Turkish news texts in five classes (economy, magazine, health, politics, and sports) and 230 documents for each category.Aahaber, collected by Tantug [47], is a dataset that consists of newspaper articles broadcasted by Turkish National News Agency, Anadolu Agency.This dataset includes eight categories and 2500 documents for each category.Categories are Turkey, world, politics, economics, sports, education science, "culture and art", and "environment and health".
Milliyet and 1150haber include the writings of the column writers; therefore, they are longer and more formal.On the other hand, Hurriyet and Aahaber datasets contain traditional news articles.They are more irregular, much shorter than documents of the other datasets.
The 20 newsgroup is a popular English dataset used in many text classification and clustering experiments [48].The 20 newsgroup dataset is a collection of about 20,000 documents extracted from 20 different newsgroups.The data is almost evenly partitioned into 20 categories.We use three versions of newsgroup dataset.The first one is the original 20 newsgroup dataset that includes 19,997 documents.It is called as 20News-19997.The second one is named 20News-18828 with 18,828 documents, and it covers less number of documents than the original dataset.This dataset includes messages with only "From" and "Subject" headers with the removal of cross-post duplicate messages.The third one is a small subset of the original dataset composed of 100 postings per class, and it is called as mininewsgroups.The last dataset is called WebKB [49] which includes web pages gathered from computer science departments of different universities.These web pages are composed of seven categories (student, faculty, staff, course, project, department, and other) and contain approximately 8300 pages.Another version of WebKB is called WebKB4 where the number of categories is reduced to four.We use WebKB4 in our experiments.
Characteristics of the datasets without application of any preprocessing procedures are given in Table 1 where C is the number of classes, D is the number of documents, and V is the vocabulary size.We only filter infrequent terms whose document frequency is less than three.We do not apply any stemming or stop-word filtering in order to avoid any bias that can be introduced by stemming algorithms or stop-word lists.

Experiment Results
. We use repeated holdout method in our experiments.We randomly divide a dataset into two halves where 80% of data is used for training and 20% for testing.To get a reliable estimation, we repeat the holdout process 10 times and an overall accuracy is computed by taking averages of each iteration.Classification accuracies and computation times of algorithms on different datasets are presented below.
As a first step, we calculate accuracies of individual classifiers to compare and observe results that we obtain by using an ensemble of classifiers with different representation methods.As explained before, base classifiers include MVNB, MNB, SVM, RF, and CNN.Table 2 lists the accuracies of these classifiers on eight different Turkish and English datasets.The best accuracy is obtained for each dataset and is shown in boldface.As it can be seen from Table 2, there is no best algorithm that performs well on each dataset like in many machine language problems.It seems that RF and CNN generally produce better accuracies from the rest of the algorithms.The random forest (RF) is the best algorithm in our experiments.This might be due to ensemble strategy applied in RF algorithm that uses a set of decision trees for classification.CNN also performs well.That might be because of the use of different convolutional filters that work like an ensemble system that extracts different features from datasets.The order of average classification accuracies of single classifiers can be summarized as follows: RF > CNN > MNB > SVM > MVNB.
To construct a heterogeneous ensemble system, we use MVNB, MNB, SVM, RF, and CNN algorithms as base classifiers.The decisions of each of these base classifiers are combined with majority voting (MV) and stacking (STCK) integration methods as described in Section 3.2.Each dataset is represented with traditional TF-IDF weighting scheme.Table 3 demonstrates the classification accuracies of heterogeneous ensemble systems with majority voting (Heter-MV) and stacking (Heter-STCK) together 5 Complexity with each of the single classifiers.As it can be seen from Table 3, an ensemble system with stacking integration strategy always performs better than single classifiers and the ensemble model with majority integration strategy.On our previous study [50], we obtain classification accuracies 85.44 and 87.73 for Turkish Hurriyet and Aahaber datasets, respectively, represented with TF (term frequency) weighting by using an ensemble of classifiers without CNN.With the inclusion of CNN to our set of base learners, improved accuracies are obtained with 89.55 and 92.70 for Turkish Hurriyet and Aahaber datasets, respectively.
In text mining applications, different words appearing in a document collection form feature set where the number of features is usually expressed in thousands.High dimensionality of feature space is a problem in text classification when documents are represented with "bag of words" model.Word2vec can be used as a feature extraction technique to reduce the number of features.The average of Word2vec vectors of words is employed to represent documents.Given a document d represented with n words d = w 1 , w 2 , … , w n , words appearing a document are represented with Word2vec embedding vectors e w 1 , e w 2 , … , e w n by looking up the vector representation of a word from a pretrained embedding model.
Each document is represented by taking average of word embeddings as follows: Google has used Google News dataset that contains about 100 billion words to obtain pretrained vectors with the Word2vec skip-gram algorithm [26,51].The pretrained model includes word vectors for about 3 million words and phrases.Each vector has 300 dimensions or features.A pretrained Turkish Word2vec model is constructed with all Wikipedia articles written in Turkish [52].We use these pretrained models in English and Turkish to represent documents with 300 dimensions or features.
Table 4 shows classification accuracies of documents represented by average of Word2vec vectors, called as Avg-Word2vec, in this study.Classification accuracies of single classifiers and heterogeneous ensemble systems are given in Table 4. Ensemble method with stacking integration strategy (Heter-Stck) produces better outcomes than the ensemble method with majority voting integration strategy (Heter-MV).We observe that there is a slight decrease in the classification accuracies of documents represented with Avg-Word2vec on the six datasets (20News-18828, 20News-19997, Mini-news, 1150-haber, Milliyet, and Hurriyet) when compared with the classification accuracies (Heter-Stck) of documents represented with TF-IDF given in Table 3.On the contrary, there is a slight improvement on two datasets (WebKB4, Aahaber).Avg-Word2vec-based classification performance of the system is competitive although a much less number of features are employed.
In the last experiment, each document is represented by concatenation of TF-IDF and average of Word2vec vectors.Although the size of the feature set is increased, we observe better classification accuracies than the other representation  2, we obtain better classification accuracies on seven of the datasets except Milliyet in which accuracy is decreased from 94.09 to 93.9 (0.2%).It can be concluded that the inclusion of Avg-Word2vec vectors to TF-IDF vectors as additional features boosts classification success of the ensemble system.It is difficult to compare the performance of our results with other studies because of the lack of works with similar combinations of different datasets, representation models, and ensemble approaches.Although a comparison of baseline classifiers and heterogeneous ensemble systems with different representation techniques is given in this study, we also report the classification accuracies of a number of research works here.In [53], Pappagari et al. propose an end-to-end multiscale CNN framework for topic identification by employing 20 newsgroups and Fisher datasets.The classification success of our proposed system outperforms with 94.3% while the accuracy performance of the work [53] presents 86.1% classification accuracy.
In another study [54], a graph-based framework is presented by utilizing SVM for text classification problem.Two data representation models (TF-IDF and Word2vec) and datasets (20 newsgroups and WebKB) are studied in this work [54].20 newsgroup dataset with TF-IDF representation exhibits 83.0% classification accuracy while the Word2vec version of the same dataset presents 75.8% classification success in [54].The TF-IDF representation model of our study performs 91.5% and the Word2vec representation of our work achieves 88.5% classification accuracies for 20 newsgroup dataset.Moreover, their study [54] with WebKB dataset and TF-IDF representation exhibits 89.9% classification accuracy while the Word2vec version of the same dataset presents 86.6% classification success.The TF-IDF representation model of our work performs 91.2%, and the Word2vec representation of ours represents 91.7% classification accuracies for WebKB dataset.In [55], Zheng et al. propose a bidirectional hierarchy skip-gram model to mine topic information within given texts.They use CNN and SVM as classification algorithms and 20 newsgroups and WebKB as datasets like in our study.Our proposed system performs 91.9% classification success with CNN algorithm while their CNN implementation exhibits 79.2% for 20 newsgroups in [55].The proposed system of our study performs 91.5% classification success with SVM algorithm while SVM exhibits 85.9% for 20 newsgroups in [55].For the WebKB dataset, CNN implementation of our work performs 93.0% classification success while their CNN method exhibits 91.8%.With SVM algorithm on WebKB dataset, we obtain 92.1% classification success while their SVM implementation exhibits 88.1%.The proposed heterogeneous ensemble systems given in this study always perform well compared with other studies reported above in terms of classification accuracies.

Conclusion
In this paper, we focus to enhance the overall accuracy of a text classification system by using ensemble learning, deep 7 Complexity learning, and effective document representation methods.It is known that the classifier ensembles boost the overall classification performance by depending on two factors, namely, individual success of the base learners and the diversity of them.The different learning algorithms, namely, variants of two naïve Bayes (MVNB and MNB), support vector machine (SVM), random forest (RF), and currently popular convolutional neural networks (CNNs), are chosen to provide diversity for the ensemble system.The majority voting and stacking ensemble integration strategies are performed to consolidate the final decision of the ensemble system.Word embeddings are also utilized to raise the overall accuracy of text classification.Word embeddings can capture contextual, semantical, and syntactical information in a textual vicinity from documents from a corpus.
A set of experiments is performed on eight different datasets represented with different methods using an ensemble of classifiers MVNB, MNB, SVM, RF, and CNN.As a result of experiments, some of the main findings of this study are as follows: In the future, we plan to use different pretrained word embedding models, document representation methods using word embeddings, and other deep learning algorithms for text classification and natural language processing tasks.

4. 1 .
Datasets.We use eight different datasets with different sizes and properties to explore the classification performances of the heterogeneous classifier ensembles in Turkish 4 Complexity (i) RF and CNN are the best performing single classifiers among others.The order of classification accuracies of single classifiers can be summarized as RF > CNN > MNB > SVM > MVNB (ii) An ensemble of classifiers increases the classification accuracies of texts on different datasets that have different characteristics and distributions (iii) A set of heterogeneous ensemble of classifiers can provide slight performance increases in terms of accuracy when compared with homogenous ensemble of classifiers [12, 50] (iv) Stacking is a better ensemble integration strategy than majority voting (v) The inclusion of state-of-art deep learning CNN classifier to the set of classifiers of an ensemble system can provide further enhancement (vi) The use of pretrained word embeddings is an effective method to represent documents.It can be a good feature reduction method without losing much in terms of classification accuracy (vii) Inclusion of word embeddings to TF-IDF weighted vectors as additional features provides a further improvement in text classification because word embeddings can capture contextual, semantical, and syntactical information from text Complexityinformation.The pooling layer shortens the training time and reduces the dimensionality of data and overfitting.
[40]3.RF.Random forests (RF) are a collection of decision tree classifiers introduced by Breiman[40].It is a particular implementation of bagging in which decision trees are used as base classifiers.Given a standard training set, the bagging method generates a new training set by sampling with replacement for each of base classifiers.In standard decision trees, each node of the tree is split by the best feature among all other features using splitting criteria.To provide randomness at feature space, random forest algorithm first selects a random subset of features and then decides on the best split among the randomly selected subset of features.Random forests are strong against overfitting because of randomness applied in both sample and feature spaces.The values of filters are learned during the training process.Convolution operation captures local dependencies or semantics in the regions of original data.An additional activation function (usually RELU (rectified linear unit)) is applied to feature maps to add nonlinearity to CNN.After convolution, a pooling layer reduces the number of samples in each feature map and retains the most important 3

Table 1 :
Number of classes ( C ), documents ( D ), and words ( V ) in each document.

Table 2 :
Classification accuracies of single classifiers on datasets represented with TF-IDF weighting scheme.

Table 3 :
The list of classification accuracies of single classifiers and heterogeneous ensemble systems with majority voting and stacking integration strategies on datasets represented with TF-IDF weighting.

Table 4 :
The list of classification accuracies of single classifiers and heterogeneous ensemble systems with majority voting and stacking integration strategies on datasets represented with the average of Word2vec vectors.

Table 5 :
The list of classification accuracies of single classifiers and heterogeneous ensemble systems with majority voting and stacking integration strategies on datasets represented with TF-IDF and the average of Word2vec vectors.