Joint Character-Level Convolutional and Generative Adversarial Networks for Text Classification

With the continuous renewal of text classification rules, text classifiers need stronger generalization ability to process datasets with new text categories or small training samples. In this paper, we propose a text classification framework for conditions with insufficient training samples. In the framework, we first quantify the texts with a character-level convolutional neural network and input the textual features into an adversarial network and a classifier, respectively. Then, we use the real textual features to train a generator and a discriminator so that the distribution of the generated data becomes consistent with that of the real data. Finally, the classifier is trained cooperatively on real data and generated data. Extensive experiments on four public datasets demonstrate that our method performs significantly better than the comparative methods.


Introduction
Machine-learning models have achieved remarkable results in computer vision (CV), automatic speech recognition, and natural language processing (NLP). With the development of artificial neural networks, text classification has become one of the most intriguing fields of NLP [1,2]. Text classification refers to the division of the texts in a corpus into predefined categories based on their contents and other attributes. It is widely used in many applications [3][4][5][6][7][8][9], such as spam filtering, news categorization, sentiment analysis, and digital libraries.
As a typical supervised learning (SL) task, text classification requires abundant manually labeled samples for training. However, manually labeling vast amounts of text is difficult in practical applications, so the number of labeled samples is often insufficient. With small training sets, the data distribution is depicted poorly, which leads to poor generalization ability of the learned classifiers.
To improve the generalization ability of classifiers, a feasible solution is to improve the learning algorithm or to increase the number of training samples. One method worth mentioning is the support vector machine (SVM), which improves the generalization ability on unknown samples by learning the maximum-margin hyperplane between different categories. SVM alleviates the problem of small samples to some extent, but its high time complexity limits engineering applications, especially online information processing. Besides, the existence of massive unlabeled samples has gradually drawn the attention of scholars to semisupervised learning (SSL). Most semisupervised classification methods train the objective classifier through an initial classifier until it reaches the convergence condition. However, such methods have high computational complexity and may bring about large-scale sample problems.
Compared with the computationally expensive SSL, generative adversarial nets (GAN), proposed by Goodfellow et al. [10], provide an effective method for generating new data. In this paper, we employ GAN to generate textual samples according to the data distribution of the input samples and to label the generated samples. Furthermore, we use a Char-level CNN to extract the text semantics and utilize the textual generative network to increase the training samples. To summarize, our contributions are as follows:
(1) A novel Char-level CNN structure is designed for small-scale datasets to generate textual features. By optimizing the configuration of the convolutional and pooling layers, the output comprehensively inherits the text semantics. Besides, the dense network is designed deeper because the generated data reduces its risk of overfitting.
(2) A data augmentation module based on high-level semantics is proposed. The text semantics are quantified and directly input into the network as real data to obtain various textual features. The module not only avoids feature extraction on generated texts, which saves computing resources, but also describes the overall distribution characteristics of the data so that the classifier performs better on small-scale datasets.
The rest of this paper is structured as follows. Section 2 summarizes the related work on textual feature construction, text generation, and semisupervised learning. Section 3 details our model for text classification on small-scale datasets. Section 4 presents the experimental settings and the analysis of the results. Section 5 concludes our work and gives directions for further research.

Related Work
Compared with other media streams, the most distinctive aspect of text is that its semantic information is difficult to express. For subtasks in NLP, how to quantify abstract semantic information is particularly critical. Therefore, the final performance of text classification is jointly affected by the classification model and the feature representation method [11][12][13]. The mainstream representation methods for text classification can be roughly divided into three categories. In the first category, traditional textual features are mainly generated through methods such as bag of words (BOW) and n-gram. Both are usually combined with term frequency-inverse document frequency (TF-IDF) and other element features as textual features. However, the major drawback of these methods is that they ignore the context and order of words. The second category is based on attention mechanisms, commonly hierarchical attention and self-attention, which extract textual features by scoring input words or sentences differentially. In the third category, text is represented by sequence or structured models through the introduction of artificial neural networks. In addition to the above three categories, some pretrained models, such as XLNet and BERT, have also been proposed for NLP. However, these pretrained models require large amounts of labeled textual data and are not suitable for new categories and small samples of textual data.
Among the text representation methods based on sequence or structured models, Bengio et al. [14] first tried to use neural networks to produce dense, low-dimensional vectors for words. Mikolov [15] proposed a language model, called Word2Vec, which transforms each word into a vector according to its context. Combined with clustering algorithms, the model can take representative words as the representation of a text. Besides, textual features can be obtained by simply combining word embeddings to represent sentences or texts, and the linear model FastText [16] is widely used in text classification in this way. Kim [17] applied the convolutional neural network to classification and obtained textual features by processing the matrix formed by word embeddings. Kalchbrenner et al. [18] proposed a convolutional architecture, dubbed the dynamic convolutional neural network (DCNN), for sentence modeling. Zhang et al. [19] proposed the use of convolutional networks for text classification at the character level, but their network structures only work well on large-scale datasets. Then, for the structural configuration of convolutional neural networks, Le et al. [20] studied the importance of depth in convolutional models. At the same time, recurrent neural networks are also applied to language models due to their memory capacity and Turing completeness. Chung et al. [21] compared different types of recurrent units, especially gating mechanisms such as long short-term memory (LSTM) and the gated recurrent unit (GRU). Zhu et al. [22] attempted to build structured representations using prespecified parsing trees. Recently, Zhang et al. [23] proposed a reinforcement learning (RL) method to obtain structured sentence vectors. Note that before the abovementioned methods were proposed, some traditional machine-learning algorithms were widely used in text classification, for example, k-nearest neighbor (KNN), which classifies by measuring the distance between features; the decision tree, which combines information entropy with a tree structure; and Naive Bayes, which is based on Bayes' theorem and the assumption of conditional independence between features. However, these traditional machine-learning algorithms have been shown to perform worse than deep-learning-based methods on the text classification task.
To solve the problem of insufficient training samples, many data augmentation and semisupervised learning methods can be used to improve the performance of classifiers. Wei and Zou [24] presented the easy data augmentation (EDA) technique for boosting performance on text classification tasks. Although EDA reduces overfitting when training on smaller datasets, the improvement is at times marginal. Wang and Wu [25] proposed a framework that combines a variational autoencoder (VAE) with neural networks to deal with text classification and generation tasks. GAN [10,26] was first proposed for continuous data (image generation, inpainting, style transfer, etc.) and has shown excellent performance in CV. Yu et al. [27] extended GAN to discrete and sequential data to alleviate this limitation. Since then, various GAN-based text generation methods have been proposed. Xu et al. [28] proposed a text generation model called DP-GAN, which encourages the generator to produce diverse and informative text. Li et al. [29] combined reinforcement learning, GAN, and recurrent neural networks to build a category sentence generative adversarial network. Miyato et al. [30] extended adversarial and virtual adversarial training to the text domain. Ahamad [31] also tried to solve the above problem by using Skip-Thought sentence embeddings in conjunction with GANs. Although these methods utilize text generation and feature reconstruction to alleviate the problem of small-scale datasets, the classification performance is still difficult to improve further because of the heavy feature engineering and the difficulty of textual feature extraction.

Methodology
In this section, character-level convolutional and generative adversarial networks (CCNN-GAN) are combined into a novel hybrid text classification framework, shown in Figure 1. In contrast to implementations for continuous data, the textual data in our model is modeled by convolutional networks and character quantization. The Char-level CNN embeds the texts in the corpus into fixed-length features, and the features are then input into the generative adversarial network (GAN) and the backpropagation network (BP network), respectively. The generated data not only enriches the textual features but also effectively alleviates the problems of insufficient samples and limited information when dealing with small-scale datasets. After that, the real samples and the samples generated by the GAN are mixed and fed into the BP network for training. The design is modular, and the text information is transmitted between modules in the form of processed features.
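The final mixing step of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `mix_real_and_generated`, the representation of each sample as a `(feature_vector, label)` pair, and the seeded shuffle are all hypothetical choices made for the example.

```python
import random

def mix_real_and_generated(real, generated, seed=0):
    """Merge labeled real and generated features into one shuffled training set.

    Each item is a (feature_vector, label) pair; feature vectors are
    fixed-length lists of floats (assumed layout, for illustration only).
    """
    mixed = list(real) + list(generated)
    # Shuffle so that every mini-batch sees both real and generated features.
    random.Random(seed).shuffle(mixed)
    return mixed
```

Because the generated items already carry category labels, the classifier can consume the mixed list exactly as it would consume real features alone.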

Text Quantization Module.
The encodings accepted by the Char-level CNN include alphabetic encoding, UTF-8 encoding, and pretrained character embedding vectors. Since the proposed model is mainly used for alphabetic languages, predominantly English, alphabetic encoding is applied in the text quantization process. In the embedding layer, the features of the encoded characters are used as input. An alphabet of size α is stipulated; then, an embedded dictionary is created and an embedding matrix is formed based on the alphabet. The null character and characters that do not exist in the alphabet are replaced by an all-zero vector. The quantization length of the character feature is set to β. Following [19], we assume that the first β characters of a text reflect its content, and the part that exceeds length β is ignored. A foundational alphabet (α = 45) and an elaborate alphabet (α = 70 + α_0) are stipulated, where α_0 is the number of auxiliary characters; the details are shown in Table 1.
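The quantization described above can be sketched in a few lines of plain Python. The sketch rests on assumptions: the eight symbols in `ALPHABET` are placeholders (the actual set is given in Table 1), lowercasing the input is an assumption borrowed from common character-level practice, and `quantize` is a hypothetical name. A real implementation would typically use a tensor library.

```python
# Hypothetical foundational alphabet: 26 letters, 10 digits, 8 symbols.
# The real symbol set is listed in the paper's Table 1 (not reproduced here).
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789" + ".,;:!?'\""
ALPHA = len(ALPHABET) + 1  # +1 for the null character, giving 45
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}  # index 0 = null

def quantize(text, beta):
    """Encode the first `beta` characters of `text` as beta one-hot rows.

    Characters outside the alphabet and padding positions stay all-zero,
    and everything beyond position beta is ignored.
    """
    rows = [[0] * ALPHA for _ in range(beta)]
    # Lowercasing is an assumption of this sketch, not stated in the text.
    for pos, ch in enumerate(text.lower()[:beta]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:
            rows[pos][idx] = 1
    return rows
```

For instance, `quantize("Ab c", beta=6)` yields a 6 × 45 matrix in which only the rows for "a", "b", and "c" contain a one; the space and the two padding rows remain zero.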
The foundational alphabet used in the proposed model consists of 45 characters: 36 English letters and Arabic numerals, 8 other characters, and the null character. This alphabet is applicable to most documents. For corpora with a high proportion of symbols, the foundational alphabet is supplemented by the elaborate alphabet, which consists of 25 symbol characters and a variable number of auxiliary characters. The Char-level CNN consists of 8 convolutional layers, 3 pooling layers, and 3, 4, or 5 fully connected layers. The configuration of the convolutional neural network is shown in Table 2 and Figure 2. The classifier, which is composed of the fully connected layers, is discussed in detail in Section 3.3. The Char-level CNN computes 1D convolutions, and its weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.05. The pooling layers are applied between the convolutional layers to enlarge the area covered by the subsequent receptive fields.
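To make the building blocks concrete, the following sketch shows a single-kernel 1D convolution without padding, non-overlapping max-pooling, and Gaussian weight initialization with mean 0 and standard deviation 0.05. The function names, the kernel width used in the usage note, and the non-overlapping pooling stride are assumptions; the actual layer configuration is the one in Table 2.

```python
import random

def init_kernel(width, in_channels, mean=0.0, std=0.05, seed=0):
    """Draw a (width x in_channels) kernel from a Gaussian distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, std) for _ in range(in_channels)]
            for _ in range(width)]

def conv1d_valid(x, kernel, bias=0.0):
    """1D convolution of x (length x channels) with one kernel, no padding.

    Without padding, the output has len(x) - len(kernel) + 1 entries.
    """
    width = len(kernel)
    out = []
    for start in range(len(x) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(a * w for a, w in zip(x[start + offset], kernel[offset]))
        out.append(total)
    return out

def max_pool1d(values, size):
    """Non-overlapping max-pooling over windows of `size` entries."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]
```

A kernel for the quantized input could be created with `init_kernel(7, 45)`, where the width 7 is purely illustrative. Each pooling step shortens the sequence, so every unit in the next convolutional layer covers a wider span of the original characters.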
In addition, the rectified linear unit (ReLU) is taken as the activation function in the classifier, and local response normalization (LRN) [32] is added behind each pooling layer. LRN imitates the lateral inhibition mechanism of biological neural systems and improves the generalization ability of the model. The function is as follows:
$$\hat{P}^{i} = \frac{P^{i}}{\left(k + a \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(P^{j}\right)^{2}\right)^{b}},$$
where P is the tensor obtained after pooling, \hat{P} is the normalized tensor, i and j index the ith and jth kernels, k, n, a, and b are hyperparameters, and N is the total number of kernels.
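A plain-Python sketch of LRN at a single position is given below. It divides the response of kernel i by (k + a · Σ_j (P^j)²)^b, where the sum runs over the n neighbouring kernels around i, clipped to the range 0 to N − 1. The default hyperparameter values are the ones commonly quoted for the original LRN formulation and are an assumption here; the values used in this paper are not stated.

```python
def local_response_norm(p, k=2.0, n=5, a=1e-4, b=0.75):
    """Normalize kernel responses at one position across neighbouring kernels.

    p: list of N responses, one per kernel, taken after pooling.
    k, n, a, b: LRN hyperparameters (defaults are assumed, not from the paper).
    """
    num_kernels = len(p)
    half = n // 2
    out = []
    for i in range(num_kernels):
        lo = max(0, i - half)
        hi = min(num_kernels - 1, i + half)
        # Sum of squared responses of the neighbouring kernels, including i.
        energy = sum(v * v for v in p[lo:hi + 1])
        out.append(p[i] / (k + a * energy) ** b)
    return out
```

A kernel whose neighbours respond strongly is scaled down more than one with quiet neighbours, which is the lateral-inhibition effect mentioned in the text.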

Data Augmentation Module.
Different from the typical RL setting, the data augmentation module enriches the predetermined corpus at the semantic level. Specifically, the module is an adversarial network in which the generator directly generates many textual features with the same distribution as the processed real-world texts, and the discriminator takes the convolutional textual features as real data. Thus, given the processed textual features as input, the adversarial network can produce a variety of generated features that contain diverse and informative text semantics.
To optimize the performance of the overall framework, the data augmentation module is simplified. The output of the module is not in the form of sentences or documents but in the form of textual features that contain semantic information. Consequently, the adversarial network no longer needs to be connected to a structure that converts textual features into sentences or documents. During network training, the processed features and their categories are both input into the adversarial network so that it can output textual features belonging to each category.
More formally, the input data $X_{1:m} = \{X_1, X_2, \ldots, X_m\}$ of m categories comes from the processed corpus Γ, and the high-level textual features are denoted as
$$X_i = \{x_{i,1}, x_{i,2}, \ldots\},$$
where $x_{i,j}$ refers to the jth textual feature of category i. The generative network outputs multiple categories of labeled textual features $X^{*}_{1:m} = \{X^{*}_1, X^{*}_2, \ldots, X^{*}_m\}$, where $X^{*}_i$ refers to the generated features of category i. The representation of the corpus in Section 3.1 shows that the data points $x_{i,j}$ are independent and identically distributed samples from the real-world distribution $p_{data}(x)$. The generative network learns a generator distribution $p_g(x)$ that gradually approximates $p_{data}(x)$. The value function of the conditional GAN is defined as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y)))],$$
where x is the real textual feature, y is the category label corresponding to feature x, z is random noise, and $p_z(z)$ is the noise distribution. Through different settings of y, the textual features of different categories can be obtained in the process of text generation. Note that GAN is designed for continuous data, that is, it can only be constructed from differentiable functions. When processing discrete data, it is difficult to pass the gradient from the discriminator back through the discrete outputs, so the generator cannot be updated directly. There are many feasible methods to solve this problem. This paper uses the policy gradient [27] to extend the adversarial network to this setting. The generator and the discriminator are trained alternately, and the details of the adversarial model are as follows.
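The conditional value function can be estimated from discriminator outputs as in the sketch below. It is only an illustration: feeding the label y by concatenating a one-hot vector is an assumed conditioning scheme, and the helper names `one_hot`, `condition`, and `value_function` are hypothetical.

```python
import math

def one_hot(label, num_categories):
    """One-hot encoding of a category label y."""
    return [1.0 if i == label else 0.0 for i in range(num_categories)]

def condition(vector, label, num_categories):
    """Append the one-hot label y to a noise or feature vector (assumed scheme)."""
    return list(vector) + one_hot(label, num_categories)

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x|y) on real features
    d_fake: discriminator outputs D(G(z|y)) on generated features
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

The discriminator is trained to increase this estimate and the generator to decrease it. When the discriminator cannot tell the two sources apart, every output is 0.5 and the estimate equals 2 log 0.5.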

3.2.1. The Generator for Textual Data.
Recurrent neural networks (RNNs) are widely used in NLP because of the sequential, discrete nature of textual data, and their gated variants are designed to alleviate vanishing and exploding gradients in backpropagation. The GRU is set as the generative network, and its structure is shown in Figure 3. The update gate $z_t$ controls how much of the previous state information is brought into the current state. The higher the value of $z_t$, the more state information is brought in from the previous moment. The function of the update gate is as follows:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]),$$
where σ is the sigmoid function and $[h, x]$ is vector concatenation. The reset gate $r_t$ controls how much information from the previous state is written to the current candidate state $\tilde{h}_t$. The smaller $r_t$ is, the less information is written from the previous state. The function of the reset gate is as follows:
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]).$$
Since each unit has its own reset and update gates, each hidden unit learns dependencies on different scales. The units that learn to capture short-term dependencies tend to have reset gates that are frequently active. The candidate state and the hidden state are computed as
$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]),$$
$$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,$$
where ⊙ is the elementwise product. We use random noise as the input of the generator to construct the mapping from the noise space to the text semantic space.
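One step of a GRU cell can be sketched as follows. The sketch follows the convention stated in the text, in which a larger update gate keeps more of the previous state; bias terms are omitted, and the function names and weight layout are assumptions made for the example.

```python
import math

def sigmoid(v):
    """Elementwise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(w, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def gru_step(h_prev, x, w_z, w_r, w_h):
    """One GRU step; each weight matrix has shape hidden x (hidden + input).

    Biases are omitted to keep the sketch short (an assumption).
    """
    z = sigmoid(matvec(w_z, h_prev + x))            # update gate z_t
    r = sigmoid(matvec(w_r, h_prev + x))            # reset gate r_t
    gated = [ri * hi for ri, hi in zip(r, h_prev)]  # r_t ⊙ h_{t-1}
    h_cand = [math.tanh(a) for a in matvec(w_h, gated + x)]  # candidate state
    # A larger z keeps more of the previous state, as described in the text.
    return [zi * hp + (1.0 - zi) * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]
```

To use the cell as a generator, the noise vector (optionally combined with a category label) would be supplied as the input `x`, and the hidden states would be read out as the generated textual feature.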

The Discriminator for Data Screening.
Since both the real text and the generated text can be quantified as features of fixed length, a convolutional network is constructed to discriminate the source of the text. The textual feature $x_1, x_2, \ldots, x_n$ is processed as follows:
$$s_i = f(w \otimes x_{i:i+k-1} + b),$$
where f(·) is a nonlinear function, $x_i$ is the l-dimensional textual feature, $x_{i:i+k-1}$ is the window from $x_i$ to $x_{i+k-1}$, ⊗ is the convolution operation, $w \in \mathbb{R}^{k}$ is a 1D kernel that produces a new feature map, and b is a bias term. Kernels with different window sizes are used to extract different features. Specifically, the textual feature extracted by the kernel w with window size k is represented as follows:
$$s = [s_1, s_2, \ldots, s_{n-k+1}].$$
Finally, the max-pooling operation $\tilde{s} = \max\{s\}$ is performed on the feature map, and all pooled features from the different kernels are passed to a fully connected softmax layer to obtain the probability that a given feature is real. When optimizing the discriminative model, supervised training is applied to minimize the cross-entropy, and the objective function is as follows:
$$H(p, q) = -\sum_{x} p(x) \log q(x),$$
where p(x) is the real label of the textual features and q(x) is the predicted probability from the discriminator.
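The three operations of the discriminator (sliding a 1D kernel, max-pooling over the feature map, and the cross-entropy objective) can be sketched as below. For brevity the input is treated as a flat sequence of scalars, ReLU is assumed for the nonlinearity f, and the small constant `eps` is added only to avoid log(0); all three are simplifications of this sketch rather than details from the paper.

```python
import math

def feature_map(x, w, b=0.0, f=lambda v: max(0.0, v)):
    """Slide a 1D kernel w of window size k over x; f is the nonlinearity."""
    k = len(w)
    return [f(sum(wi * xi for wi, xi in zip(w, x[i:i + k])) + b)
            for i in range(len(x) - k + 1)]

def max_over_time(s):
    """Max-pooling over the whole feature map of one kernel."""
    return max(s)

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) log q(x) for a real label p and a prediction q."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))
```

With p = [1, 0] marking a real feature and q = [0.5, 0.5] as the prediction, the loss is log 2; it falls toward zero as the discriminator assigns more probability to the correct source.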

Classifier Module.
The classifier constructed from dense networks is a multilayer feedforward network based on the error backpropagation algorithm. The principle is to compute the difference between the actual output and the expected output and to propagate it backward layer by layer, so that the network adjusts its weights according to the difference. The real features $X_{1:m}$ and the generated features $X^{*}_{1:m}$ are input to the network for training. To explore the optimal network structure, dense networks consisting of three to five fully connected layers are constructed, as shown in Figure 4. Besides, dropout modules with a dropout probability of 0.5 are inserted between the fully connected layers to prevent overfitting. ReLU is used as the activation function; the function and its derivative are as follows:
$$f(x) = \max(0, x), \qquad f'(x) = \begin{cases} 1, & x > 0, \\ 0, & x \le 0. \end{cases}$$
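A forward pass through such a dense classifier can be sketched as follows. The sketch makes several assumptions: dropout is applied after the ReLU of each hidden layer in its "inverted" form (surviving units are rescaled), the derivative of ReLU at zero is taken to be 0 by convention, and all function names are hypothetical.

```python
import random

def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)

def relu_derivative(x):
    """Derivative of ReLU; the value at x = 0 is set to 0 by convention."""
    return 1.0 if x > 0 else 0.0

def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def dropout(values, p=0.5, rng=None, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the rest."""
    if not training:
        return list(values)
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in values]

def forward(features, layers, training=False):
    """layers: list of (weights, biases); hidden layers use ReLU and dropout."""
    h = features
    for weights, biases in layers[:-1]:
        h = dropout([relu(v) for v in dense(h, weights, biases)],
                    training=training)
    weights, biases = layers[-1]
    return dense(h, weights, biases)  # one score per category
```

During backpropagation, `relu_derivative` gates the error signal: units that were inactive in the forward pass receive no gradient.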

Experiments
In this section, we detail the experimental content and settings, including the datasets, the baselines, the parameter settings, and the analysis of the experimental results. To demonstrate the advantages of the proposed method, we not only compare it with different text classification methods but also compare the factors that influence our method. In this way, we further study the feasibility of the method through comparative experiments.

Datasets.
We use four corpora to evaluate the proposed framework:
AG-News: the original AG-News has over one million news articles collected from over 2,000 different news sources. We extract the data from the four largest categories in the original dataset, namely world, environment, sport, and business news; each category contains 30,000 samples for training and 1,900 for testing. The dataset can be downloaded from http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
DBPedia [33]: the DBPedia ontology dataset is composed of 14 nonoverlapping categories from Wikipedia. In the experiment, we adopt the updated corpus by Zhang et al. Each category contains 40,000 training samples and 5,000 testing samples.
20NG: the 20 newsgroups dataset is one of the international standard datasets for text classification, text mining, and information retrieval research. The dataset collects about 20,000 newsgroup documents that are evenly divided into 20 newsgroups on different topics. Some newsgroups are dedicated to similar subjects, and some are completely unrelated. The dataset can be downloaded from http://qwone.com/~jason/20Newsgroups.
IMDB: the dataset is widely used for binary sentiment classification of movie reviews. It provides 25,000 highly polar movie reviews for training, 25,000 for testing, and additional unlabeled data. The dataset comes from http://ai.stanford.edu/~amaas/data/sentiment.
Note that most open datasets for text classification required substantial resources for manual labeling. To simulate real-world small-scale datasets, we randomly extract parts of these datasets in the experiments. The details of the datasets are shown in Table 3.

Baselines.
To make the experimental comparison more comprehensive and objective, we reproduce mainstream text classification methods, such as FastText, DPCNN, LEAM, and Virtual Adversarial. The details of these methods are as follows:
Tree-LSTM [34]: the model proposed by Tai et al. extends the sequential LSTM to tree structures; that is, through the forget gate mechanism of LSTM, it can skip (or ignore) a whole subtree that has little effect on the result, rather than just some subsequences that may have no linguistic significance.
Self-Attentive [35]: a model for extracting an interpretable sentence embedding by introducing self-attention. The method uses a 2D matrix to represent the embedding and proposes a self-attention mechanism and a special regularization term for the model.
Emb-CNN [17]: the model is a slight variant of the CNN architecture of Collobert et al. It shows that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. Kim additionally proposes a simple modification to the architecture to allow for the use of both task-specific and static vectors.
Char-CNN [19]: the method treats text as a kind of raw signal at the character level and applies temporal ConvNets to it. The most important conclusion from the method is that character-level ConvNets can work for text classification without the need for word embeddings.
Char-CRNN [36]: an architecture that utilizes both convolutional and recurrent layers to efficiently encode character inputs. Compared with character-level convolution-only models, it can achieve comparable performance with far fewer parameters.
FastText [16]: a linear model with a rank constraint and a fast loss approximation that is often on par with deep-learning classifiers in terms of accuracy while being many orders of magnitude faster in evaluation.
L-MIXED [37]: a training strategy showing that even a simple BiLSTM model trained with cross-entropy loss can achieve competitive results compared with more complex methods. In addition to the cross-entropy loss, by using a combination of entropy minimization, adversarial, and virtual adversarial losses for both labeled and unlabeled data, the method can also perform very well.
DPCNN [38]: a low-complexity word-level deep convolutional neural network architecture for text classification that can efficiently represent long-range associations in text. Johnson et al. studied the deepening of word-level CNNs to capture global representations of text and found a simple network architecture with which the best accuracy can be obtained by increasing the network depth without increasing the computational cost by much.
LEAM [39]: a method that treats text classification as a label-word joint embedding problem in which each label is embedded in the same space as the word vectors. It maintains the interpretability of word embeddings and has a built-in ability to leverage alternative sources of information in addition to the input text sequences.
Ad-Training [30]: the framework extends adversarial and virtual adversarial training to the text domain. The method applies perturbations to the word embeddings in recurrent neural networks rather than to the original input itself.
Text GCN [40]: the model is initialized with one-hot representations for words and documents, and it then jointly learns the embeddings for both words and documents, supervised by the known class labels of the documents.

Implementation Details.
The 5/10-fold cross-validation is applied to each dataset. The reduced training sets are denoted Part-i (i = 1, 2, 3, ..., 8), where the number of texts is 200, 400, 800, 1,500, 2,500, 4,000, 6,000, and 10,000, respectively. To reduce the randomness introduced by sample selection, we use the same test set for training sets of all sizes. The whole test set is used to evaluate the performance on the 20NG dataset, and each of the other datasets is tested with 10,000 samples. Note that, on IMDB, all the unlabeled samples are always provided to the semisupervised methods. The baselines used in this paper are reproduced with parameters set basically according to the original literature. The special cases are as follows: L-MIXED has two objective functions; the cross-entropy loss L_ML is adopted on the supervised datasets, and the mixed loss L_MIXED is adopted on IMDB. DPCNN uses the unsupervised embeddings obtained by tv-embedding training. Adversarial Training has several training strategies; the virtual adversarial method based on a unidirectional LSTM is used on the supervised datasets, and the bidirectional LSTM with virtual adversarial training is used on IMDB.
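The construction of the reduced training sets and the cross-validation splits can be sketched as follows. The sketch assumes that each Part-i set is drawn independently and uniformly at random from the full training set; whether the original subsets were nested or class-balanced is not stated. The names `make_part_datasets` and `k_fold_indices` are hypothetical.

```python
import random

PART_SIZES = [200, 400, 800, 1500, 2500, 4000, 6000, 10000]

def make_part_datasets(train_samples, sizes=PART_SIZES, seed=0):
    """Randomly draw one reduced training set per size (Part-1 to Part-8)."""
    rng = random.Random(seed)
    return {f"Part-{i}": rng.sample(train_samples, size)
            for i, size in enumerate(sizes, start=1)}

def k_fold_indices(num_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for k-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

The test set is deliberately left out of both helpers: it is fixed once and shared by all Part-i training sets, so differences in accuracy reflect the training size rather than the test sample.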
For our method, the elaborate alphabet length is set to 25, the convolution operation is set to "valid", the pooling operation is max-pooling, and β = 1454. The number of output units of the last layer is determined by the problem; for example, it is 14 for DBPedia. All the other fully connected layers have 4096 units. During training, CCNN-GAN without the data augmentation module is trained first. Then, the parameters of the Char-level CNN are fixed, and the data augmentation module is introduced to conduct incremental training of the classifier.

Results and Analysis.
We analyze in detail the effect of different model settings, including the size of the alphabet, the number of generated features, and the structure of the classifier network. Besides, we compare the performance of the proposed method with those of the representative methods on the benchmark datasets.

Model Parameter Analysis.
We use AG-News to analyze the impact of different settings. The alphabet size directly affects the running time and efficiency of the classification method. We employ the foundational alphabet and the elaborate alphabet with the auxiliary symbols, respectively. As shown in Figure 5, the classification accuracy of the proposed method improves significantly as the dataset grows. When the scale increases to Part-6, the growth of the classification accuracy flattens, which indicates that this subset has a representational ability similar to that of the complete dataset.
However, the classification accuracy of the synthetic alphabet (the foundational alphabet plus the elaborate alphabet) is similar to that of the foundational alphabet. Although the elaborate alphabet increases the representativeness of the textual features, the improvement is negligible. Besides, as the alphabet grows, the network training time increases significantly.
To find the optimal network settings, we adjust the classifier structure and the number of samples produced by the generative network, respectively. First, the number of fully connected layers and the number of neurons in each layer are adjusted to find the optimal configuration. Then, we adjust the quantity of generated texts and analyze the impact of the generated samples on the accuracy. We compare the effects of these variables on AG-News. As shown in Figure 6, increasing the number of fully connected layers improves the accuracy of the classifier, but the change is not obvious.
When the number of real-world samples is small, the proposed framework can improve the performance of the classifier. However, when the number of generated samples continues to increase beyond a certain level, the classifier is not significantly improved. A possible reason is that the scale of the original data limits the performance of the generative network, which further limits the generation space of the text and finally affects the abstract semantic space of the samples.

Comparison of Different Methods.
We test different methods on the datasets described in Section 4.1. To improve efficiency while preserving accuracy, CCNN-GAN is configured as follows: the foundational alphabet is used, the number of fully connected layers is three, and the number of generated textual features equals the number of real samples. CCNN-GAN is superior to all the comparison methods when dealing with small-scale datasets, including the state-of-the-art methods, which indicates that the generated features substantially improve the training of the classifier. Classification methods that perform well on large-scale datasets, such as L-MIXED and DPCNN, lose their advantages on small-scale datasets. The reason is that fewer samples lead to overfitting of the deeper networks. The unlabeled data provide useful information for semisupervised learning, so the semisupervised methods show good results on IMDB. As shown in Figure 7, the classification accuracies of the various semisupervised models are similar, and our method performs better than these semisupervised models on small-scale datasets. Note that, on the Part-1 dataset, the classification accuracy of Ad-Training is slightly higher than that of the proposed method, but it uses a large amount of unlabeled real data in the training process. Besides, CCNN-GAN achieves the optimal or suboptimal performance on datasets of various sizes, which indicates that the method has strong generalization ability.
For readability, the best results are shown in bold. As shown in Tables 4 and 5, the accuracy of the classifier improves as the training data increase. When the dataset is small, our method performs better than the other methods, but its accuracy improves more slowly as the dataset size increases. The reason is that the generated textual features enrich the dataset, which indirectly expands it and reduces the importance of the original data. Besides, the classification accuracy approaches saturation once the dataset expands to a certain extent.
Overall, the experimental results show that the classifier based on deep neural networks achieves excellent performance in multiclass text classification. Although the proposed method is less competitive than the state-of-the-art methods on some large-scale datasets, CCNN-GAN is better than all the comparison methods in various small-scale datasets. Besides, our method inherits the advantages of the previous character-level convolutional network and makes it easier to adapt to multiple languages by updating the alphabet freely.

Conclusion
In this paper, we propose a hybrid neural network framework for text classification. Our framework introduces generative networks to enrich the corpus and utilizes a character-level convolutional network to extract latent semantics. Experimental results show that the framework is competitive with other mainstream methods on large-scale datasets and performs significantly better than them on small-scale datasets. In the future, we intend to improve the output of the generative network and further enrich the generated text semantics.

Data Availability
The data supporting this paper are from the reported studies and datasets in the cited references.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.