A Compressive Sensing Model for Speeding Up Text Classification

Text classification plays an important role in various applications of big data by automatically classifying massive text documents. However, high dimensionality and sparsity of text features have presented a challenge to efficient classification. In this paper, we propose a compressive sensing- (CS-) based model to speed up text classification. Using CS to reduce the size of feature space, our model has a low time and space complexity while training a text classifier, and the restricted isometry property (RIP) of CS ensures that pairwise distances between text features can be well preserved in the process of dimensionality reduction. In particular, by structural random matrices (SRMs), CS is free from computation and memory limitations in the construction of random projections. Experimental results demonstrate that CS effectively accelerates the text classification while hardly causing any accuracy loss.


Introduction
With the advancement of information technology over the last decade, digital resources have penetrated into all fields in our society, generating big data, which present a new challenge to data mining and information retrieval [1]. Texts are very common in daily life, and, with their large numbers, it remains an open question to organize and manage them [2]. As one of the fundamental techniques in natural language processing (NLP), text classification means assigning labels or categories to texts according to the content, and it is key to solving the problem of text overloads [3]. In its broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection, text classification provides support for the efficient query and search of texts, attracting a lot of attention from both academia and industry [4,5].
Word matching (WM), the simplest method in text classification, determines the category of a text by the categories of most words in the text [6]. However, due to the ambiguity of word meaning, WM fails to provide satisfactory accuracy. By representing words as vectors, the vector space model (VSM) [7] improves the accuracy of text classification and has thus replaced WM as the popular method, but the model requires many rules and great effort from professionals in labeling texts, which is costly. As machine learning (ML) [8] continues to develop, the accuracy of text classification has been further improved. By extracting features from a text to train a classifier, ML reforms VSM and avoids rule-based inference. Recently, the rapidly developing deep learning (DL) [9], which is a branch of ML, has made text classification more efficient. However, the high dimensionality and sparsity of text features pose a challenge to ML, restricting the practical use of ML-based text classification.
In ML, many classifiers can be used to classify texts, such as support vector machine (SVM) [10], decision tree [11], adaptive boosting (AdaBoost) [12], K-nearest neighbor (KNN) [13], and Naïve Bayes [14]. To train these classifiers, texts must be represented as feature vectors by a feature extraction model, the most common of which is Bag of Words (BOW) [15]. BOW uses the term frequencies of n-grams in the vocabulary constructed by N-Gram [16] to encode every text. Because the vocabulary may run into millions of entries, BOW faces the curse of dimensionality; that is, it produces a sparse representation with a huge dimensionality, which makes training classifiers impractical. Therefore, dimensionality reduction (DR) is used to reduce the size of the feature space. In DR, the most common techniques, including principal component analysis (PCA) [17], independent component analysis (ICA) [18], and nonnegative matrix factorization (NMF) [19], still introduce time and memory costs because they require model training. Many DL networks use an autoencoder to compress the size of parameters. An autoencoder is a neural network that is trained to attempt to copy its input. Some popular architectures include the sparse autoencoder [20], denoising autoencoder [21], and variational autoencoder [22]. Internally, they have a hidden layer that describes a code used to represent the input. By being embedded into the neural network, the autoencoder can end up learning a low-dimensional representation very similar to that of PCA.
Compared with the above-mentioned DR techniques, random projection [23,24] is a better choice, since it avoids model training, but storing the random projections remains a challenge due to the huge dimensionality of text features. Compressive sensing (CS) [25][26][27], which has been developing rapidly in recent years, can be regarded as a random projection technique designed for sparse vectors, and it shows that a sparse vector can be recovered perfectly from a small number of random projections. CS retains the advantages of random projection in DR and further overcomes the memory problem with the help of structural random matrices (SRMs) [28,29], which makes CS a potential DR technique for text classification. In view of the merits of CS, we use it to speed up the training of text classifiers in this paper. For low time and memory complexity, SRMs are selected as CS measurement matrices to reduce the size of the sparse feature vector. Experimental results demonstrate that CS effectively accelerates text classification while hardly causing any accuracy loss. The rest of this paper is organized as follows. Section 2 briefly reviews text classification and CS theory. Section 3 describes the CS model for text classification in detail. Section 4 presents experimental results, and finally Section 5 concludes this paper.

Text Classification.
Given a text dataset $D = \{d_1, d_2, \ldots, d_L\}$ of $L$ documents and a set $C = \{c_1, c_2, \ldots, c_J\}$ of $J$ predefined categories, the goal of text classification is to learn a mapping $f$ from inputs $d_i \in D$ to outputs $c_j \in C$. If $J = 2$, it is called binary classification; if $J > 2$, it is called multiclass classification. The mapping $f$ is called the classifier, and it is trained on a labeled dataset, in which each document in $D$ has been assigned a category from $C$ by professionals in advance. The trained classifier $f$ is used to make predictions on new documents that are not included in $D$. Because of the subjectivity of text labeling, a test dataset is still needed to evaluate the prediction accuracy of $f$.
A typical flow of text classification is illustrated in Figure 1. In text preprocessing, we tokenize each document in $D$, erase punctuation, and remove unnecessary words such as stop words, misspellings, and slang. To reduce the size of the vocabulary from $D$, some operations, e.g., capitalization, lemmatization, and stemming, can also be added. After text preprocessing, feature extraction is performed to represent the documents in $D$ as feature vectors, which is a crucial step for the accuracy and complexity of text classification. By N-Gram, we collect n-grams from $D$ as the vocabulary of the BOW model. It is very common to use unigrams and bigrams, where a unigram is a single word and a bigram is a word pair. Each document in $D$ is encoded as a feature vector based on the frequency distribution of its n-grams over the BOW vocabulary. The size of the feature vector is the same as that of the BOW vocabulary, resulting in the huge dimensionality of the feature space. By using DR techniques, the dimensionality can be significantly decreased, reducing the time complexity and memory consumption of training the classifier. The feature vector of a document is also highly sparse because the number of its n-grams is far smaller than the size of the BOW vocabulary. The high sparsity makes it possible to realize DR by CS without loss of classification accuracy. Compared with the traditional DR methods, CS not only avoids the computations invested in model training but also reduces the memory burden of constructing random projections. In this paper, we use CS to reduce the feature dimensionality and evaluate its effectiveness in speeding up text classification.
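As an illustration only, the BOW encoding with unigrams and bigrams described above can be sketched in Python as follows. The function names, the toy documents, and the vocabulary sizes are assumptions chosen for brevity and are not taken from the experiments in this paper; the sparse feature is stored as a dictionary from vocabulary index to term frequency.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(docs, top_unigrams, top_bigrams):
    """Keep the most frequent unigrams and bigrams of the corpus."""
    uni, bi = Counter(), Counter()
    for tokens in docs:
        uni.update(ngrams(tokens, 1))
        bi.update(ngrams(tokens, 2))
    terms = [t for t, _ in uni.most_common(top_unigrams)]
    terms += [t for t, _ in bi.most_common(top_bigrams)]
    return {term: idx for idx, term in enumerate(terms)}

def bow_vector(tokens, vocab):
    """Sparse BOW feature: {vocabulary index: term frequency}."""
    counts = Counter(ngrams(tokens, 1) + ngrams(tokens, 2))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}

# Toy, already preprocessed documents (hypothetical data).
docs = [
    "the movie was not good".split(),
    "the movie was good".split(),
    "not a good day".split(),
]
vocab = build_vocabulary(docs, top_unigrams=5, top_bigrams=3)
x0 = bow_vector(docs[0], vocab)
print(len(vocab), x0)
```

Only the nonzero entries are stored, which reflects the observation that a document contains far fewer n-grams than the vocabulary does.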

Compressive Sensing.
CS is a novel sampling paradigm that goes against the traditional Nyquist/Shannon theorem, and it shows that a signal can be recovered precisely from only a small set of samples. The success of CS relies on two principles: sparsity and incoherence, where the former defines an S-sparse signal $s$ in $\mathbb{R}^N$ with all but $S$ entries equal to zero, and the latter requires the measure vectors $\{\phi_i \in \mathbb{R}^N\}_{i=1}^{M}$ to be incoherent with $s$. The following briefly describes the CS framework.
By ordering these measure vectors, a measurement matrix $\Phi \in \mathbb{R}^{M \times N}$ is constructed as follows:
$$\Phi = [\phi_1, \phi_2, \ldots, \phi_M]^{T}. \tag{1}$$
Computational Intelligence and Neuroscience
By using $\Phi$ to linearly measure $s$, we obtain the sampled vector $y \in \mathbb{R}^M$ by
$$y = \Phi \cdot s. \tag{2}$$
We define the ratio $M/N$ as the subrate $R$; that is, $R = M/N$, and DR is realized by setting $R$ to be less than 1, but finding $s$ from $y$ then becomes an ill-posed problem. Based on the sparsity of $s$, this problem can be solved by an optimization model:
$$\hat{s} = \arg\min_{s} \|s\|_0 \quad \text{s.t.} \quad y = \Phi \cdot s, \tag{3}$$
where $\|\cdot\|_0$ represents the $l_0$ norm, which counts the number of nonzero entries in $s$, and the solution $\hat{s}$ is an estimate of $s$. The incoherence between $\phi_i$ and $s$ affects the convergence of the solution $\hat{s}$ to the original $s$, which presents a challenge for CS, that is, how to construct incoherent measurement vectors. Fortunately, it is found that random vectors are largely incoherent with any fixed signal, so $\Phi$ can be produced by some random distributions, for example, Gaussian, Bernoulli, and uniform.
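The linear measurement of a sparse signal by a random matrix can be illustrated with the following minimal Python sketch. The signal length, sparsity, subrate, random seed, and the choice of i.i.d. Gaussian entries with variance 1/M are all assumptions made for the illustration; the sketch only performs the measurement and does not solve the recovery problem.

```python
import math
import random

def gaussian_matrix(m, n, rng):
    """M x N matrix with i.i.d. N(0, 1/M) entries (rows are the measure vectors)."""
    sigma = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]

def measure(phi, s):
    """Linear measurement y = Phi . s, skipping the zero entries of s."""
    support = [(j, v) for j, v in enumerate(s) if v != 0.0]
    return [sum(row[j] * v for j, v in support) for row in phi]

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in v))

rng = random.Random(0)
n, subrate, sparsity = 512, 0.25, 8       # assumed toy sizes
m = int(subrate * n)                      # M = R * N

s = [0.0] * n
for j in rng.sample(range(n), sparsity):  # S-sparse signal with random support
    s[j] = rng.gauss(0.0, 1.0)

phi = gaussian_matrix(m, n, rng)
y = measure(phi, s)
print(len(y), norm(y) / norm(s))          # the ratio is expected to be close to 1
```

Note that the dense matrix `phi` holds M x N values, which is exactly the storage cost that becomes prohibitive for large N.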
By performing incoherent measuring with random matrices, CS can be categorized as a random projection technique in DR. In particular, in order to enhance the robustness of recovery, CS further requires $\Phi$ to satisfy the restricted isometry property (RIP) for S-sparse signals. When the RIP holds, $\Phi$ approximately preserves the Euclidean length of S-sparse signals, which implies that all pairwise distances between S-sparse signals are well preserved in the measurement space. In text classification, the feature vectors of documents in a text dataset are highly sparse, so CS can significantly reduce the feature dimensionality while the RIP preserves the pairwise distances between feature vectors. Superior to traditional DR methods, CS ensures less memory consumption and faster computing by SRMs. In view of the merits of CS, we explore CS features extracted by SRMs to speed up text classification.
Figure 2 presents the framework of the proposed CS-based text classification. After text preprocessing, the text dataset is divided into a training dataset $P$ and a testing dataset $Q$, where the former is used to train classifiers and the latter is used to evaluate the classification accuracy. The core of our work is to extract CS features to represent the documents in the text dataset. In CS feature extraction, we represent each document $p_i$ in the training dataset $P$ as a highly sparse vector $x_i$ by BOW and construct an SRM $\Phi \in \mathbb{R}^{M \times N}$ to linearly measure $x_i$, producing the CS feature vector $y_i$ of $x_i$. The CS feature is a low-dimensional and dense vector, which can shorten the time of training the classifier, especially for a large-scale text dataset. In the following parts, we describe CS feature extraction, SRMs construction, and classifiers in detail.

CS Feature Extraction.
We collect unigrams and bigrams from the training dataset $P$ to create the vocabulary of the BOW model. Unigrams are single words from $P$, and most of them occur too rarely to affect classification, so we only add the top $N_1$ words from these unigrams to the BOW vocabulary. Bigrams are word pairs from $P$, and they are a good way to model negation such as "not good." The total number of bigrams is very large, but most of them are noise at the tail of the frequency spectrum, so we add only the top $N_2$ word pairs from these bigrams to the BOW vocabulary. In the experiment part, we set suitable $N_1$ and $N_2$ for different classification tasks.
After collecting unigrams and bigrams, we convert each document $p_i$ in $P$ into the feature vector $x_i$ in sparse representation. The BOW feature $x_i$ is the frequency distribution of $p_i$ over the BOW vocabulary, and its size is $N$, which is the sum of $N_1$ and $N_2$. All BOW features constitute a feature matrix $X$ as follows:
$$X = [x_1, x_2, \ldots, x_{L_1}], \tag{4}$$
where $L_1$ is the number of documents in $P$. In ordinary classification, $X$ is input into the classifier to train it. Because of its large size, $X$ results in the curse of dimensionality; for example, when $N$ and $L_1$ are set to be 25000 and 800000, respectively, the size of $X$ is 25000 × 800000, and it needs a memory of $8 \times 10^{10}$ bytes (≈75 GB), assuming that 4 bytes encode each entry in $X$. That would lead to a heavy computational burden, so we reduce the size of $X$ by CS measuring as follows:
$$Y = \Phi \cdot X, \tag{5}$$
where $\Phi \in \mathbb{R}^{M \times N}$ is a CS measurement matrix and $Y \in \mathbb{R}^{M \times L_1}$ is the CS feature matrix, of which the i-th column $y_i$ is the CS feature vector of the i-th document $p_i$ in the training dataset $P$.
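The memory figure quoted above, and the saving obtained by CS measuring, can be checked with a few lines of Python. The helper name and the three subrates are assumptions made for the illustration; the 4 bytes per entry follow the text.

```python
def dense_bytes(rows, cols, bytes_per_entry=4):
    """Memory needed to store a dense rows x cols matrix."""
    return rows * cols * bytes_per_entry

N, L1 = 25_000, 800_000
bow = dense_bytes(N, L1)
print(bow, bow / 2**30)   # 80000000000 bytes, about 74.5 GiB

# Size of the CS feature matrix Y (M x L1) for a few assumed subrates R = M / N.
for R in (0.1, 0.3, 0.6):
    M = round(R * N)
    print(R, M, dense_bytes(M, L1) / 2**30)
```

Because Y has M rows instead of N, its dense storage shrinks in proportion to the subrate R.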
To recover signals precisely, the CS measurement matrix is required to satisfy the RIP. In practice, a random matrix, e.g., one produced by a Gaussian or Bernoulli distribution, obeys the RIP for S-sparse signals provided that $M$ is sufficiently large relative to $S$ [30]. Since BOW features are highly sparse, $M$ can be set far smaller than $N$, so the size of $Y$ can be significantly reduced. Importantly, the RIP can be strengthened or weakened by widening or narrowing the gap between $M$ and $S$; that is, when $M$ is far larger than $4 \cdot S$, the pairwise distances between S-sparse signals are well preserved in the CS feature space, whereas these pairwise distances can be destroyed as $M$ is gradually reduced, so the subrate $R$ becomes a key factor affecting the accuracy of text classification. In the experiment part, we will evaluate the effects of different $R$ values on the pairwise distances between features and on the accuracy of classification. In general, these random projections are dense, and a common computer does not have sufficient memory to store them, so CS-based DR is not applicable to a large-scale dataset if a traditional method is used to produce the random projections. However, CS offers some measurement matrices for large-scale and real-time applications, among which the best known are SRMs. The following describes how to construct SRMs, so as to make CS-based DR feasible for a large-scale dataset.

SRMs Construction.
SRM, proposed by Do et al. [28], is a well-known sensing framework in the field of CS. With its fast and efficient implementation, it brings several benefits to CS-based DR, for example, low complexity, fast computation, block-based processing support, and optimal incoherence. By using SRMs, the length of the BOW feature can be reduced quickly and substantially, with less memory consumption, while the RIP still holds.
An SRM is defined as a product of three matrices; that is,
$$\Phi = \sqrt{N/M} \cdot D \cdot F \cdot E, \tag{7}$$
where $E \in \mathbb{R}^{N \times N}$ is a random permutation matrix that uniformly permutes the locations of vector entries globally, $F \in \mathbb{R}^{N \times N}$ is an orthonormal matrix constructed from a popular fast computable transform, e.g., the Fast Fourier Transform (FFT), Discrete Cosine Transform (DCT), Walsh-Hadamard Transform (WHT), or their block diagonal versions, $D \in \mathbb{R}^{M \times N}$ is a random subset of $M$ rows of the $N \times N$ identity matrix, which subsamples the input vector, and $\sqrt{N/M}$ is a scale factor that normalizes the transform so that the energy of the subsampled vector is approximately equal to that of the input vector. By plugging (7) into (5), the matrix product $\Phi \cdot X$ can be performed according to a sensing algorithm as shown in Algorithm 1.
The SRM sensing algorithm can be computed fast; that is, its computational complexity is typically in the order of $O(N)$ to $O(N \log N)$. Suppose that $F$ is an FFT or DCT matrix; the implementation of the SRM then takes $O(N \log N)$ operations. The SRM is used to measure the $L_1$ BOW features one by one, which takes $O(L_1 N \log N)$ operations; that is, the total computational complexity of the proposed CS model is $O(L_1 N \log N)$. Compared with existing random projection techniques, SRMs not only have lower time and space complexity, but they also convert the sampled vector into a white noise-like one by scrambling the vector structure to achieve universal incoherence. Therefore, SRMs can make CS-based text classification more efficient.
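A minimal Python sketch of SRM sensing, applying the permutation, the fast transform, and the row subsampling in turn, is given below. It is not a reproduction of Algorithm 1: the function names, seeds, and toy vector are assumptions, the WHT is chosen as the fast transform only because it is short to write in pure Python, and the vector length is assumed to be a power of 2 (other lengths would need zero-padding or a block-diagonal transform).

```python
import math
import random

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform; len(v) must be a power of 2."""
    a = list(v)
    n = len(a)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a[i], a[i + h] = a[i] + a[i + h], a[i] - a[i + h]
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [value * scale for value in a]

def make_srm(n, m, seed=0):
    """Draw the random parts of the SRM: a permutation (E) and M row indices (D)."""
    rng = random.Random(seed)
    perm = list(range(n))
    rng.shuffle(perm)
    rows = rng.sample(range(n), m)
    return perm, rows

def srm_sense(x, perm, rows):
    """y = sqrt(N/M) * D * F * E * x without ever storing an M x N matrix."""
    n, m = len(x), len(rows)
    permuted = [x[p] for p in perm]                 # E: global random permutation
    transformed = fwht(permuted)                    # F: fast orthonormal transform
    scale = math.sqrt(n / m)
    return [scale * transformed[r] for r in rows]   # D: keep M random rows

n, m = 1024, 256                                    # subrate R = M / N = 0.25
perm, rows = make_srm(n, m, seed=1)
x = [0.0] * n
x[3], x[40], x[700] = 2.0, 1.0, 5.0                 # a sparse BOW-like vector
y = srm_sense(x, perm, rows)
print(len(y))
```

Only the permutation and the M selected row indices are stored, so the memory cost grows linearly with N rather than with M x N, and the transform costs O(N log N) operations per feature vector.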

Classifiers.
Many popular classifiers can be used in our model, e.g., SVM, decision tree, AdaBoost, KNN, and Naïve Bayes. In the experiment part, these classifiers are applied and their classification accuracy is evaluated to verify the efficiency of our model. This section reviews these popular classifiers in text classification.
SVM [10] is a nonprobabilistic linear binary classifier. For a training set of points $(y_i, l_i)$, where $y_i$ is the CS feature vector and $l_i$ is the category of the document $d_i$, we try to find the maximum-margin hyperplane that divides the points with $l_i = 1$ from those with $l_i = -1$. The equation of the hyperplane is as follows:
$$w^{T} y + b = 0, \tag{8}$$
where $w$ is the normal vector of the hyperplane and $b$ is the bias. We maximize the margin, denoted by $c$, so as to separate the points well. By the error-correcting output codes (ECOC) model [31], SVM can also undertake multiclass classification tasks.
Decision tree [11] is a classifier model in which each node of the tree represents a test on an attribute of the dataset, its children represent the outcomes, and the leaf nodes represent the final categories of the data points. The training dataset is used to form the decision tree, and the best decision has to be made for each node in the tree. The decision tree can be trained fast, but it is also extremely sensitive to small perturbations in the dataset and can easily overfit. By cross validation and pruning, these effects can be suppressed.
AdaBoost [12] extracts a classifier from the set of weak classifiers at each iteration and assigns a weight to the classifier according to its relevance. The weight in AdaBoost for each sample is measured according to how difficult previous classifiers have found it to classify the sample correctly. At each iteration, a new classifier is trained on the training dataset, and the weights are modified based on how successfully the training sample has been classified before. Training terminates after several iterations or when all training samples are classified correctly.
KNN [13] is a nonparametric technique used for classification. Given the CS feature $y_i$, KNN finds the K-nearest neighbors of $y_i$ among all CS features in the training dataset and gives each category candidate a score based on the labels of the K neighbors. The similarity between $y_i$ and a neighbor can serve as the score contributed to the category of that neighbor. After sorting the score values, KNN assigns $y_i$ to the category with the highest score. KNN is easy to implement and adapts to any kind of feature space. It can also handle multiclass cases. The performance of KNN depends on finding a meaningful distance function, and it is limited by data storage when finding the nearest neighbors for large search problems.
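The scoring rule of KNN described above can be sketched in Python as follows. The Euclidean distance, the similarity 1/(1 + distance), the function name, and the toy features are assumptions made for the illustration; other distance functions and similarity scores are equally possible.

```python
import math
from collections import defaultdict

def knn_predict(query, features, labels, k=3):
    """Score each category by the similarity of the K nearest training features."""
    dists = sorted(
        (math.dist(query, f), label) for f, label in zip(features, labels)
    )
    scores = defaultdict(float)
    for d, label in dists[:k]:
        scores[label] += 1.0 / (1.0 + d)   # assumed similarity used as the vote weight
    return max(scores, key=scores.get)

# Hypothetical two-dimensional CS features with their categories.
train = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
labels = ["neg", "neg", "pos", "pos"]
print(knn_predict([0.1, 0.1], train, labels, k=3))   # prints "neg"
```

Every prediction scans all training features, which illustrates why the storage and search cost of KNN grows with the size of the training dataset.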
Naïve Bayes [14] has been widely used for text classification, and it is a generative model based on Bayes' theorem. This model assumes that the value of a particular feature is independent of the value of any other feature. Accordingly, the proposed CS model assumes that any entry in a CS feature vector is independent of the other entries. Given a to-be-tested CS feature $y$, its category is predicted as follows:
$$\hat{l} = \arg\max_{l} \, p(l \mid y). \tag{9}$$
According to Bayesian inference, we see that
$$p(l \mid y) \propto p(l) \prod_{m=1}^{M} p(y_m \mid l), \tag{10}$$
where $y_m$ is the m-th entry in the CS feature $y$. The probabilities $p(l)$ and $p(y_m \mid l)$ can be estimated by maximum likelihood on the training dataset.
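The Naïve Bayes prediction above can be illustrated with the following Python sketch. Since CS features are dense real-valued vectors, the sketch assumes a Gaussian form for each $p(y_m \mid l)$; this choice, the function names, the variance floor `eps`, and the toy data are assumptions and are not stated in the text.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(features, labels, eps=1e-9):
    """Maximum-likelihood estimates of p(l) and of a Gaussian p(y_m | l)."""
    groups = defaultdict(list)
    for y, l in zip(features, labels):
        groups[l].append(y)
    model = {}
    for l, rows in groups.items():
        count = len(rows)
        means = [sum(col) / count for col in zip(*rows)]
        variances = [
            sum((v - mu) ** 2 for v in col) / count + eps   # eps avoids zero variance
            for col, mu in zip(zip(*rows), means)
        ]
        model[l] = (count / len(features), means, variances)
    return model

def predict_nb(model, y):
    """Return argmax over l of log p(l) + sum_m log p(y_m | l)."""
    def log_posterior(l):
        prior, means, variances = model[l]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
            for v, mu, var in zip(y, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)

# Hypothetical two-dimensional CS features with binary categories.
features = [[0.1, 0.0], [0.0, 0.2], [1.0, 0.9], [1.1, 1.2]]
labels = [-1, -1, 1, 1]
model = fit_gaussian_nb(features, labels)
print(predict_nb(model, [0.05, 0.1]), predict_nb(model, [1.0, 1.0]))   # -1 1
```

Working with log-probabilities turns the product over the M entries into a sum, which avoids numerical underflow when M is large.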

Dataset and Setting.
We conduct experiments on two datasets, one for a binary classification task and the other for a multiclass classification task. For the binary classification task, we use the Twitter sentiment dataset, which was crawled and labeled positive or negative. For the multiclass classification task, we use the weather report dataset, which contains a text description and category labels for each event, including thunderstorm wind, hail, flash flood, high wind, and winter weather. The classes of the two datasets are imbalanced, especially for the weather report dataset. To avoid the effects of imbalance on classification accuracy, the two datasets are preprocessed to make their classes balanced; i.e., for the Twitter sentiment dataset, we randomly remove some positive and negative observations so that each class has 10000 observations; for the weather report dataset, we delete the classes with few observations, and 9 classes remain: thunderstorm wind, hail, flash flood, high wind, winter weather, Marine Thunderstorm Wind, Winter Storm, Heavy Rain, and Flood, among which one has 1000 observations. Figure 3 presents the statistics of the Twitter sentiment dataset and the weather report dataset after balancing. For each dataset, 20% of the observations in each class are set aside at random for testing. In feature extraction, we first preprocess the documents in the two datasets as follows: (1) tokenize the documents; (2) lemmatize the words; (3) erase punctuation; (4) remove a list of stop words such as "and," "of," and "the"; (5) remove words with 2 or fewer characters; (6) remove words with 15 or more characters. Then, for both datasets, we collect the top 8000 unigrams and 10000 bigrams from the training set to construct the BOW vocabulary, i.e., $N_1 = 8000$ and $N_2 = 10000$, and represent each training observation as a BOW feature vector of length $N = 18000$.
Finally, by setting different subrates, the SRMs are used to measure the BOW feature vectors, and the corresponding CS feature vectors are produced. We train different classifiers on the BOW-based and CS-based training sets, respectively, tune parameters by cross validation, and evaluate these classifiers on the test sets. Due to the random partition of the dataset, the training and testing are repeated five times, and the mean testing accuracy is used as the evaluation metric. The experimental settings are as follows. To evaluate the effects of different SRMs on feature distance and classification accuracy, we construct five SRMs by using transform matrices $F$ including DCT, FFT, Block DCT, Block WHT, and Block Gaussian, in which the latter three are block diagonal matrices whose diagonal blocks are DCT, WHT, and Gaussian matrices of size 32 × 32, respectively. We use six classifiers including SVM, decision tree, AdaBoost, KNN, and Naïve Bayes to evaluate the classification accuracy of our model and compare the proposed CS model with three DR methods: PCA [17], ICA [18], and NMF [19]. The subrate $R$ is set to be between 0.1 and 0.6, and it is a preset parameter, which is used to decide the length of the CS feature vector. All of the experiments are conducted under the following computer configuration: Intel(R) Core (TM) i7 @3. 30

Effects of SRMs.
Feature distance measures the similarity between any two documents, which has a significant impact on training accuracy. If the features output by DR preserve their pairwise distances in the original space well, DR suppresses the loss of training accuracy; therefore, we evaluate the effects of SRMs on the pairwise distances between text features. In the training set $P$, the average distance between the i-th BOW or CS feature and the others is computed by (12) and (13), in which $x_i$ and $y_i$ are, respectively, the i-th BOW and CS feature vectors in $P$ and $L_1$ is the number of documents in $P$. We select Block DCT as the core of the SRM and use (12) and (13) to compute the average distance of each BOW and CS feature, as shown in Figure 4. We can see that the tendencies of all distance curves are similar, and the curve of the CS features moves closer to that of the BOW features as the subrate increases, which indicates that the pairwise distances between BOW features correspond to those between CS features. To measure the distance differences between BOW and CS features, we compute the Mean Square Error (MSE) between the average distances of the BOW and CS features. Table 1 presents the MSEs on the multiclass classification dataset when using different subrates and SRMs. It can be seen from Table 1 that all SRMs provide similar MSEs at any subrate; e.g., the average MSE of each SRM over all subrates is about 11.00, and the MSEs of the SRMs decrease as the subrate increases; e.g., the MSE of DCT is 18.78 at the subrate of 0.1, and it is reduced to 5.92 at the subrate of 0.6. These MSE results indicate that SRMs can approximately preserve the pairwise distances between BOW features in the CS feature space.
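The distance comparison described above can be sketched in Python as follows. The sketch assumes the Euclidean distance, averages over the other $L_1 - 1$ features, and uses a plain mean of squared differences for the MSE; these choices, the function names, and the toy features are assumptions, since the exact normalization used in the experiments is not reproduced here.

```python
import math

def average_distances(features):
    """Mean Euclidean distance from each feature vector to all the others."""
    count = len(features)
    return [
        sum(math.dist(f, g) for j, g in enumerate(features) if j != i) / (count - 1)
        for i, f in enumerate(features)
    ]

def mse(a, b):
    """Mean square error between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Hypothetical BOW features and CS features that keep the same pairwise distances.
bow = [[0.0, 0.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [3.0, 4.0, 0.0, 0.0]]
cs = [[3.0, 4.0], [0.0, 0.0], [-4.0, 3.0]]
d_bow = average_distances(bow)
d_cs = average_distances(cs)
print(d_bow, d_cs, mse(d_bow, d_cs))
```

An MSE near zero means that the average-distance curve of the CS features follows that of the BOW features closely.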
Then, we select SVM as the classifier in our model and evaluate the effects of SRMs on classification accuracy. With different SRMs, the accuracies of the SVM classifier on the binary and multiclass classification datasets are presented in Table 2. It can be seen that all SRMs provide similar accuracies in most cases at any subrate; e.g., with all subrates considered, the average accuracies of the SRMs range from 0.7121 to 0.7203 on the binary classification dataset, and similar results are obtained on the multiclass classification dataset. We also see that the accuracy is gradually improved for any SRM as the subrate increases.
The above results indicate that the selection of SRMs has little impact on classification accuracy, and the subrate is a key factor in controlling the accuracy. Therefore, any SRM can be used in our model, and we need to consider the balance between accuracy and subrate in practice.

Evaluation on Classifiers.
To verify the validity of CS, we have compared CS features and BOW features in terms of the accuracies and training time of the different classifiers driven by them. The Block DCT is selected as the SRM, and the accuracy results are presented in Table 3. It can be seen that, for binary classification, the accuracies of the classifiers driven by the CS features go up with the increase of the subrate. Though lower than those with the BOW feature when the subrate is small, they quickly catch up; e.g., for SVM, the CS feature overtakes the BOW feature when the subrate is 0.3 and outperforms it thereafter. With all the classifiers considered, the average accuracy obtained with the CS features is also comparable with that obtained with the BOW feature. The same result can be obtained for multiclass classification. As for the training time in Figure 5, whether for binary or multiclass classification, the CS feature costs far less than the BOW feature, especially when the subrate is small. Table 4 presents the average accuracy, precision, recall, and $F_1$ over all classifiers for the binary classification dataset. It can be seen that the precision, recall, and $F_1$ obtained with CS features at any subrate are similar to those obtained with BOW features, which indicates that the classification accuracy is reliable for CS features. From the above results, it can be

Conclusion
In this paper, we develop a CS-based model for text classification tasks. Traditionally, the BOW features are extracted from the text dataset, and they are the highly sparse representations with a huge dimensionality. It costs a lot to train classifiers by using BOW features. By using the incoherent measuring of CS, we greatly reduce the dimensionality of BOW features, and, at the same time, the RIP of CS ensures that the pairwise distances between BOW features are well preserved in a low-dimensional CS feature space. CS also provides the SRMs that are fast computable with low memory consumption. In the proposed model, different SRMs are constructed to linearly measure BOW features at a preset subrate, generating the CS features that are used to train the classifiers. Experimental results show that the proposed CS model provides a comparable classification accuracy with the traditional BOW model and

Conflicts of Interest
The authors declare no conflicts of interest.