Gamma-Poisson Distribution Model for Text Categorization

We introduce a new model for describing word frequency distributions in documents for automatic text classification tasks. In the model, the gamma-Poisson probability distribution is used to achieve better text modeling. The framework of the modeling and its application to text categorization are demonstrated with practical techniques for parameter estimation and vector normalization. To investigate the efficiency of our model, text categorization experiments were performed on the 20 Newsgroups, Reuters-21578, Industry Sector, and TechTC-100 datasets. The results show that the model achieves performance comparable to that of the support vector machine and clearly exceeding that of the multinomial model and the Dirichlet-multinomial model. The time complexity of the proposed classifier and its advantage in practical applications are also discussed.


Introduction
The Poisson distribution is one of the most commonly used models for describing the number of random occurrences of a phenomenon in a specified unit of space or time. This means that if we want to model the number of discrete occurrences that take place during a given interval, we should first check whether or not the Poisson distribution provides a good approximation. For text modeling, it is justified to adopt the Poisson distribution for describing the number of occurrences of a certain word in documents of fixed length when the assumption that word occurrences are independent holds in an approximate sense. It has been well established, however, that the Poisson model does not fit observed data [1,2]. The reason for the failure of the Poisson model is that, for most words, the predicted variance, which is equal to the Poisson mean (the expected number of occurrences during the given interval), systematically underestimates the actual variance. Although this inadequate description of word distributions by the Poisson model can be exploited for keyword selection in information retrieval [1] and for feature selection in text categorization [2][3][4], an improvement in the accuracy of the description is needed in order to build a high-performance text classifier using the model.
As proposed by Church and Gale [5], it is natural to extend the simple Poisson to a Poisson mixture in order to describe the observed variance in actual documents. Here, a Poisson mixture is a probability function that is expressed as a weighted sum of infinitely many Poisson distributions, with the weights given by a certain weighting function. The Poisson mixture is therefore considered to be a hierarchical model because a two-step process is required to generate a sample: we first pick a Poisson distribution with a probability given by the weighting function and then draw the sample from the chosen Poisson distribution. A reasonable choice of the weighting function is the conjugate prior of the Poisson, that is, the gamma distribution, because it greatly simplifies the mathematical treatment [6], and this choice leads to the joint gamma-Poisson distribution.
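The two-step generation process described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the gamma prior is taken in the shape/rate form, the values `alpha = 0.5` and `beta = 0.25` are hypothetical, and the helper names are not from the paper. It draws counts from the mixture and from a plain Poisson with the same mean, so that the larger variance of the mixture can be observed directly.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_gamma_poisson(alpha, beta, rng):
    """Two-step draw: a rate from Gamma(shape=alpha, rate=beta), then a Poisson count."""
    lam = rng.gammavariate(alpha, 1.0 / beta)  # gammavariate expects (shape, scale)
    return sample_poisson(lam, rng)

def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
alpha, beta = 0.5, 0.25  # hypothetical values; the mean rate is alpha / beta = 2
mixture = [sample_gamma_poisson(alpha, beta, rng) for _ in range(20000)]
plain = [sample_poisson(alpha / beta, rng) for _ in range(20000)]
mix_mean, mix_var = mean_and_variance(mixture)
poi_mean, poi_var = mean_and_variance(plain)
print(f"mixture: mean={mix_mean:.2f} var={mix_var:.2f}")
print(f"poisson: mean={poi_mean:.2f} var={poi_var:.2f}")
```

Both samples share the same mean, but the mixture variance exceeds it, which is the overdispersion that the simple Poisson cannot reproduce.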
In this paper, we focus on the use of the gamma-Poisson distribution to construct a new generative probabilistic text classifier. We believe that this is worthwhile for the following reasons.
(i) Several attempts to extend the original generative probabilistic classifier by using a conjugate prior for better text modeling have already been suggested; the reported conjugate prior-likelihood pairs are the Dirichlet-multinomial [7], the beta-negative binomial [8], and the beta-binomial [9]. To the best of our knowledge, the gamma-Poisson distribution has not yet been used to construct a generative probabilistic text classifier. As mentioned above, since the model using the gamma-Poisson distribution can be regarded as one of the most fundamental and natural for modeling texts, it is useful to illustrate its framework.
(ii) It will be shown in a later section that our new classifier using the gamma-Poisson distribution clearly outperforms the original generative probabilistic classifier (multinomial naive Bayes) and is highly competitive with the support vector machine (SVM), which is the state of the art in terms of classification accuracy. This means that gamma-Poisson modeling is attractive not only for theoretical work but also for practical applications.
Note that the negative binomial distribution, which results from marginalizing the gamma-Poisson pair over the Poisson mean [6], has been used for text modeling [5] and for text categorization [10]. In those studies, however, the hierarchical structure of gamma-Poisson modeling of texts was not taken into account, and the negative binomial distribution was merely used as a simple tool for describing word distributions. In the present work, the proposed classifier is based on hierarchical modeling; in other words, the parameter estimation and classification procedures directly reflect the hierarchical structure of the gamma-Poisson distribution. In this sense, our approach is fundamentally different from the previous works.
To demonstrate that the proposed modeling is useful in practical applications, the classification accuracy and the computation time of the algorithm are examined using four standard datasets. The results lead us to conclude that the classifier with the proposed modeling is the most suitable among several tested classifiers in the case of noisy datasets. Furthermore, the advantage of the proposed model in incremental learning tasks that are frequently needed in practical applications will be reported in detail.
The rest of the paper is organized as follows. Section 2 summarizes related works on the improvement of the generative probabilistic classifier in which some conjugate priors are utilized for better text modeling. In Section 3, we describe the framework of gamma-Poisson modeling of texts and how to construct a classifier by using the framework. Section 4 describes our experiments on automatic text classification. We summarize these results in Section 5 and discuss the characteristics of gamma-Poisson modeling in Section 6. In the last section, we give our conclusions.

Related Work
We first give a brief overview of the text categorization problem and introduce notation for later use. For the sake of completeness and later reference, we then review the original generative probabilistic classifier and its descendants, in which conjugate priors are used for better text modeling. Except for the beta-negative binomial model, all the models described in this section will be used in corresponding classifiers, and their performance will be compared with that of our new classifier through experiments.

Text Categorization Problem.
Text classification/categorization is defined as the task of classifying documents into a fixed number of predefined categories. Categories are also called classes. Let $\mathcal{D}$ denote a domain of possible text documents, and let $C = \{c_1, c_2, \ldots, c_{|C|}\}$ be a finite set of predefined categories. In the conventional text categorization setting used in this study, each document $d \in \mathcal{D}$ is assigned to a single category $c \in C$. (Note that to simplify the problem, we consider here the categorization where each document is assigned to only one class. In other words, every document is assumed to be single labeled, not multilabeled. This assumption is made throughout this work.) We are given a set of training documents $D = \{d_1, d_2, \ldots, d_{|D|}\}$, which is a subset of $\mathcal{D}$, that is, $D \subset \mathcal{D}$. We assume that there is a target concept $\Phi : \mathcal{D} \to C$ that maps documents to categories. The result of the mapping $\Phi(d)$ is known for documents in the training set $D$; that is, each training document $d_j \in D$ has a class label $l_j \in \{c_1, c_2, \ldots, c_{|C|}\}$, which indicates that document $d_j$ belongs to the category corresponding to the label. In the training phase, we attempt to find the classification function $f : \mathcal{D} \to C$, which approximates $\Phi$ from the information contained in the training set $D$. In the test phase, $f$ is used to classify new documents, the labels of which are unknown. The most important objective is to find the $f$ that maximizes accuracy (i.e., the percentage of times $f$ and $\Phi$ agree).
To obtain a good $f$, the training set $D$ must contain sample documents of all possible categories. If this is the case, $D$ is expressed as
$$D = \bigcup_{c \in C} D_c, \tag{1}$$
where $D_c$ denotes the set of training documents that belong to a class $c$.

The Generative Probabilistic Classifier.
The generative probabilistic classifier, often referred to as the multinomial classifier or multinomial naive Bayes, is one of the most popular classifiers for text categorization because it sometimes achieves good performance in various tasks, and because it is simple enough to be practically implemented even with a great number of features [11,12]. The simplicity is mainly due to the following two assumptions. First, an individual document is assumed to be represented as a vector of word counts (bag-of-words representation). Since this representation greatly simplifies further processing, all the descendants including our new classifier inherit this first assumption.
Next, documents are assumed to be generated by repeatedly drawing words from a fixed multinomial distribution for a given class, and word emissions are thus independent.
From the first assumption, documents can be represented as vectors of count-valued random variables. The $j$th document in a considered class $c$ is then expressed as
$$d_j = (x_{j1}, x_{j2}, \ldots, x_{j|V|}), \tag{2}$$
where $x_{ji}$ is the count of the $i$th term $t_i$ in the $j$th document in $D_c$ and $|V|$ is the vocabulary size; in other words, we have assumed here that the vocabulary of the considered dataset is given as $V = \{t_1, t_2, \ldots, t_{|V|}\}$, where $t_i$ is the $i$th word in the vocabulary. From the second assumption, the probability of the document $d_j$ given by vector (2) is
$$p(d_j \mid \theta) = \frac{\left(\sum_{i=1}^{|V|} x_{ji}\right)!}{\prod_{i=1}^{|V|} x_{ji}!} \prod_{i=1}^{|V|} \theta_i^{x_{ji}}, \tag{3}$$
where $\theta_i$ is the probability for the emission of $t_i$ and is subject to the constraint $\sum_{i=1}^{|V|} \theta_i = 1$. Note that for text classification, the parameters $\theta_i$ must be evaluated for each possible class $c$. We use the estimator for $\theta_i$ given by (4). A document $d$ with word counts $x_i$ is then scored for each class by
$$p(c \mid d) \propto p(c) \prod_{i=1}^{|V|} \theta_i^{x_i}, \tag{5}$$
where $p(c)$ is the prior probability of class $c$, which is estimated from the training set by $p(c) = |D_c|/|D|$. We estimate $\theta_i$ in (5) by using (4) for each specified class $c$. The document is assigned to the class with the highest probability $p(c \mid d)$.
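The training and classification steps of the multinomial classifier can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the add-one (Laplace) smoothing used for the word probabilities is an assumption standing in for the estimator (4), and the function names and the toy data are hypothetical. Scores are accumulated in log space to avoid numerical underflow.

```python
import math

def train_multinomial(docs_by_class, vocab_size):
    """Return log priors and log word probabilities for each class.

    docs_by_class maps a class label to a list of count vectors.
    """
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    log_prior, log_theta = {}, {}
    for label, docs in docs_by_class.items():
        log_prior[label] = math.log(len(docs) / total_docs)  # p(c) = |D_c| / |D|
        counts = [sum(d[i] for d in docs) for i in range(vocab_size)]
        denom = sum(counts) + vocab_size  # add-one smoothing (assumption)
        log_theta[label] = [math.log((n + 1) / denom) for n in counts]
    return log_prior, log_theta

def classify_multinomial(doc, log_prior, log_theta):
    """Assign the class with the highest log p(c) + sum_i x_i log theta_i."""
    def score(label):
        return log_prior[label] + sum(x * lt for x, lt in zip(doc, log_theta[label]))
    return max(log_prior, key=score)

# Toy training data with a three-word vocabulary (hypothetical).
train = {"sports": [[3, 0, 1], [2, 1, 0]], "finance": [[0, 4, 1], [0, 2, 2]]}
log_prior, log_theta = train_multinomial(train, vocab_size=3)
label = classify_multinomial([2, 0, 1], log_prior, log_theta)
print(label)
```

The multinomial coefficient of (3) is omitted from the score because it does not depend on the class.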
The framework of the multinomial classifier described above usually works well in practical text classification tasks [11]. It has been pointed out, however, that the multinomial distribution used in the multinomial naive Bayes cannot describe the word burstiness phenomena that are inherently encountered in natural language documents [7]. Here, burstiness means the tendency of words in a document to appear in bursts; that is, if a word appears once, it is more likely to appear again. Since multinomial modeling is based on the assumption of independent word emissions with a fixed multinomial distribution for a considered class, the resultant word distribution fails to capture the burstiness, especially for words with moderate and low frequencies [7].

Dirichlet-Multinomial Model.
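A substantial improvement in describing the word burstiness phenomena has been achieved by introducing the Dirichlet-multinomial model, which models texts in a hierarchical manner to capture the burstiness [7]. In the model, the word count vector representing each document is generated by a multinomial distribution whose parameters are generated by its conjugate prior, that is, the Dirichlet distribution. The Dirichlet distribution is defined as
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{|V|} \alpha_i\right)}{\prod_{i=1}^{|V|} \Gamma(\alpha_i)} \prod_{i=1}^{|V|} \theta_i^{\alpha_i - 1}, \tag{6}$$
where $\alpha = (\alpha_1, \ldots, \alpha_{|V|})$ are the parameters of the Dirichlet. The probability of a document $d$ is obtained by integrating over $\theta$,
$$p(d \mid \alpha) = \int p(d \mid \theta)\, p(\theta \mid \alpha)\, d\theta, \tag{7}$$
where we use the multinomial distribution function, (3), for $p(d \mid \theta)$. The classifier using Dirichlet-multinomial modeling computes the class-specific probability of a given document $d$ as
$$p(d \mid \alpha_c) = \frac{n!}{\prod_{i=1}^{|V|} x_i!} \frac{\Gamma\left(\sum_{i} \alpha_{ci}\right)}{\Gamma\left(\sum_{i} (x_i + \alpha_{ci})\right)} \prod_{i=1}^{|V|} \frac{\Gamma(x_i + \alpha_{ci})}{\Gamma(\alpha_{ci})} \propto \frac{\Gamma\left(\sum_{i} \alpha_{ci}\right)}{\Gamma\left(\sum_{i} (x_i + \alpha_{ci})\right)} \prod_{i=1}^{|V|} \frac{\Gamma(x_i + \alpha_{ci})}{\Gamma(\alpha_{ci})} \tag{8}$$
for each class $c$ and assigns the document to the class with the highest probability. In the second expression of (8), we drop the term $n!/\left(\prod_{i=1}^{|V|} x_i!\right)$, with $n = \sum_{i=1}^{|V|} x_i$, which does not depend on class $c$. As in the case of $\theta_i$ in multinomial modeling, one set of parameters for the Dirichlet, $\alpha_i$ ($i = 1, \ldots, |V|$), must be evaluated for each possible class. For the evaluation, we use the leave-one-out likelihood maximization method proposed by Minka [13], which offers a convergent fixed-point iteration for the update of $\alpha_i$, (9). It has been confirmed that the Dirichlet-multinomial leads to better text modeling in the sense that it can describe the burstiness for all word types ranging from frequent words to rare words. This success has recently led to two major modeling approaches along the lines of hierarchical modeling of texts: beta-binomial modeling and beta-negative binomial modeling.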
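As an illustration of how the class-dependent part of the Dirichlet-multinomial probability can be evaluated, the sketch below works in log space with the log-gamma function. It assumes the standard Dirichlet-multinomial form with the multinomial coefficient dropped; the function name and the parameter values are hypothetical. The last lines show the burstiness effect: with small parameters, a document that repeats one word scores higher than one that uses two different words once each.

```python
import math

def dcm_log_score(doc, alpha):
    """Class-dependent part of log p(d | alpha) for the Dirichlet-multinomial.

    The multinomial coefficient n! / prod(x_i!) is omitted because it does
    not depend on the class.
    """
    n = sum(doc)
    alpha_sum = sum(alpha)
    score = math.lgamma(alpha_sum) - math.lgamma(alpha_sum + n)
    for x, a in zip(doc, alpha):
        score += math.lgamma(x + a) - math.lgamma(a)
    return score

# Hypothetical two-word vocabulary with small, equal parameters.
small_alpha = [0.1, 0.1]
bursty = dcm_log_score([2, 0], small_alpha)  # the same word twice
spread = dcm_log_score([1, 1], small_alpha)  # two different words
print(bursty > spread)
```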

Beta-Binomial Model.
The beta-binomial distribution model is derived with consideration of a serious drawback in Dirichlet-multinomial modeling [9]. If we use the Dirichlet distribution, (6), to describe the probability density of $\theta_i$, then it is concluded that words having the same expectation value of $\theta_i$ also have the same variance of $\theta_i$. This is an undesirable property of the Dirichlet for text modeling because we aim to model different words as having the same expected value but different variances in order to describe various word occurrence patterns.
Allison [9] addressed this problem with the following assumptions.
(i) The probability of the occurrence of a document $d$ is a product of independent terms, each of which represents the probability of the number of emissions of an individual word.
(ii) Each term is a binomial probability whose parameter $\theta_i$ follows a beta distribution with parameters $\alpha_i$ and $\beta_i$.
The assumptions described above allow means and variances for each $\theta_i$ to be specified separately and lead us to an expression of the probability of document $d$ as
$$p(d \mid \alpha_c, \beta_c) = \prod_{i=1}^{|V|} \binom{n}{x_i} \frac{B(x_i + \alpha_{ci},\, n - x_i + \beta_{ci})}{B(\alpha_{ci}, \beta_{ci})}, \tag{10}$$
where $n$ is the length of the document, $B(\cdot)$ is the beta function, and $\alpha_{ci}$ and $\beta_{ci}$ are the parameters of the beta distribution. Note that the parameters $\alpha_i$ and $\beta_i$ in the second assumption are replaced with $\alpha_{ci}$ and $\beta_{ci}$ in the above equation since we aim to calculate $p(d \mid \alpha_c, \beta_c)$ for each class $c$ to build a classifier. The classifier computes the class-specific probability $p(c \mid d) \propto p(c)\, p(d \mid \alpha_c, \beta_c)$, and the document $d$ is assigned to the most probable class. Following Allison [9], we use the method of moments to evaluate the sets of parameters $\alpha_{ci}$ and $\beta_{ci}$ for each class.
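A per-word beta-binomial term of this kind can be evaluated in log space as sketched below. The sketch assumes the standard beta-binomial mass function with the document length `n` as the number of trials; the function names are hypothetical and the parameter values are chosen only for illustration.

```python
import math

def log_beta(a, b):
    """log B(a, b) via the log-gamma function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_binomial_coefficient(n, x):
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)

def beta_binomial_log_pmf(x, n, alpha, beta):
    """log probability of x occurrences in n trials under a beta-binomial."""
    return (log_binomial_coefficient(n, x)
            + log_beta(x + alpha, n - x + beta)
            - log_beta(alpha, beta))

def beta_binomial_log_score(doc, alphas, betas):
    """Sum of independent per-word terms; n is the document length."""
    n = sum(doc)
    return sum(beta_binomial_log_pmf(x, n, a, b)
               for x, a, b in zip(doc, alphas, betas))

# Sanity check: the per-word probabilities sum to one over x = 0..n.
total = sum(math.exp(beta_binomial_log_pmf(x, 12, 0.4, 3.0)) for x in range(13))
print(round(total, 6))
```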

Beta-Negative Binomial Model.
The beta-negative binomial model is derived from consideration of the adequate empirical fit of the negative binomial distribution for text modeling [8]. The classifier using the model has been proved to be comparable to the multinomial naive Bayes in terms of classification performance, and this has been confirmed experimentally [8]. Hence, we do not test this model in the present experiment.

Gamma-Poisson Modeling of Text
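The Poisson distribution describing the number of occurrences $x_i$ of the $i$th word in a document of fixed length is
$$p(x_i \mid \lambda_i) = \frac{\lambda_i^{x_i} e^{-\lambda_i}}{x_i!}, \tag{11}$$
where $\lambda_i$ is the Poisson mean. Its conjugate prior determining the density of $\lambda_i$ is given by the gamma distribution, the density function of which is
$$p(\lambda_i \mid \alpha_i, \beta_i) = \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)} \lambda_i^{\alpha_i - 1} e^{-\beta_i \lambda_i}. \tag{12}$$
We incorporate the gamma-Poisson description in our model under assumptions that are similar to the beta-binomial case.
(i) The probability of document $d$ can be decomposed into a product of independent terms, each of which represents the probability of the number of emissions of an individual word.
(ii) Each term can be expressed as the average of the Poisson probability, (11), over the gamma density, (12).
Consequently, the probability of document $d$ is
$$p(d \mid \alpha_c, \beta_c) = \prod_{i=1}^{|V|} \int_0^{\infty} p(x_i \mid \lambda_i)\, p(\lambda_i \mid \alpha_{ci}, \beta_{ci})\, d\lambda_i = \prod_{i=1}^{|V|} \frac{\Gamma(x_i + \alpha_{ci})}{\Gamma(\alpha_{ci})\, x_i!} \left(\frac{\beta_{ci}}{1 + \beta_{ci}}\right)^{\alpha_{ci}} \left(\frac{1}{1 + \beta_{ci}}\right)^{x_i}, \tag{13}$$
where the parameters are replaced with class-specific ones.
In the last expression of (13), we can see that each term of the product becomes a mass function of the negative binomial distribution when $\alpha_{ci}$ is an integer value [6]. The classifier computes the class-specific probability $p(c \mid d) \propto p(c)\, p(d \mid \alpha_c, \beta_c)$, and the document $d$ is assigned to the most probable class.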
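The classification rule can be sketched in log space as follows. This is an illustrative sketch, not the authors' implementation: it assumes a gamma prior with shape `alpha` and rate `beta`, so that each per-word term is a negative binomial mass, and the model values and function names are hypothetical. The log-gamma function accepts real arguments, so real-valued (normalized) counts pose no difficulty.

```python
import math

def gamma_poisson_log_pmf(x, alpha, beta):
    """log of the gamma-Poisson (negative binomial) mass at count x.

    Assumes a gamma prior with shape alpha and rate beta.
    """
    return (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1.0)
            + alpha * math.log(beta / (1.0 + beta))
            - x * math.log(1.0 + beta))

def class_score(doc, log_prior, alphas, betas):
    """log p(c) plus the per-word terms, omitting the class-independent x_i! factor."""
    score = log_prior
    for x, a, b in zip(doc, alphas, betas):
        score += gamma_poisson_log_pmf(x, a, b) + math.lgamma(x + 1.0)
    return score

def classify(doc, model):
    """model maps a label to (log_prior, alphas, betas)."""
    return max(model, key=lambda label: class_score(doc, *model[label]))

# Hypothetical two-word model: word 0 is frequent in "sports", word 1 in "finance".
model = {
    "sports": (math.log(0.5), [2.0, 0.5], [0.1, 0.5]),
    "finance": (math.log(0.5), [0.5, 2.0], [0.5, 0.1]),
}
predicted = classify([25.5, 1.5], model)
print(predicted)
```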

Normalization of Document Length.
To satisfy the conditions for using the gamma-Poisson description, documents must be normalized in length. Although several methods for normalizing document vectors have been proposed [12,14,15], we choose the simplest one, which normalizes the resulting vector in terms of the $L_1$ norm. The conversion of the $i$th component in a count-valued document vector is expressed as
$$x_i \to \frac{l\, x_i}{\sum_{k=1}^{|V|} x_k}, \tag{14}$$
which gives the predefined fixed length of the normalized document as $l$. The normalization factor in the $L_1$ sense is $1/\sum_{i=1}^{|V|} x_i$ as seen in (14); in our experience, this factor leads to better classification performance than the common $L_2$ sense normalization with a factor $1/\sqrt{\sum_{i=1}^{|V|} x_i^2}$. This is because, in generative modeling of text, the document length in the $L_1$ sense (i.e., the total word count of a document) is considered to be the number of trials in which the selection of a considered word corresponds to a success. We set $l = 100$ for all documents because it allows intuitive understanding of the composition of normalized count vectors.
The normalization using (14) with $l = 100$ converts an integer-valued document vector into a real-valued one. This means that $x_i$ in (13) changes from integer to real, but it is not necessary to take into account the effects of this change when (13) is used in the classifier; the factorial $x_i!$ in (13) does not depend on class $c$, and we can safely omit $x_i!$ from the calculation of the class-specific probability $p(c \mid d)$. Thus, the real-valued $x_i$ does not cause any further difficulties in the evaluation of (13).
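The length normalization is simple enough to show in a few lines. In the sketch below the function name is hypothetical, and returning a zero vector for an empty document is an assumption that the text does not address.

```python
def normalize_l1(counts, length=100.0):
    """Rescale a count vector so that its components sum to `length`."""
    total = sum(counts)
    if total == 0:
        # Empty document: nothing to rescale (handling is an assumption).
        return [0.0] * len(counts)
    return [length * x / total for x in counts]

normalized = normalize_l1([3, 1, 0, 4])
print(normalized)
```

With the length fixed at 100, each component can be read as the percentage of the document taken up by the corresponding word.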

Estimation of Parameters.
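To compute the class-specific probability $p(c \mid d) \propto p(c)\, p(d \mid \alpha_c, \beta_c)$, we must use sets of parameters $\alpha_{ci}$ and $\beta_{ci}$ ($i = 1, \ldots, |V|$), which are estimated for each specified class $c$ from $D_c$ (the set of training documents belonging to the class $c$). We use two different methods for the estimation: a rational approximation [16] and an iterative method [17].
The rational approximation estimates the parameters $\alpha_{ci}$ and $\beta_{ci}$ through a set of equations, (15)-(19), which start from the arithmetic mean and the geometric mean of the counts,
$$\bar{x}_i = \frac{1}{|D_c|} \sum_{j=1}^{|D_c|} x_{ji}, \tag{15}$$
$$\tilde{x}_i = \left(\prod_{j=1}^{|D_c|} x_{ji}\right)^{1/|D_c|}, \tag{16}$$
where $x_{ji}$ is the count of the $i$th word in the $j$th document and $|D_c|$ is the number of documents belonging to the considered class. The estimate $\hat{\alpha}_{ci}$ is obtained from these two means, and the estimator for $\beta_{ci}$ is $\hat{\beta}_{ci} = \hat{\alpha}_{ci}/\bar{x}_i$.
The iterative method provides an update formula for $\alpha_{ci}$, (20), in which $\Psi(\cdot)$ and $\Psi'(\cdot)$ are the digamma and trigamma functions, respectively, and the remaining quantities, including $\bar{x}_i$, are the same as in the rational approximation. The estimator for $\beta_{ci}$ is still defined as $\hat{\beta}_{ci} = \hat{\alpha}_{ci}/\bar{x}_i$. In the iteration, we use the initial value of $\alpha_{ci}$ given in [17] and apply a convergence criterion to terminate the update. The resultant classification performance of each method in estimating parameters will be compared in the experimental section.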
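To make the estimation step concrete, the sketch below computes a closed-form estimate of the gamma shape from the arithmetic mean and the mean of the logarithms (the log of the geometric mean). Note that the formula used here is a different, widely quoted closed-form approximation to the maximum likelihood shape, not the rational approximation of [16] or the update of [17]; it is swapped in only because it needs no special functions. The function name, the `min_log_ratio` guard, and the sample values are hypothetical; the replacement of zero counts by 0.001 follows the implementation notes of the paper.

```python
import math

def estimate_gamma_parameters(values, zero_replacement=0.001, min_log_ratio=0.001):
    """Approximate maximum likelihood estimate of gamma shape and rate.

    Uses s = ln(arithmetic mean) - mean(ln x) and the closed form
    alpha ~ (3 - s + sqrt((s - 3)^2 + 24 s)) / (12 s), then beta = alpha / mean.
    """
    xs = [x if x > 0 else zero_replacement for x in values]
    arithmetic_mean = sum(xs) / len(xs)
    mean_log = sum(math.log(x) for x in xs) / len(xs)  # log of the geometric mean
    # Guard against s = 0 when all values are equal (hypothetical safeguard).
    s = max(math.log(arithmetic_mean) - mean_log, min_log_ratio)
    alpha = (3.0 - s + math.sqrt((s - 3.0) ** 2 + 24.0 * s)) / (12.0 * s)
    beta = alpha / arithmetic_mean
    return alpha, beta

alpha_hat, beta_hat = estimate_gamma_parameters([1.0, 2.0, 3.0, 4.0, 5.0])
print(alpha_hat, beta_hat)
```

By construction the fitted gamma distribution reproduces the sample mean, since the ratio of the estimated shape to the estimated rate equals the arithmetic mean.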

Experimental
To investigate the behavior of the proposed gamma-Poisson model described in the previous section, we perform experiments on automatic text categorization. In the experiments, the performance of a classifier using gamma-Poisson modeling is compared with the performance of other classifiers that also use probabilistic modeling but with different distributions. As mentioned above, the models selected for the comparison are the multinomial, the Dirichlet-multinomial, and the beta-binomial. In our experiments, the support vector machine (SVM) is also used as a standard discriminative classifier because previous comparative studies on classifiers [14,18] have consistently shown that SVM is the state of the art in terms of classification accuracy.

Data Set.
For our experiments, we use four different datasets that are chosen to represent a wide spectrum of text classification tasks.
The first one is the 20 Newsgroups dataset, which was originally collected with a netnews-filtering system [19] and contains approximately 20,000 documents partitioned (nearly) evenly across 20 different UseNet newsgroups. We use the 20news-18828 version (the original data set is available from http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.data.html. 20News-18828 is available from http://people.csail.mit.edu/jrennie/20Newsgroups/) from which cross-posts have been removed to give a total of 18,828 documents. Consequently, 20 Newsgroups is a single labeled dataset with an approximately even class distribution, and the task is to apply one of the 20 possible labels to each test document. We build an initial vocabulary from all words left after stop word, punctuation, and number token removals. Capital letters are transformed to lowercase letters and no stemming algorithm is applied. Here, words are defined as alphabetical strings enclosed by whitespace. The size of the initial vocabulary is 103,135 words.
The second dataset is the Reuters-21578 data collection (the data set is available from: http://kdd.ics.ics.uci.edu/databases/reuters21578/), which contains documents that appeared on the Reuters newswire in 1987 and were manually classified by personnel from Reuters Ltd. For applications in which sufficient numbers of training and test documents are required, the top 10 categories (the 10 categories with the highest number of positive training examples in the ModApte split) are usually used [2]. However, since we want to use a single labeled dataset as stated in Section 2.1, we make a slight modification to the usual top 10 categories; specifically, we eliminate all documents with more than one topic (category); in this way, two categories among the top 10 are excluded. We use documents in the resultant 8 categories as our dataset. Consequently, the task is to apply one of the 8 possible labels to each test document. In contrast to the 20 Newsgroups, the class distribution of this dataset is quite imbalanced; the largest topic category "earn" has 3,923 documents while the smallest "grain" has only 51 documents. The preprocessing used to build an initial vocabulary is the same as for the 20 Newsgroups, and the resulting vocabulary has 22,793 words.
The third test collection is the Industry Sector dataset, which is a collection of corporate web pages organized into hierarchical categories based on what a company produces or does. Although it has a hierarchy with three levels of depth, we do not take the hierarchy into account and use a flattened version of the dataset. This dataset contains a total of 9,555 documents divided into 104 categories. (We obtained the dataset from http://www.cs.umass.edu/mccallum/codedata.html. Because it was found that one of the original 105 categories was empty, the remaining 104 categories having documents were used in our experiments.) We use all 9,555 documents in our experiments without removing the multilabeled documents because the fraction of multilabeled documents is very small and the effect of these documents is negligible (only 15 documents out of 9,555 belong to two classes; thus, they cannot affect our results considerably). The largest and smallest categories have 105 and 27 documents, respectively, and the average number of documents per category is 91.9. For this dataset, we remove HTML tags by skipping all characters between "<" and ">", and we do not use a stop list. The resulting vocabulary has 64,680 words.
The fourth test collection is the TechTC-100 dataset, which is a collection of web pages taken from the web directory of the Open Directory Project (ODP) (the TechTC-100 dataset is available from http://techtc.cs.technion.ac.il/techtc100/). Because this test collection was generated from the web directory in a fully automated manner [20], it is noisier than the other three test collections; for example, textual advertisements included in the web pages can be noise that affects the classification accuracy. The original TechTC-100 dataset contains 100 datasets, each of which consists of positive and negative documents that define a binary classification task, and these positive and negative documents are chosen from pairs of ODP categories. Although there are 100 different combinations of positive and negative categories in the datasets, we only use positive documents because positive documents for each class are sufficient to define a multiclass classification problem. We found 40 distinct positive categories in the 100 pairs of positive and negative categories, and therefore the task becomes applying one of the 40 possible labels to each test document. We did not apply the preprocessing steps to this test collection because it is supplied in a preprocessed plain text format. This collection has a vocabulary of 103,003 words.
For all four datasets, we use 10-fold cross-validation to make maximal use of the data and to allow comparison with the previous work by Allison [9]. The ten obtained values of performance are averaged to give the final result.
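A 10-fold split of document indices can be sketched as below. The shuffling with a fixed seed and the round-robin assignment to folds are assumptions made for the illustration; the text does not state how the folds were formed, and the function name is hypothetical.

```python
import random

def k_fold_indices(n_docs, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)  # shuffling is an assumption
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

splits = list(k_fold_indices(25, k=10, seed=1))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Each document appears in exactly one test fold, and the per-fold accuracies would then be averaged to give the final figure.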

Vector Creation.
To investigate the effect of vocabulary size on classification performance, we use a simple feature selection method based on the collection term frequency as follows. First, we count the collection term frequency, CF, which is the total frequency of each word throughout the entire dataset. Second, we select all words that satisfy CF ≥ $N_0$, where $N_0$ is a predefined integer. The feature selection by CF is one of the simplest methods but is sufficient for the task at hand, namely, comparing different classifiers at each vocabulary size. The resultant vocabulary sizes after feature selection are summarized in Table 1.
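The selection by collection term frequency can be sketched in a few lines. The function name, the threshold argument `n0`, and the toy documents are hypothetical; documents are assumed to be already tokenized.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, n0):
    """Keep every word whose total frequency over the whole dataset is at least n0."""
    collection_frequency = Counter()
    for tokens in tokenized_docs:
        collection_frequency.update(tokens)
    return sorted(w for w, cf in collection_frequency.items() if cf >= n0)

docs = [["gamma", "poisson", "gamma"], ["poisson", "text"], ["gamma"]]
vocabulary = select_vocabulary(docs, n0=2)
print(vocabulary)
```

Raising the threshold shrinks the vocabulary, which is how the different vocabulary sizes in the experiments are obtained.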
Two different types of document vectors, namely, count-valued and normalized, are used to represent each document. A count-valued document vector is constructed from the document term frequency (the number of occurrences of a considered word in a document) for each word, and then each component in the vector is converted by use of (14) to give the normalized vector. The count-valued document vectors are supplied to the classifiers with multinomial, Dirichlet-multinomial, and beta-binomial modeling as training and test data. The normalized document vectors are supplied to the classifier with gamma-Poisson modeling, which inherently requires the normalization. For SVM, we use both types of document vectors and find that the normalized vectors give better classification performance. We will thus show the performance of SVM with only normalized vectors.

Algorithm and Complexity.
The algorithm of our classifier using gamma-Poisson modeling is shown in Algorithm 1 for the training and test phases. In the training phase, the classifier estimates the values of the parameters $\hat{\alpha}$ and $\hat{\beta}$ for each class from the given training vectors; in the test phase, the classifier assigns the most probable label to a given test vector. As seen from the pseudocode in Algorithm 1, the time complexity of our classifier at the training phase with rational approximation of parameters is $O(|D||V|)$, where $D$ and $V$ denote the set of all training documents and the vocabulary (the set of all terms satisfying CF ≥ $N_0$ in a corpus), while the complexity at the test phase is given by $O(|C||V|)$, where $C$ is the set of classes. The classifiers using multinomial and beta-binomial modeling also have the same complexity, $O(|D||V|)$ and $O(|C||V|)$, for the training and test phases, respectively, because they basically have the same code structure in our implementations and the differences are only in the equations for calculating parameters and estimating class-specific probabilities. On the other hand, the classifiers using Dirichlet-multinomial modeling and using gamma-Poisson modeling with iterative approximation, (20), have complexity given by $O(|D||V|N_{\mathrm{it}})$ at the training phase, where $N_{\mathrm{it}}$ is the number of iteration cycles required for the convergence of the parameters. This is usually larger than $O(|D||V|)$ since $N_{\mathrm{it}}$ is typically of the order of ten, and Dirichlet-multinomial modeling is therefore expensive in terms of computation time, as will be confirmed later.
Note that the time complexities described here ($O(|D||V|)$ or $O(|D||V|N_{\mathrm{it}})$ for the training phase and $O(|C||V|)$ for the test phase) are all linearly dependent on the vocabulary size $|V|$. This linear dependence will be confirmed empirically through comparisons of the practical computation times in the next section.

Implementation Issues.
Except for SVM, all the classifiers are implemented in the Java programming language. Supplementary information is as follows.
(i) In the learning phase of the classifier utilizing the Dirichlet-multinomial model, initial values of $\alpha_i$ in the iterative evaluation with (9) are set to $\alpha_i = 0.5$ for all $i$. When an estimated value of $\alpha_i$ is equal to zero, which occurs when the corresponding $i$th term fails to appear in all the training documents in the considered class $c$, we replace the value with $\alpha_i = 1.0 \times 10^{-20}$. This smoothing is similar to the method used by Madsen et al. [7].
(ii) For the classifier with the beta-binomial model, the estimated $\alpha_i$ also becomes zero in the estimation using the method of moments when the corresponding $i$th term fails to appear. As proposed by Allison [9], to prevent any $\alpha_i$ from being zero, we supplement the actual training documents with a pseudodocument in which every word occurs once.
(iii) For the classifier with the gamma-Poisson model, if $x_{ji}$ in (15) and (16) is zero, then we replace the value with $x_{ji} = 0.001$ to prevent the geometric mean $\tilde{x}_i$ defined by (16) from being zero (we preliminarily tried four values of $x_{ji}$ for the replacement, namely, $x_{ji} = 0.1, 0.01, 0.001, 0.0001$, and obtained the best performance when $x_{ji} = 0.001$). Further, if the arithmetic mean given by (15) and the geometric mean given by (16) are equal, which happens when the corresponding $i$th term fails to appear for all $j$ (in all the training documents), we set the affected quantity in (18) to 0.001.
(iv) To compute several special functions, namely, the gamma function in (8) and (13), the beta function in (10), and the digamma and trigamma functions in (20), components offered by the Apache Commons Mathematics Library (the library is available from http://commons.apache.org/proper/commonsmath/) are used.
(v) For the SVM classifier, we use SVM multiclass, which is one of the popular implementations of the multi-class support vector machine (the implementation is available from http://svmlight.joachims.org/svmmulticlass.html). In the training phase of SVM, the trade-off parameter between training error and margin, $C$, is set to 5,000 to obtain high accuracy. For all other parameters, we use default values.
(vi) The experiments are conducted on a PC with a Phenom II X4 3.4 GHz processor and 8 GB of RAM.
(vii) A pilot implementation of the gamma-Poisson classifier in the C programming language is about 2.5 times faster than the one in Java. One should bear this difference in mind when comparing absolute computation times of our classifier with those of SVM, because the SVM classifier is implemented in C and the computation times of our classifier shown in the next section are those of the Java implementation.

Results
In this study, we use the simplest measure of classification performance, that is, accuracy, which is defined as the ratio of the total number of correct decisions to the total number of test documents in the dataset used. Note that for a single labeled dataset and a single labeled classification scheme as in this work, the microaveraged precision and recall are equivalent and hence equal to the F1 measure [23], which we term "accuracy" here.
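The equivalence noted above can be checked numerically on a small example. The sketch below is only an illustration with hypothetical labels and function names: it computes accuracy and the microaveraged precision and recall by pooling true positives, false positives, and false negatives over all classes.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test documents whose predicted label is correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def micro_precision_recall(true_labels, predicted_labels):
    """Microaveraged precision and recall pooled over all classes."""
    classes = set(true_labels) | set(predicted_labels)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(true_labels, predicted_labels):
            tp += (p == c and t == c)
            fp += (p == c and t != c)
            fn += (p != c and t == c)
    return tp / (tp + fp), tp / (tp + fn)

y_true = ["a", "b", "a", "c"]
y_pred = ["a", "a", "a", "c"]
acc = accuracy(y_true, y_pred)
precision, recall = micro_precision_recall(y_true, y_pred)
print(acc, precision, recall)
```

In the single-label setting every misclassified document counts once as a false positive and once as a false negative, so the three quantities coincide.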

Parameter Estimation for Gamma-Poisson Modeling.
We begin by considering the validity of the parameter estimation methods for gamma-Poisson modeling that were described in Section 3.3.As seen in Table 2, the accuracy values obtained for the two estimation methods are almost equivalent, indicating that the precision of the rational approximation and the convergence of the iterative method are both sufficient.
From this result, it is confirmed that both methods are valid for estimating parameters, although in terms of computational speed, the rational approximation is superior to the iterative method. We will show the results only for the rational approximation in the next section.

The aim of this work is to introduce the gamma-Poisson model for text modeling and to show the extent to which it can be appropriately used in text classification tasks. In this sense, the performance comparison between the classifier using the gamma-Poisson model and those using other models is our primary result in this work. Figure 1 shows the performance comparison of various classifiers in the text classification task for the 20 Newsgroups dataset, and Figure 2 shows the same for the Reuters-21578 data collection. The exact vocabulary sizes at each data point in these figures are given in Table 1.

In Figure 1, the best performance is seen for SVM, but the classifiers using the beta-binomial and gamma-Poisson models are almost equivalent and are highly competitive with SVM; they are inferior to SVM only in the range of limited vocabulary size below 20,000 words. The classifiers using the multinomial and the Dirichlet-multinomial models are clearly worse than the other three classifiers in terms of classification accuracy. In Figure 2, SVM is again the best performer. An important difference between Figures 1 and 2 is that the classifier with the gamma-Poisson is superior to that with the beta-binomial in Figure 2, whereas they are almost equivalent in Figure 1. Another point is that the multinomial classifier performs better than the beta-binomial in Figure 2, whereas it performs worse in Figure 1. The classifiers with the Dirichlet-multinomial exhibit the worst performance in both cases.

Figures 3 and 4, respectively, show comparisons of computation times for various classifiers in the tasks for the 20 Newsgroups and Reuters-21578 datasets. Note that the computation time of each classifier is defined here as the sum of the training time and the test time; the former is the time needed to estimate parameters from training vectors, while the latter is the time needed to assign the most probable labels to test vectors. Because 10-fold cross-validation is used, the ratio of the numbers of training to test vectors is 9 : 1, and the sum of these gives the total number of documents in the corpus considered. The 10 values of computation time measured through 10-fold cross-validation are averaged and used as the final result in these figures. In Figures 3(a) and 4(a), the horizontal and vertical axes are on linear scales, whereas in Figures 3(b) and 4(b) they are on logarithmic scales in order to show the lower vocabulary region clearly.

In Figures 3 and 4, the computation times of all the classifiers except SVM show an almost linear dependence on the vocabulary size, which is consistent with the time complexities of these classifiers described in Section 4.3. Clearly, SVM is very fast except in the limited vocabulary region. The classifier using beta-binomial modeling is slower than that using the gamma-Poisson because the method of moments used in the training phase of the beta-binomial classifier is time consuming compared with the rational approximation used in the gamma-Poisson classifier. The Dirichlet-multinomial classifier is the slowest because of the iterative procedures for estimating parameters.
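The timing protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the measurement code used in the experiments: `train_fn` and `predict_fn` are hypothetical callables standing in for any of the classifiers, and the fold assignment (every k-th document) is chosen only for simplicity.

```python
import time


def k_fold_indices(n_docs, k=10):
    """Yield (train_idx, test_idx) pairs; each fold serves once as the test set."""
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def mean_computation_time(docs, labels, train_fn, predict_fn, k=10):
    """Average (training time + test time) over the k folds."""
    totals = []
    for train_idx, test_idx in k_fold_indices(len(docs), k):
        start = time.perf_counter()
        # train_fn is a hypothetical callable: (vectors, labels) -> model
        model = train_fn([docs[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        train_time = time.perf_counter() - start

        start = time.perf_counter()
        # predict_fn is a hypothetical callable: (model, vectors) -> labels
        predict_fn(model, [docs[j] for j in test_idx])
        test_time = time.perf_counter() - start

        totals.append(train_time + test_time)
    return sum(totals) / len(totals)
```

With 10 folds, each split holds nine tenths of the documents for training and one tenth for testing, which is the 9 : 1 ratio mentioned in the text.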

Performance Comparison in Text Classification Tasks for Industry Sector Dataset.
Figures 5 and 6 show the classification accuracy and the computation time, respectively, in text classification for the Industry Sector dataset. As clearly seen in these figures, SVM is still the best performer in terms of classification accuracy; however, it is the worst in terms of computation time. (We first used the binary version of SVM multiclass for Windows and found that it suffers from memory errors when applied to the Industry Sector and the TechTC-100 datasets. To avoid the error, we then used SVM multiclass on Linux, compiled with GCC. The SVM results shown in this study are those obtained on Linux. The error is probably due to insufficient heap memory being available to SVM multiclass on Windows.) The worst computation time of SVM indicates that this dataset has fundamentally different, undesirable characteristics for SVM. As will be discussed in the next section, the slow computation time of SVM is attributable to this dataset not being linearly separable because of its noisy nature. For the Industry Sector dataset, a reasonable and well-balanced choice of classifier is found to be the gamma-Poisson because it achieves the second best accuracy and requires moderate computation time.

Performance Comparison in Text Classification Tasks for TechTC-100.
Figures 7 and 8 show the classification accuracy and the computation time, respectively, for the task on the TechTC-100 dataset. The results are very similar to those seen for the Industry Sector dataset; that is, SVM wins in terms of classification accuracy but loses in terms of computation time. Note that the gamma-Poisson classifier, which gives the second best classification accuracy as in the case of the Industry Sector dataset, is about ten times faster than SVM.

Performance of the Probabilistic Classifiers.
As seen in the previous section, the overall trend in classification performance for the five tested classifiers can be summarized as follows: SVM ≥ gamma-Poisson ≥ beta-binomial, where the symbols ">" and "≥" should be read as "better than" and "better than or equivalent to", respectively. Among the results, we first consider why the gamma-Poisson and the beta-binomial are superior to the Dirichlet-multinomial. The point is that these three classifiers perform differently even though they have similar structures in terms of hierarchical text modeling that utilizes conjugate priors. This result probably arises from a difference in the properties of the conjugate priors used. As mentioned earlier, we cannot specify means and variances separately in the Dirichlet distribution, while separate specification is possible for the beta distribution. Note that separate specification is also possible in the case of the gamma distribution; under the notations used in (12), the mean and variance of the gamma distribution are expressed as

$$E[\lambda] = \frac{\alpha}{\beta}, \qquad V[\lambda] = \frac{\alpha}{\beta^{2}}, \tag{23}$$

and hence

$$\alpha = \frac{E[\lambda]^{2}}{V[\lambda]}, \qquad \beta = \frac{E[\lambda]}{V[\lambda]}. \tag{24}$$

Equations (23) and (24) give us the separate specification, which guarantees a more flexible description of word occurrence in gamma-Poisson modeling compared with the description in Dirichlet-multinomial modeling. Thus, the superiority of the beta-binomial and the gamma-Poisson over the Dirichlet-multinomial can be attributed to their flexibility in describing various patterns of word distribution.
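The separate specification of mean and variance can be illustrated numerically. The sketch below assumes the shape/rate parametrization of the gamma distribution; the function names are illustrative and do not come from the paper. The second function applies the law of total variance to obtain the moments of the resulting gamma-Poisson count, whose variance always exceeds its mean.

```python
def gamma_from_moments(mean, variance):
    """Shape alpha and rate beta of a gamma distribution with the given moments."""
    if mean <= 0 or variance <= 0:
        raise ValueError("mean and variance must be positive")
    alpha = mean * mean / variance
    beta = mean / variance
    return alpha, beta


def gamma_poisson_moments(alpha, beta):
    """Mean and variance of a Poisson count whose mean is gamma(alpha, rate=beta)."""
    mean = alpha / beta
    # Law of total variance: Poisson part (alpha/beta) + gamma part (alpha/beta^2).
    variance = alpha / beta + alpha / (beta * beta)
    return mean, variance


# Two hypothetical words with the same mean rate but different variances.
steady = gamma_from_moments(2.0, 0.5)   # low-variance rate
bursty = gamma_from_moments(2.0, 8.0)   # high-variance rate
print(gamma_poisson_moments(*steady))
print(gamma_poisson_moments(*bursty))
```

Both hypothetical words have the same expected count, yet the second one has a much larger count variance, which a plain Poisson model with a single mean parameter cannot express.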
We next consider the reason that the gamma-Poisson gives somewhat better performance than the beta-binomial. As described in any textbook on probability distributions, the binomial can be approximated by the Poisson when the number of trials goes to infinity and the expected number of successes remains fixed. It is therefore reasonable to expect that the gamma-Poisson is almost equivalent to the beta-binomial but never exceeds it. This expectation, however, is not borne out, as seen in Figures 2, 5, and 7. A possible interpretation of this result is that the better performance of the gamma-Poisson model is attributable to the normalization of document vectors. In the beta-binomial model, each term occurrence is treated as being equally important in the estimation of parameters with the method of moments and in the classification of test documents. However, in vectors normalized in the $L_1$ sense, a word occurrence has a remarkably different weight according to the original document length, and in our experiments these normalized vectors are used as training and test vectors only for the gamma-Poisson and the SVM classifiers. In the vector normalization, a word occurrence in a short document is weighted more heavily than one in a long document. This conversion is reasonable and is considered to bring about the better performance because short documents usually have fewer unnecessary terms that are irrelevant to the topic, and the ratio of informative terms that represent a concept of the topic is higher than in long documents. From this aspect of the vector normalization, further discussion of a condition for improving accuracy is possible. If each document has almost the same length, then the normalization does not change the weight of word occurrences and thus does not contribute to the improvement of accuracy. Therefore, the normalization of document vectors can be effective when the distribution of document length is widely scattered in the dataset considered.
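The effect of the normalization on the weight of a single word occurrence can be seen in a small example. This is a minimal sketch that assumes plain $L_1$ normalization (each count divided by the document length); any additional scaling applied in the experiments is not reproduced here, and the two toy documents are hypothetical.

```python
def l1_normalize(counts):
    """Divide each term count by the document length (the sum of all counts)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


short_doc = [1, 0, 1, 0]    # 2 tokens in total
long_doc = [1, 5, 10, 4]    # 20 tokens in total

# The first term occurs once in both documents, but its normalized weight differs.
print(l1_normalize(short_doc)[0], l1_normalize(long_doc)[0])  # 0.5 0.05
```

One occurrence in the short document carries ten times the weight of one occurrence in the long document. If all documents had the same length, the normalization would rescale every vector by the same factor and the relative weights would not change.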
To confirm this, we introduce a measure that quantifies how many terms are used in a document vector. The measure we tentatively use here is the ratio of nonzero components, defined as

$$\text{(ratio of nonzero components)} = \frac{\#\,\text{nonzero components in the document vector}}{\#\,\text{all components comprising the document vector}}.$$
For this ratio, we expect that when the distribution of the ratio becomes broader, the difference in accuracy between the gamma-Poisson and the beta-binomial will become increasingly evident. Figure 9 shows the distributions of the ratio of nonzero components for the four datasets used; the x-axis represents the ratio and the y-axis the total number of documents that have that ratio of nonzero components. Statistical summaries of these four distributions are given in Table 3. In the table, we also give the values of Δ, which is computed from the classification accuracy of the gamma-Poisson classifier at the full vocabulary size and that of the beta-binomial classifier. Since Δ represents the difference in performance in percent between the gamma-Poisson and the beta-binomial, we can test our hypothesis on the vector normalization described above by examining the correlation between Δ and the other statistical measures.

As confirmed by the values of the interquartile range (IQR) and standard deviation (SD) in Table 3, and as intuitively seen in Figure 9, 20 Newsgroups has the sharpest distribution and, consistently, the smallest Δ. The ratio is more broadly distributed in the Reuters-21578 and TechTC-100 datasets, and Δ thus takes larger values. These results indicate that there is a positive correlation between Δ and the broadness of the distribution, which supports our hypothesis regarding vector normalization. In comparison, the Industry Sector dataset seems to have an unusually large value of Δ in relation to its IQR and SD values. However, this can be explained by the portion of very short documents being the largest in this dataset, which can be clearly seen in Figure 9. Since the improvement of accuracy with the vector normalization is more effective for short documents, the largest Δ for the Industry Sector dataset is consistent with our hypothesis.
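The ratio of nonzero components and the two spread measures can be computed as in the sketch below. It uses only the Python standard library; the quartile convention of `statistics.quantiles` and the use of the sample standard deviation are assumptions, since the conventions behind Table 3 are not stated.

```python
import statistics


def nonzero_ratio(vector):
    """Fraction of vocabulary terms that actually occur in the document."""
    return sum(1 for v in vector if v != 0) / len(vector)


def spread_summary(ratios):
    """Interquartile range and standard deviation of a list of ratios."""
    q1, _median, q3 = statistics.quantiles(ratios, n=4)
    return {"IQR": q3 - q1, "SD": statistics.stdev(ratios)}


# Hypothetical corpora: one with similar documents, one with varied documents.
narrow = [0.24, 0.25, 0.25, 0.26]
broad = [0.10, 0.20, 0.30, 0.40]
print(spread_summary(narrow))
print(spread_summary(broad))
```

Under the hypothesis in the text, a corpus resembling `broad` would be expected to benefit more from vector normalization than one resembling `narrow`.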
The superiority of the gamma-Poisson over the beta-binomial observed for the Reuters-21578, Industry Sector, and TechTC-100 datasets is therefore explained, at least partially, in terms of the effectiveness of the vector normalization.

Table 4 lists our best classification results for 20 Newsgroups using all the words in the initial vocabulary. In the table, data obtained by other authors are also shown for comparison.
As shown in the table, our results agree reasonably well with those by the other authors.A detailed comparison is given below.
(i) Our result for the Dirichlet-multinomial is similar to that reported by Madsen et al. [7], whereas the same model was found to perform worse in the study of Allison [9].This disagreement is attributed to the difference in the estimation of parameters; the iterative methods used in [7] and in this work give, in general, a better estimation than the method of moments used by Allison [9].
(ii) Our result for the beta-binomial is similar to the result reported by Allison [9].This is because we made efforts to ensure maximal comparability with the work of Allison [9] for fair comparison.
(iii) The difference among the three results for SVM probably arises from the difference in the term weighting methods used to create document vectors and from the difference in the parameter settings of SVM. For document vectors, we use vectors normalized in the $L_1$ sense, while the more sophisticated term frequency-inverse document frequency (TF-IDF) weighting was used by Kibriya et al. [22]. All parameters were set to their default values in the study by Allison [9], whereas we used a large value of $C$ to obtain higher accuracy.
(iv) The result of the multinomial in this work is almost identical to the result reported by Joachims [21], while Allison [9] and Madsen et al. [7] reported much worse values. This difference might be attributable to the differences in preprocessing (stop word removal and stemming) and vocabulary size.
Although we obtained relatively high classification accuracy for the multinomial in this work, it is clear that the gamma-Poisson outperforms the multinomial by some margin.

Although the proposed classifier using gamma-Poisson modeling fails to outperform the SVM classifier, as seen in the previous section, we believe that it is still useful for the following two reasons.
(i) The computation time of SVM can become intolerably long, as in Figures 6 and 8 for the Industry Sector and the TechTC-100 datasets, respectively, while the gamma-Poisson classifier offers the second best classification performance with moderate computation times even in these two cases.
(ii) The gamma-Poisson classifier can be conveniently used for a wide range of practical systems in which continuous incremental learning tasks are required.
In the following, we first try to explain the origin of the slow computation times of SVM for the Industry Sector and the TechTC-100 datasets and then describe the effective incremental learning of the gamma-Poisson classifier which is suitable in practical systems.

Slow Computation Time of SVM for Noisy Dataset.
In general, the problem is linearly separable for SVM if each term in the vocabulary is almost unique to one of the possible categories; in this case, the SVM can be trained within a short time. Conversely, if many terms (a large portion of the vocabulary) tend to be dispersed over several categories with finite probabilities, the linear separability decreases and the SVM requires a long computation time to find an optimal hyperplane. For SVM, a dataset is regarded as noisy in the latter case, whereas the computation times of the probabilistic classifiers are not affected by this noisy nature because they include no optimization procedures. Concerning the origin of the noisy nature of the Industry Sector and the TechTC-100 datasets, we can consider the following reasons.
(i) These datasets are generated from web pages.As mentioned earlier, the textual advertisements included in the web pages can make term distributions noisy because the same kind of advertisements tend to appear across multiple categories.
(ii) The numbers of classes, 104 for the Industry Sector and 40 for the TechTC-100, are much larger than those of the other two datasets. This causes the noisy nature because similar concepts (nearly the same topics) are almost equally distributed among several adjacent categories when the categorization has been made in a fine-grained manner for a dataset with a large number of classes.
Figure 10 depicts the noisy nature of the Industry Sector and the TechTC-100 datasets.To obtain the figure, the following procedures were applied.
(1) All the terms in the vocabulary were sorted by the collection term frequency, CF, which is the total frequency of each word throughout the entire dataset.
(2) 2,000 top terms were chosen based on the CF values.
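(3) A probability $P(c_{\mathrm{top}})$ was calculated for each of the top 2,000 terms. $P(c_{\mathrm{top}})$ is regarded as the simple probability estimate for the occurrence of a considered term in the most relevant category. Note that, for example, if the considered term has $P(c_{\mathrm{top}})$ less than 20%, then the term is distributed over at least 6 categories.
(4) The top 2,000 terms were sorted by the value of $P(c_{\mathrm{top}})$ and plotted as in Figure 10, where the x-axis shows the rank of $P(c_{\mathrm{top}})$ and the y-axis shows its value.
In the figure, we tentatively show the number of terms with $P(c_{\mathrm{top}})$ larger than 20%. The result indicates that about three-quarters of the top 2,000 terms are distributed over at least 6 categories for the Industry Sector and the TechTC-100 datasets. Therefore, compared with the other two datasets, these two datasets are noisy for SVM in the sense described above.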
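The procedure behind Figure 10 can be sketched as follows. The defining equation of the probability for the most relevant category is not reproduced in this section, so the estimate used here (the largest per-category count of a term divided by its collection frequency) is an assumption, as are the function names and the input layout (a dictionary mapping each term to a list of per-category counts).

```python
def top_category_probabilities(term_class_counts, n_top=2000):
    """Sorted top-category probabilities for the n_top terms with the largest CF."""
    # Step (1): collection frequency of each term over the whole dataset.
    cf = {term: sum(counts) for term, counts in term_class_counts.items()}
    # Step (2): keep the n_top most frequent terms.
    top_terms = sorted(cf, key=cf.get, reverse=True)[:n_top]
    # Step (3): assumed estimate, share of occurrences in the most relevant category.
    probs = [max(term_class_counts[t]) / cf[t] for t in top_terms if cf[t] > 0]
    # Step (4): sort by value so that the list index corresponds to the rank.
    return sorted(probs, reverse=True)


def count_above(probs, threshold=0.2):
    """Number of terms whose top-category probability exceeds the threshold."""
    return sum(1 for p in probs if p > threshold)


# Hypothetical counts over three categories.
counts = {"bank": [9, 1, 0], "the": [4, 3, 3], "rare": [1, 0, 0]}
probs = top_category_probabilities(counts, n_top=2)
print(probs, count_above(probs, threshold=0.5))
```

The remark about 6 categories follows from a counting argument: if the largest share of a term is below 20%, five categories can account for less than 100% of its occurrences, so the term must appear in at least six.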
The performance of SVM for "noisy" data might be improved by using other kernels instead of the linear kernel utilized in this study; however, the selection of a new kernel and the optimization of kernel parameters then arise as additional tasks. Furthermore, the fast computation time of SVM for a linearly separable dataset, as demonstrated in Figures 3 and 4, cannot be expected with other kernels (SVM multiclass is optimized for the linear kernel so that the runtime scales linearly with the number of training examples by use of a cutting-plane algorithm). By contrast, the performance of the gamma-Poisson classifier is stable, with moderate computation times, even in the case of noisy datasets.

Effective Incremental Learning of the Gamma-Poisson Classifier.
As mentioned above, the gamma-Poisson classifier can be used for a wide array of practical systems in which continuous incremental learning tasks are required. Typical examples are spam-filtering and adaptive news-alert systems. In these systems, a small number of new training documents are continuously provided, and the classifier routinely needs to be retrained with the new training documents. In such an environment, the proposed classifier is more appropriate for practical systems than other, more complex learning models including SVM because it ensures effective incremental learning.
The last expressions in (28) indicate that we can directly update the values of the two statistics from their original values. Because these last expressions can be used for the updates, the summation and the product over all training documents seen in the first expressions of (28) are actually unnecessary; the time complexity of retraining for all terms in the vocabulary using the combination of (28), (17), (18), and (19) is therefore estimated to be only $O(|V|)$. This is a clear advantage of gamma-Poisson modeling for incremental learning for the following reasons.
(i) The time complexity of retraining the classifier using beta-binomial modeling in the same situation is $O(|V||D_c|)$ because it uses the method of moments for the estimation of parameters. This complexity is much higher than in the case of gamma-Poisson modeling.
(ii) The time complexity of retraining the classifier using multinomial modeling is $O(|V|)$, exactly the same as in the case of gamma-Poisson modeling. (If the total counts for each term over the original training vectors are stored in the classifier, then the counts are simply updated by adding the corresponding counts in the newly provided training vector, and the new set of parameters $\{\theta_i\}$ is immediately obtained by using (4) with these updated counts. This procedure results in a complexity of $O(|V|)$.) However, from the comparison of overall performance described above, gamma-Poisson modeling is preferable to multinomial modeling since the former shows better performance.
(iii) In the case of SVM, it has been clarified that the support vectors, which can be regarded as summarized information of the original training vectors, are sufficient for incremental learning [24]. Within this framework, the new set of training vectors becomes the union of the support vectors obtained from the original training vectors and the one newly provided training vector; we can thus retrain SVM with this new training set for incremental learning. The time for training an SVM is dominated by the time for solving the underlying quadratic programming problem, and so the theoretical and empirical complexity varies depending on the method used to solve it. Although the number of support vectors is much smaller than the number of original training vectors, quadratic optimization is considered to be much slower than simply counting terms [25], as is done in multinomial and gamma-Poisson modeling.
The last expressions in (28) also imply that storing the set of all training vectors is unnecessary for incremental learning because the stored values of the two statistics for each term are sufficient for the update. This indicates that the gamma-Poisson classifier is suitable for incremental learning in terms of not only time complexity but also space complexity.
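The constant-memory update discussed above can be sketched as follows. The exact definitions in (15), (16), and (28) are not reproduced in this section, so this sketch assumes that the two stored statistics are the arithmetic mean and the mean of logarithms (the logarithm of the geometric mean) of each term weight, which matches the summation and the product mentioned in the text. The class name `TermStats` and the smoothing constant `eps`, added so that the logarithm stays finite for zero weights, are assumptions of this illustration.

```python
import math


class TermStats:
    """Running statistics for one class, with one entry per vocabulary term."""

    def __init__(self, vocab_size):
        self.n_docs = 0
        self.mean = [0.0] * vocab_size       # arithmetic mean of each term weight
        self.mean_log = [0.0] * vocab_size   # mean of log(weight + eps)

    def add_document(self, vector, eps=1e-6):
        """Fold one new training vector into the statistics in O(|V|) time."""
        n = self.n_docs
        for i, x in enumerate(vector):
            # Only the old statistic and the new value are needed;
            # the earlier training vectors are never revisited.
            self.mean[i] = (n * self.mean[i] + x) / (n + 1)
            self.mean_log[i] = (n * self.mean_log[i] + math.log(x + eps)) / (n + 1)
        self.n_docs = n + 1


stats = TermStats(vocab_size=2)
stats.add_document([1.0, 2.0])
stats.add_document([3.0, 4.0])
print(stats.mean)
```

After each update, the class parameters could be re-estimated from the stored statistics alone, so neither the running time nor the memory footprint grows with the number of training documents already seen.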

Conclusions and Future Work
In this paper, we have proposed a novel classifier in which the gamma-Poisson distribution is utilized as a new tool for text modeling. The gamma-Poisson was introduced in order to improve the insufficient description of word occurrences by the original Poisson distribution. The framework of gamma-Poisson modeling of texts and the construction of a classifier using the framework were demonstrated with practical techniques for parameter estimation and vector normalization. The efficiency of the proposed classifier was examined through experiments on automatic text categorization of the 20 Newsgroups, Reuters-21578, Industry Sector, and TechTC-100 datasets. For comparison, classifiers using three other distributions, namely, the multinomial, Dirichlet-multinomial, and beta-binomial distributions, were also applied to the same datasets, and in addition we used SVM as a standard discriminative classifier with state-of-the-art classification accuracy. From the results, it was found that the proposed classifier with the gamma-Poisson model shows classification performance comparable with that of SVM. The origin of the superiority of the proposed gamma-Poisson modeling was discussed in terms of its flexibility in describing various patterns of word distributions and the effectiveness of vector normalization. We also showed that the proposed classifier is well suited to applications in which continuous incremental learning tasks are required.
At present, the analysis of classification performance for the various classifiers remains unsatisfactory because the results are interpreted only qualitatively.We consider that further quantitative discussion is possible through analysis of the decision functions of the classifiers.An investigation along the line of such quantitative analysis is reserved for future research.

Figure 1: Performance of various classifiers for 20 Newsgroups dataset.
In Figures 3(a) and 4(a), the horizontal and vertical axes are on linear scales, whereas in Figures 3(b) and 4(b) they are on logarithmic scales in order to show the lower vocabulary region clearly.

Figure 5: Classification performance of various classifiers for the Industry Sector dataset.

Figure 6: Computation time of text classification task for Industry Sector dataset.

Figure 10: Probability of the occurrence of a considered term in the most relevant category, $P(c_{\mathrm{top}})$, for the most frequent 2,000 terms.
$\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_{|V|})$ is a parameter vector of the Dirichlet distribution whose components determine the density of the $\boldsymbol{\theta}$ vector, and where $\Gamma(\cdot)$ is the gamma function. The likelihood of a document $\boldsymbol{x} = (x_1, x_2, \ldots, x_{|V|})$, whose length is $\sum_{i=1}^{|V|} x_i$, is given as an integral over $\boldsymbol{\theta}$ vectors weighted by a Dirichlet distribution.

(i) ... represents the probability of the number of emissions (i.e., the count) of an individual word.
(ii) The probability of the number of emissions is the average of a product $P(x_i \mid \theta_i)\,P(\theta_i \mid \alpha_i, \beta_i)$ over $\theta_i$, where $\theta_i$ is the probability for the emission of the $i$th word $w_i$, $P(x_i \mid \theta_i)$ is the binomial distribution representing the probability of $x_i$ occurrences of $w_i$, and $P(\theta_i \mid \alpha_i, \beta_i)$ is the beta-distributed weighting function of $\theta_i$.

The mass function of the Poisson distribution describing the probability of $x_i$ occurrences of the $i$th word is $P(x_i \mid \lambda_i) = \lambda_i^{x_i} e^{-\lambda_i} / x_i!$.

The gamma-Poisson distribution is a Poisson distribution whose mean parameter $\lambda$ follows a gamma distribution with a shape parameter $\alpha$ and a rate parameter $\beta$.

Table 1: Vocabulary size obtained by feature selection with CF.

03 end for
04 return argmax_c score[c]

Algorithm 1: Algorithm of the classifier using gamma-Poisson modeling. In the training phase, the procedure of learning over $D$ (the entire set of training documents) is given, while in the test phase, the procedure to classify one test document $d$ is described. $D_c$ is the set of training documents belonging to a class $c$, and $\boldsymbol{x}_d$ is a document vector in $D_c$. Note that $\sum_c |D_c| = |D|$, and thus the time complexity of the training phase is estimated as $O(|V||D|)$. For the test phase, the complexity is found to be $O(|V||C|)$ because $\log P(d \mid \boldsymbol{\alpha}_c, \boldsymbol{\beta}_c)$ in line 02 of the pseudocode is calculated through the entire summation over $|V|$ (see (13)).
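The test phase summarized in this caption can be sketched as follows. This is a minimal illustration rather than a transcription of Algorithm 1: it uses the standard closed form of the gamma-Poisson mass function with shape `alpha` and rate `beta`, evaluated through `math.lgamma` so that non-integer (normalized) term weights are also accepted, and it omits any class prior. The names `gamma_poisson_log_pmf`, `classify`, and the layout of `params` are assumptions of this illustration.

```python
import math


def gamma_poisson_log_pmf(x, alpha, beta):
    """log P(x | alpha, beta) for a Poisson whose mean is gamma(shape=alpha, rate=beta)."""
    return (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1.0)
            + alpha * math.log(beta / (beta + 1.0))
            - x * math.log(beta + 1.0))


def classify(vector, params):
    """Return the class with the highest summed log-likelihood.

    params maps a class label to a list of (alpha, beta) pairs,
    one pair per vocabulary term.
    """
    scores = {}
    for label, term_params in params.items():
        # One pass over the vocabulary for every class.
        scores[label] = sum(gamma_poisson_log_pmf(x, a, b)
                            for x, (a, b) in zip(vector, term_params))
    return max(scores, key=scores.get)


# Hypothetical two-class, two-term model.
params = {"a": [(8.0, 4.0), (1.0, 4.0)],
          "b": [(1.0, 4.0), (8.0, 4.0)]}
print(classify([3, 0], params))
```

The nested loop over classes and vocabulary terms makes the cost of classifying one document proportional to the vocabulary size times the number of classes.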

Table 2: Classification accuracy on the 20 Newsgroups dataset with two different methods for estimating parameters for the gamma-Poisson distribution. Values are shown as accuracy ± σ, where σ is the standard deviation calculated through 10-fold cross-validation.

Table 3: Statistical summary of the distribution of nonzero components for the four datasets used here. IQR and SD denote interquartile range and standard deviation, respectively.
(3) A probability $P(c_{\mathrm{top}})$ was calculated for each of the top 2,000 terms. $P(c_{\mathrm{top}})$ is regarded as the simple probability estimate for the occurrence of a considered term in the most relevant category. Note that, for example, if the considered term has $P(c_{\mathrm{top}})$ less than 20%, then the term is distributed over at least 6 categories.

Table 4: Classification results for the 20 Newsgroups dataset. For our results, we show the data obtained with all the words in the initial vocabulary.
To show the effectiveness, we consider a situation in which one new training document belonging to a class $c$ with a document vector $\boldsymbol{x}' = (x'_1, x'_2, \ldots, x'_{|V|})$ is supplied to our classifier. In this case, the set of training vectors becomes the union of the original training vectors $D_c = \{\boldsymbol{x}_d\}$ $(1 \le d \le |D_c|)$ and the one newly supplied vector $\boldsymbol{x}'$. The retraining of the classifier corresponds to estimating a new set of parameters $\{\alpha_i\}$ and $\{\beta_i\}$ $(1 \le i \le |V|)$ from the $|D_c| + 1$ training vectors. Referring to the definitions of the two statistics ((15) and (16)), we can easily verify that their new values for the $|D_c| + 1$ training vectors can be expressed in terms of their original values, as given in (28).