AdaBoost is an excellent committee-based tool for classification. However, its effectiveness and efficiency in multiclass categorization are challenged by methods based on the support vector machine (SVM), neural networks (NN), naïve Bayes, and decision trees.
Machine learning- (ML-) based text categorization (TC) can be defined, similarly to other data classification tasks, as the problem of approximating an unknown category assignment function \(F: D \times C \to \{0, 1\}\), where \(D\) is the set of documents and \(C\) is the set of predefined categories.
The approximating function learned from the training data is called the classifier.
In text categorization projects, documents usually need to be preprocessed to select suitable features; each document is then represented by its features. After these steps, the classifier determines the category of the document. The flow chart of a TC task is shown in Figure
Flow chart of text categorization.
Depending on the task, preprocessing involves some or all of the following steps: transforming unstructured documents into a structured or semistructured format, word segmentation, and text feature selection. Feature selection is the most important part of preprocessing [
Imagine an international IT corporation that is interested in job seekers’ Java programming experience and English ability. The resume screening program of this company is actually a TC system: it helps managers choose appropriate employees. The system is shown in Figure
An instance of a TC system.
Researchers have made considerable achievements in the design of categorization algorithms because the classifier is the key component of a TC system [
In short, the goal of classifier design and research in TC is to improve performance while maintaining the balance between performance and cost.
The rest of this paper is organized as follows. Section
Voting-based categorization algorithms, also known as classifier committees, can adjust the number and expertise of the “experts” in the committee to find a balance between performance and computational cost. These algorithms give up the effort to build a single powerful classifier and instead integrate the views of many weak classifiers. The philosophical principle of this methodology is
Unlike bagging, which trains the classifiers in parallel, boosting trains the classifiers sequentially. Before the next classifier is trained, the training set is reweighted to allocate greater weight to the documents that were misclassified by the previous classifiers [
The original boosting algorithm uses three weak classifiers
Researchers have worked to enhance performance and reduce overhead, so many improved boosting algorithms, such as BrownBoost, LPBoost, LogitBoost, and AdaBoost, have been proposed. Most comparative studies show that AdaBoost performs best among them [
Boosting and its related algorithms have achieved great success in several applications such as image processing, audio classification, and optical character recognition (OCR). At the same time, boosting needs huge training sets, so the runtime cost sometimes becomes unacceptable. Moreover, a lower bound on the weak classifiers’ accuracy must be predicted in advance.
To control the computational cost within a reasonable range, Schapire and Singer [ ] proposed the following procedure.
(1) Given the training set \(\{(x_1, y_1), \dots, (x_m, y_m)\}\), let \(y_i \in \{-1, +1\}\) be the category label of document \(x_i\).
(2) Define the initial distribution of documents in the training set as \(D_1(i) = 1/m\).
(3) Search for the weak classifier \(h_t\) that minimizes the weighted error \(\epsilon_t = \sum_{i} D_t(i)\,[h_t(x_i) \neq y_i]\) under the current distribution \(D_t\).
(4) Choose \(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\).
(5) Recalculate the weights of the samples: \(D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t\), where \(Z_t\) is a normalization factor.
(6) Repeat the steps above for \(t = 1, \dots, T\) and combine the weak classifiers according to their weights to construct the strong classifier: \(H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)\).
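For concreteness, the loop above can be rendered as a minimal Python sketch. The `train_weak` callable and the `.predict` interface are illustrative assumptions, not the authors’ implementation:

```python
import numpy as np

def adaboost_train(X, y, train_weak, T):
    """Train T weak classifiers sequentially, boosting the weight of the
    documents misclassified by earlier classifiers (discrete AdaBoost)."""
    m = len(y)
    D = np.full(m, 1.0 / m)             # step (2): uniform initial distribution
    ensemble = []
    for _ in range(T):
        h = train_weak(X, y, D)         # step (3): fit a weak classifier under D
        pred = h.predict(X)             # predictions in {-1, +1}
        eps = float(np.sum(D[pred != y]))
        if eps >= 0.5:                  # weak learner no better than chance: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / max(eps, 1e-12))  # step (4)
        D *= np.exp(-alpha * y * pred)  # step (5): raise weight of misclassified docs
        D /= D.sum()                    # normalize by Z_t
        ensemble.append((alpha, h))
    return ensemble

def adaboost_predict(ensemble, X):
    """Strong classifier: sign of the alpha-weighted vote, step (6)."""
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))
```

Stopping when the weighted error reaches 1/2 reflects the prerequisite that each weak classifier must beat random guessing.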
The algorithm above enhances training set utilization by adjusting the weights of misclassified texts [
Researchers have proposed variants of AdaBoost focusing on different aspects such as precision, recall, robustness, computational overhead, and multiclass categorization [
Performances of AdaBoost family members.
Figure
To solve the problems above, we design weak classifiers with high accuracy and low complexity so that the number of experts is limited, keeping the precision while reducing the cost. A more professional expert should play a more important role, and misclassified documents should attract greater attention, to further improve the system’s performance. Therefore, more reasonable rules are needed to combine the weak classifiers into a strong classifier efficiently. In addition, this strong classifier should be directly usable in multiclass classification tasks. These points form the motivation and purpose of this paper.
Theoretically, once the weak classifiers are more accurate than random guessing (1/2 in two-class tasks or \(1/|C|\) in \(|C|\)-class tasks), boosting can drive the training error of the combined strong classifier arbitrarily low.
Some researchers have tried to design weak classifiers based on more powerful algorithms such as neural networks [
Example-based classification algorithms strike a balance between performance and cost [
To classify a document, the \(k\)-nearest neighbor (kNN) algorithm finds the training documents closest to it and assigns the category held by the majority of those neighbors.
Schematic of two-class kNN classification.
The distance between two documents is calculated by a distance function:
\[ d(d_i, d_j) = \sqrt{\sum_{t=1}^{n} (w_{it} - w_{jt})^2}, \]
where \(w_{it}\) is the weight of the \(t\)-th feature in document \(d_i\). As shown, this function calculates the Euclidean distance between two documents in a linear space. The \(k\) nearest neighbors are then chosen, and the document is assigned to the category that the majority of them belong to.
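A minimal kNN classifier over this Euclidean distance might look like the sketch below; the dense feature-vector representation and simple majority tie-breaking are assumptions made for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(doc_vec, train_vecs, train_labels, k):
    """Classify one document vector by majority vote of its k nearest neighbors."""
    dists = np.sqrt(((train_vecs - doc_vec) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(dists)[:k]                             # indices of the k closest docs
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]                           # majority category
```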
Two main problems in traditional kNN limit its effectiveness: the value of \(k\) is fixed a priori, and the neighbor search is computationally expensive.
An adaptive group-based kNN (AG-kNN) algorithm is designed to address these problems.
Define
According to (
The value of
The detailed working steps are shown in Figure
Detailed working steps of AG-kNN.
In this way, the algorithm can set the value of \(k\) adaptively instead of fixing it in advance.
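Since the defining equations are not reproduced here, the exact grouping rule of AG-kNN is left open; the sketch below shows only one plausible reading, in which candidate neighbors are grouped by category and the group with the smallest mean distance wins, so the effective k adapts to the local data:

```python
import numpy as np
from collections import defaultdict

def ag_knn_classify(doc_vec, train_vecs, train_labels, k_max):
    """Hypothetical adaptive group-based kNN (one plausible reading of AG-kNN).
    Neighbors are grouped by category, and each group is scored by the mean
    distance of its members, so k effectively adapts to the local density."""
    dists = np.sqrt(((train_vecs - doc_vec) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k_max]               # candidate neighborhood
    groups = defaultdict(list)
    for i in nearest:
        groups[train_labels[i]].append(dists[i])      # group neighbors by category
    # the group with the smallest average distance wins the vote
    return min(groups, key=lambda c: np.mean(groups[c]))
```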
Therefore, the main problems that limit the effectiveness of the original kNN are alleviated.
Weak classifier design is critical for differentiating positive samples from negative samples in the training set. The precision of the weak classifiers must be better than 50% to ensure the convergence of the strong classifier. Therefore, a threshold is introduced.
Define the weight of positive document
The accuracy of the weak classifiers can be maintained above 0.5 by introducing and updating the threshold as follows:
(1) Calculate the threshold.
(2) Call AG-kNN to classify the training documents.
(3) Calculate the classification error.
(4) Randomly choose candidates and compare their errors with the threshold.
(5) Update the threshold.
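One way to realize these steps in code is sketched below; since the paper’s threshold formula is not reproduced here, the random choice of k and the tightening rule are assumptions:

```python
import numpy as np

def select_weak_classifier(X, y, D, make_agknn, theta=0.5, trials=20, seed=0):
    """Hypothetical rendering of the threshold loop. make_agknn(k) returns a
    fitted AG-kNN-style model with .predict(X) -> labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(trials):
        k = int(rng.integers(1, 16))           # step (4): randomly choose a candidate k
        pred = make_agknn(k).predict(X)        # step (2): call AG-kNN
        err = float(np.sum(D[pred != y]))      # step (3): weighted classification error
        if err < theta:                        # compare the error with the threshold
            theta = err                        # step (5): tighten the threshold
            best = (k, pred, err)
    if best is None:
        raise RuntimeError("no candidate beat the 0.5 accuracy threshold")
    return best
```

Starting from `theta=0.5` guarantees that any accepted weak classifier is more accurate than random guessing.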
Whether the strong classifier performs well depends largely on how the weak classifiers are combined. To build a powerful strong classifier, basis classifiers with higher precision must take more responsibility in the categorization process. Therefore, the categorization system should distinguish among the performances of the weak classifiers and assign them weights according to their capabilities. Using these weights, boosting algorithms can integrate the weak classifiers into the strong classifier more efficiently and achieve excellent performance [
The original AdaBoost algorithm uses a linear weighting scheme to generate the strong classifier. In AdaBoost, the strong classifier is defined as
\[ H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right), \]
where \(h_t\) is the \(t\)-th weak classifier and \(\alpha_t\) is its weight.
Given the training documents and category labels \((x_1, y_1), \dots, (x_m, y_m)\) with \(y_i \in \{-1, +1\}\):
(1) Initialize the weights \(D_1(i) = 1/m\).
(2) Select the weak classifier with the smallest weighted error: \(\epsilon_t = \sum_i D_t(i)\,[h_t(x_i) \neq y_i]\).
(3) Prerequisite: \(\epsilon_t < 1/2\), so that the training error of the strong classifier is upper bounded.
(4) Select \(\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}\), optimizing the bound.
(5) Reweight as \(D_{t+1}(i) = D_t(i)\exp(-\alpha_t y_i h_t(x_i))/Z_t\).
The steps above demonstrate that AdaBoost automatically gives higher weights to classifiers with better classification performance, especially through step (4).
However, this weighting algorithm does not check the precision of earlier classifiers against later training documents. In other words, strong classifier generation is a single iterative process. Weak classifiers may perform differently on different training samples: the weak classifiers that AdaBoost deems deserving of higher weights actually perform better on the earlier part of the training set, while basis classifiers that perform well on the later part may be unreasonably ignored. Therefore, the credibility of the weights decreases along the test sequence. This phenomenon can be called weight bias.
To overcome these drawbacks, the boosting algorithm should use a double iterative process to allocate weights to the basis classifiers more reasonably.
In AdaBoost, weak classifiers with higher weights can certainly process correctly the documents that were misclassified by lower-weight classifiers. This is important but not sufficient for improving categorization performance, because two crucial questions are ignored. Can basis classifiers with higher weights also classify, with high accuracy, the samples that are already correctly categorized by lower-weight classifiers? And what if the lower-weight weak classifiers likewise lack the power to process documents that are misclassified by the high-weight classifiers?
The credibility of the weights is reduced when the answers to these two questions are not absolutely positive. Therefore, it is worth introducing the aforementioned problems into the weak classifiers’ weight allocation.
This paper proposes a double iterative weighted cascading (DIWC) algorithm to solve the two problems above and make the utilization of the basis classifiers more efficient. The core idea of DIWC is to add a second weighting pass in which the training samples are input in reverse order; compared with the original AdaBoost algorithm, this process can be called double iterative. The average of the weights a basis classifier obtains in the two weighting passes is used as its final weight. Replacing the weight used in traditional AdaBoost with this two-iteration average avoids the weight bias problem because it takes the two questions above into account: it defines “powerful” basis classifiers using not only the former part but the full training sample. The sketchy procedure chart of DIWC is shown in Figure
Procedure of DIWC algorithm.
DIWC achieves the weighting procedure shown in the figure above through the following steps:
(1) Forward iteration: train the basis classifiers on the training samples in the original order, beginning with the first classifier.
(2) Calculation: calculate the weights of the basis classifiers according to the first round of loops (trainings).
(3) Reverse iteration: train the basis classifiers on the training samples in reverse order.
(4) Calculation: calculate the weights of the basis classifiers according to the second round of loops (trainings).
(5) Calculate the final weights of the basis classifiers from the weights obtained in steps (2) and (4).
(6) Cascade: combine the basis classifiers according to their final weights and construct the strong classifier.
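A compact sketch of this double pass follows, under the assumption that each basis classifier is paired with one chunk of the training stream and scored in both stream orders; this is an illustrative reading, not the authors’ exact procedure:

```python
import numpy as np

def chunk_alphas(classifiers, chunks):
    """One weighting pass: the t-th basis classifier is scored on the t-th
    chunk of the training stream, so its alpha reflects that part of the data."""
    alphas = []
    for h, (Xc, yc) in zip(classifiers, chunks):
        eps = float(np.mean(h.predict(Xc) != yc))
        eps = min(max(eps, 1e-12), 1.0 - 1e-12)   # keep the log finite
        alphas.append(0.5 * np.log((1.0 - eps) / eps))
    return np.array(alphas)

def diwc_weights(classifiers, chunks):
    """Score the classifiers against the training stream in both orders and
    average the two alphas per classifier (steps (1)-(5) above)."""
    forward = chunk_alphas(classifiers, chunks)
    backward = chunk_alphas(classifiers, chunks[::-1])   # reverse iteration
    return 0.5 * (forward + backward)
```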
Three methods can be used to calculate the final weights, with different accuracy and complexity.
The first method is quite simple: calculate the arithmetic mean of the weights from the two iterative loops and use it as the weak classifier’s final weight. This method has a very low computational cost. In this paper, it is called DIWC-1.
Note that some basis classifiers may have a very high weight in both the first and the second rounds of loops. This means these classifiers have globally high categorization ability and should play a more important role in the classification process than the simple average weight would give them; in this case, an upper bound value is set as the final weight of significantly powerful classifiers. On the other hand, some classifiers may have a very low weight in both iterative loops, and their influence must be limited by a lower bound value to enhance the system’s accuracy. This method spends more time on computing but has higher precision. It is called DIWC-2.
The third method concerns the situation in which a weak classifier has a very high weight in one round of loops but a very low weight in the other. One more iterative pass is then needed to determine the final weight. In particular, if the variance of the weights over the three rounds is significantly large, the system considers the weak classifier oversensitive to noise and reduces its weight. This method achieves the best precision and robustness, but its training cost is also the highest. We call it DIWC-3 in this paper.
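The three final-weight rules can be summarized as below; the bound values and the variance gate are hypothetical constants, since the paper’s concrete settings are not given here:

```python
import numpy as np

def final_weight(w1, w2, w3=None, upper=2.0, lower=0.1, var_gate=0.25):
    """Sketch of the three final-weight rules; upper, lower, and var_gate are
    hypothetical constants, not values from the paper."""
    mean2 = 0.5 * (w1 + w2)
    # DIWC-1: plain arithmetic mean of the two passes
    diwc1 = mean2
    # DIWC-2: clamp globally strong or weak classifiers to the bounds
    if w1 > upper and w2 > upper:
        diwc2 = upper
    elif w1 < lower and w2 < lower:
        diwc2 = lower
    else:
        diwc2 = mean2
    # DIWC-3: when the passes disagree strongly, a third pass decides; a large
    # three-pass variance marks the classifier as noise sensitive and demotes it
    if w3 is not None:
        ws = np.array([w1, w2, w3])
        diwc3 = ws.mean() * (0.5 if ws.var() > var_gate else 1.0)
    else:
        diwc3 = diwc2
    return diwc1, diwc2, diwc3
```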
The computational complexity of DIWC-1, DIWC-2, and DIWC-3 can be calculated by referring to (
In DIWC-2, the weights from the two iterative passes are compared against an upper bound
DIWC-3 considers not only the upper and lower bounds but also the difference between the weights from the two iterative loops. When the weights determined in the two loops differ greatly, a third loop may be needed for the final decision. Similar to DIWC-2, the range of its runtime complexity
As per the analysis above, the computational complexity is proportional to
As per the review in Section
In fact, all training documents that are categorized incorrectly should be gathered into an error set, which is then used to train every basis classifier. Accuracy improves further when the training documents are used in this way, and the implementation is quite convenient. Integrating this method with DIWC-1, DIWC-2, and DIWC-3 yields the complete double iterative weighted cascading algorithm. The pseudocode of DIWC is shown in Algorithm
Algorithm: pseudocode of the DIWC algorithm.
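As a sketch of the error-set idea described above (the collection rule shown is an assumption, since the original pseudocode is not legible here):

```python
import numpy as np

def collect_error_set(classifiers, X, y):
    """Gather every training document that at least one basis classifier
    misclassified, for an extra training round on the hard cases."""
    wrong = np.zeros(len(y), dtype=bool)
    for h in classifiers:
        wrong |= (h.predict(X) != y)   # mark documents this classifier got wrong
    return X[wrong], y[wrong]

# The error set is then fed back to every basis classifier for another
# training round before the DIWC weighting is applied (assuming the weak
# learners expose an incremental fit method).
```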
Most members of the AdaBoost family are oriented toward two-class classification tasks. When solving a multiclass problem, they often transform it into multiple two-class problems. These algorithms tend to have shortcomings in accuracy or efficiency and find it difficult to achieve perfection in multiclass categorization tasks. However, multiclass classification is a main problem in categorization tasks. On many occasions, simple two-valued logic (yes or no) can hardly satisfy the requirements of categorization tasks. For instance, a news report may belong to politics, economics, sports, culture, scientific discovery, or entertainment. In other words, processing multiclass classification tasks with higher performance should be the most important goal of boosting-based algorithm development.
As per the kernel algorithm of the weak classifiers, kNN extends naturally to multiclass problems: the category of a document is decided by the vote of its nearest neighbors,
\[ c(d) = \arg\max_{c_j \in C} \sum_{d_i \in N_k(d)} \operatorname{sim}(d, d_i)\, y(d_i, c_j), \]
where \(N_k(d)\) is the set of the \(k\) nearest neighbors of \(d\) and \(y(d_i, c_j) \in \{0, 1\}\) indicates whether \(d_i\) belongs to \(c_j\).
This function reveals that kNN handles multiclass categorization directly, without decomposing the task into multiple two-class problems.
Traditional text categorization research often uses the Euclidean distance or the Manhattan distance to measure the similarity between samples. However, when faced with multiclass categorization problems, these distance definitions cannot distinguish the importance among the feature weights effectively [
In this way, the importance among the weights can be distinguished effectively [
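The improved distance itself sits in an equation not reproduced here; purely as an assumption, a feature-weighted Euclidean distance with the stated effect could look like this:

```python
import numpy as np

def weighted_distance(a, b, feature_weights):
    """Hypothetical feature-weighted Euclidean distance: features with larger
    weights dominate the comparison, so their importance is preserved."""
    diff = a - b
    return float(np.sqrt(np.sum(feature_weights * diff * diff)))
```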
According to the analysis in the previous subsection, the weak classifiers in this paper can be used directly in multiclass classification problems. However, performance can be further improved by changing the way the strong classifiers are used.
In the AdaBoost family, a strong classifier tends to be used directly to solve a two-class problem or independently to divide a multiclass problem into several two-class problems. This is probably the simplest way but certainly not the best, because the accuracy of two-class categorization cannot be further enhanced in the strong-classification step, and the complexity of the multiclass categorization problem cannot be constrained efficiently.
Several strong classifiers can work together to solve the problems above. In this paper, we propose a method of cascading strong classifiers to further improve precision and limit cost in multiclass classification tasks.
The method of integrating strong classifiers can be explained clearly with an example. For instance, we can use four strong classifiers in series to determine which category a document belongs to. When they make the same judgment, it is used as the final result; when they produce different results, the principle of majority voting is applied.
Work logic of cascading strong classifiers.
Integrating the strong classifiers in series can improve the classification accuracy because the Cramér-Rao bound is lower in this situation [
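A minimal sketch of this serial integration follows, assuming majority voting as the tie-break (the original rule is truncated above):

```python
from collections import Counter

def cascade_classify(strong_classifiers, doc):
    """Run strong classifiers in series; unanimous agreement is final,
    otherwise fall back to a majority vote (assumed tie-break rule)."""
    results = [clf(doc) for clf in strong_classifiers]
    if len(set(results)) == 1:                     # all classifiers agree
        return results[0]
    return Counter(results).most_common(1)[0][0]   # majority decision
```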
The novel text categorization tool in this paper, adaptive group-based kNN combined with double iterative weighted cascading (AG-kNN + DIWC), is evaluated experimentally below.
A text categorization system based on AG-kNN and DIWC was implemented for the experiments.
In the training step, we use a document set, selected from standard corpora, in which each document is slightly more or less than 2 KB, to test the time needed for modeling with AG-kNN.
Time consumption for different numbers of neighbors and groups.
In Figure
As shown in the chart above, the time consumption increases when the number of neighbors or groups increases. Note that logarithmic coordinates are used in the figure, so the time consumption grows significantly with these parameters.
To compare AG-kNN with other algorithms, we measured their training times on the same document sets.
Training time of different algorithms.
Time consumption changes with the parameters and with the way the strong classifiers are combined. The training times of DIWC-1, DIWC-2, and DIWC-3 for different training set sizes are shown in Figure
Figure
AdaBoost is a large algorithm family. We chose its most classic and most efficient members (original AdaBoost, AdaBoost.M1, AdaBoost.MR, and AdaBoost.ECC) to evaluate the runtime complexity of the novel algorithms proposed in this paper. We used the same training set as in the previous experiment, and the result is shown in Figure
Time consumption of the AdaBoost family.
It is clearly shown in Figure
It should be noted that the difference in efficiency between AG
An experiment was conducted to evaluate the performance of the system. The Chinese news corpus provided by Sogou Labs [
Precision comparison.
Algorithms | Economics | Politics | Sports | Weather | Entertainment | Culture
AdaBoost | 0.848 | 0.855 | 0.851 | 0.860 | 0.851 | 0.859 |
AdaBoost.M1 | 0.857 | 0.859 | 0.863 | 0.847 | 0.858 | 0.866 |
AdaBoost.MR | 0.854 | 0.862 | 0.847 | 0.865 | 0.855 | 0.862 |
AdaBoost.ECC | 0.848 | 0.854 | 0.841 | 0.843 | 0.840 | 0.856 |
Naïve Bayes | 0.769 | 0.794 | 0.783 | 0.806 | 0.811 | 0.772 |
SVM | 0.867 | 0.862 | 0.870 | 0.877 | 0.865 | 0.871 |
Neural network | 0.832 | 0.807 | 0.819 | 0.824 | 0.828 | 0.803 |
Decision tree | 0.809 | 0.792 | 0.786 | 0.831 | 0.799 | 0.812 |
AG-kNN + DIWC-1 | 0.887 | 0.894 | 0.882 | 0.911 | 0.877 | 0.893
AG-kNN + DIWC-2 | 0.899 | 0.906 | 0.903 | 0.921 | 0.898 | 0.902
AG-kNN + DIWC-3 | 0.918 | 0.905 | 0.917 | 0.924 | 0.903 | 0.907
Recall comparison.
Algorithms | Economics | Politics | Sports | Weather | Entertainment | Culture
AdaBoost | 0.852 | 0.857 | 0.849 | 0.863 | 0.859 | 0.861 |
AdaBoost.M1 | 0.852 | 0.864 | 0.863 | 0.877 | 0.862 | 0.865 |
AdaBoost.MR | 0.858 | 0.863 | 0.849 | 0.866 | 0.851 | 0.867 |
AdaBoost.ECC | 0.851 | 0.844 | 0.845 | 0.850 | 0.846 | 0.849 |
Naïve Bayes | 0.761 | 0.798 | 0.782 | 0.804 | 0.817 | 0.805 |
SVM | 0.868 | 0.865 | 0.874 | 0.874 | 0.867 | 0.876 |
Neural network | 0.834 | 0.809 | 0.823 | 0.811 | 0.825 | 0.807 |
Decision tree | 0.815 | 0.798 | 0.784 | 0.819 | 0.799 | 0.813 |
AG-kNN + DIWC-1 | 0.897 | 0.888 | 0.890 | 0.913 | 0.885 | 0.905
AG-kNN + DIWC-2 | 0.909 | 0.914 | 0.914 | 0.922 | 0.891 | 0.908
AG-kNN + DIWC-3 | 0.921 | 0.917 | 0.919 | 0.923 | 0.911 | 0.916
F1 comparison.
Algorithms | Economics | Politics | Sports | Weather | Entertainment | Culture
AdaBoost | 0.850 | 0.856 | 0.850 | 0.862 | 0.855 | 0.860 |
AdaBoost.M1 | 0.855 | 0.862 | 0.863 | 0.876 | 0.860 | 0.866 |
AdaBoost.MR | 0.856 | 0.863 | 0.848 | 0.866 | 0.853 | 0.865 |
AdaBoost.ECC | 0.849 | 0.849 | 0.848 | 0.847 | 0.843 | 0.847 |
Naïve Bayes | 0.765 | 0.796 | 0.783 | 0.805 | 0.814 | 0.804 |
SVM | 0.868 | 0.864 | 0.872 | 0.874 | 0.876 | 0.864 |
Neural network | 0.833 | 0.808 | 0.821 | 0.808 | 0.827 | 0.805 |
Decision tree | 0.812 | 0.795 | 0.785 | 0.825 | 0.799 | 0.813 |
AG-kNN + DIWC-1 | 0.896 | 0.888 | 0.896 | 0.912 | 0.881 | 0.899
AG-kNN + DIWC-2 | 0.895 | 0.910 | 0.909 | 0.922 | 0.906 | 0.903
AG-kNN + DIWC-3 | 0.907 | 0.911 | 0.918 | 0.924 | 0.920 | 0.912
Average performance of algorithms.
Algorithms | Precision | Recall | F1
AdaBoost | 0.854 | 0.857 | 0.856 |
AdaBoost.M1 | 0.858 | 0.864 | 0.861 |
AdaBoost.MR | 0.857 | 0.862 | 0.860 |
AdaBoost.ECC | 0.847 | 0.857 | 0.852 |
Naïve Bayes | 0.789 | 0.795 | 0.792 |
SVM | 0.869 | 0.871 | 0.870 |
Neural network | 0.819 | 0.818 | 0.819 |
Decision tree | 0.805 | 0.805 | 0.805
AG-kNN + DIWC-1 | 0.899 | 0.896 | 0.898
AG-kNN + DIWC-2 | 0.905 | 0.910 | 0.908
AG-kNN + DIWC-3 | 0.916 | 0.918 | 0.917
As shown in the tables above, the AG-kNN-based DIWC algorithms achieve the best precision, recall, and F1 among all compared methods.
Therefore, AG-kNN combined with DIWC is an effective and efficient tool for multiclass text categorization.
It is interesting to note that classification of weather reports achieves the best precision and recall. This is probably because weather reports are quite simple and always contain similar features or key words such as
An improved boosting algorithm based on adaptive group-based kNN weak classifiers and double iterative weighted cascading was proposed in this paper.
However, the support vector machine is one of the best classification algorithms, and using it as a weak classifier combined with ideas similar to DIWC remains unexplored territory in text categorization. Moreover, there is room for improving the accuracy and efficiency of AG-kNN and DIWC.
The material presented in this paper is partly based upon work supported by the China Association for Science and Technology. The experimental data were provided by Sogou Labs.