Multiclass Boosting with Adaptive Group-Based kNN and Its Application in Text Categorization

AdaBoost is an excellent committee-based tool for classification. However, its effectiveness and efficiency in multiclass categorization are challenged by methods based on support vector machines (SVM), neural networks (NN), naïve Bayes, and k-nearest neighbor (kNN). This paper presents a novel multiclass AdaBoost algorithm that avoids reducing the multiclass classification problem to multiple two-class problems. The novel method is more effective while keeping the accuracy advantage of existing AdaBoost. An adaptive group-based kNN method is proposed to build more accurate weak classifiers and thereby keep the number of basis classifiers within an acceptable range. To further enhance performance, the weak classifiers are combined into a strong classifier through a double iterative weighting scheme, yielding an adaptive group-based kNN boosting algorithm (AGkNN-AdaBoost). We implemented AGkNN-AdaBoost in a Chinese text categorization system. Experimental results show that the proposed classification algorithm achieves better precision and recall than many other text categorization methods, including traditional AdaBoost. In addition, its processing speed is significantly higher than that of the original AdaBoost and many other classic categorization algorithms.


Introduction
Machine learning (ML) based text categorization (TC) can be defined, similarly to other data classification tasks, as the problem of approximating an unknown category assignment function F : D × C → {0, 1}, where D is the set of all possible documents and C is the set of predefined categories [1]. The approximating function M : D × C → {0, 1} is called a classifier, and the task is to build a classifier that produces results as "close" as possible to the true category assignment function F [2]: for instance, deciding whether an article is fiction, whether a short message is an advertisement, or whether the author of a script is Shakespeare, and so forth.
In a text categorization project, documents usually need to be preprocessed to select suitable features. Each document is then represented by its features. After these steps, the classifier determines the category of the document. The flow chart of a TC task is shown in Figure 1.
Depending on the task, preprocessing includes some or all of the following steps: transforming unstructured documents into a structured or semistructured format, word segmentation, and text feature selection. Feature selection is the most important part of preprocessing [3]. Features can be characters, words, phrases, concepts, and so forth [4]. Document representation is the process of describing a text by its features with different weights. The kernel of the classifier is a machine learning algorithm: it takes the document representation as input and outputs the categorization result.
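To make the representation step concrete, the following sketch builds weighted feature vectors for already-tokenized documents using the common TF-IDF weighting. This is an illustration only, not the paper's implementation; the `tf_idf` helper and the toy documents are our assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute a simple TF-IDF weight vector for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Weight of term t = (relative frequency in doc) * log(n / df(t)).
        vec = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        vectors.append(vec)
    return vectors

docs = [["java", "code", "java"], ["english", "test"], ["java", "english"]]
vecs = tf_idf(docs)
```

A term that occurs in every document gets weight zero, so only discriminative features carry weight into the classifier.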
Imagine an international IT corporation that is interested in job seekers' Java programming experience and English ability. The resume screening program of this company is in fact a TC system: it can assist managers in choosing appropriate employees. The system is shown in Figure 2.
Because the classifier is the key component of a TC system, researchers have made considerable achievements in the design of categorization algorithms [5]. Several of the most important methods are naïve Bayes, support vector machines (SVM), k-nearest neighbor (kNN), decision trees (DT), neural networks (NN), and voting-based algorithms such as AdaBoost. Comparative experiments have revealed that SVM, kNN, and AdaBoost have the best precision; naïve Bayes has the worst performance but is very useful as a baseline classifier because of its ease of use; and DT and NN perform worse than the top three methods but also have lower computational complexity [6-8]. In a word, the purpose of classifier design and research in TC is to improve performance while maintaining the balance between performance and cost.
The rest of this paper is organized as follows. Section 2 reviews related work and states the goal of this paper. Section 3 improves classic kNN in order to build weak classifiers on it. In Section 4, a double iterative weighted cascading algorithm is proposed to construct a strong classifier. Section 5 then modifies AdaBoost, based on Sections 3 and 4, to solve multiclass problems. The application of the novel classification algorithm is presented and analyzed in Section 6. Finally, Section 7 summarizes the paper.

Related Work and Motivation
Voting-based categorization algorithms, also known as classifier committees, can adjust the number and the professional level of the "experts" in the committee to find a balance between performance and computational cost. These algorithms give up the effort to build a single powerful classifier and instead integrate the views of many weak classifiers. The philosophical principle of this methodology is that the truth is usually held by the majority. Bagging and boosting are the two most popular kinds of voting-based methods.

Boosting Algorithm
Unlike bagging, which trains the classifiers in parallel, boosting trains the classifiers sequentially. Before the next classifier is trained, the training set is reweighted to allocate greater weight to the documents that were misclassified by the previous classifiers [9]. The system can therefore pay special attention to controversial texts and enhance its precision.
The original boosting algorithm uses three weak classifiers $c_1$, $c_2$, $c_3$ to form a committee. It randomly divides a large training set into three parts $X_1$, $X_2$, $X_3$ and first uses $X_1$ to train $c_1$. It then uses the subset of $X_1$ that is misclassified by $c_1$, together with a subset that $c_1$ categorizes correctly, as the training set of $c_2$. The rest can be done in the same manner.
Scholars have been committed to enhancing the performance and reducing the overhead of boosting, so many improved boosting algorithms such as BrownBoost, LPBoost, LogitBoost, and AdaBoost have been proposed. Most comparative studies show that AdaBoost has the best performance among them [10].

Detail of AdaBoost
Boosting and its related algorithms have had great success in several practical fields such as image processing, audio classification, and optical character recognition (OCR). At the same time, boosting needs huge training sets, so the runtime consumption sometimes becomes unacceptable. Moreover, a lower limit on the accuracy of the weak classifiers needs to be predicted.
To control the computational cost in a reasonable range, Schapire and Singer [11] proposed AdaBoost. It uses a dual-weighted process to choose training sets and classifiers. The detailed steps of AdaBoost are as follows.

(1) Given a training set $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, where $x_i$ is a training sample and $y_i \in \{1, -1\}$ denotes the category label of $x_i$ ($1 \le i \le n$).

(2) Let $f_j(x_i)$ denote the $j$th feature of the $i$th document.

(3) Define the initial distribution of documents in the training set as uniform:
$$D_1(i) = \frac{1}{n}. \quad (2.1)$$

(4) Search for the weak classifier $c_t$ ($t = 1, 2, \ldots, T$): for the $j$th feature of every sample, a weak classifier $c_j$ can be obtained by finding the threshold $\theta_j$ and orientation $P_j$ that minimize the error $\varepsilon_j = \sum_{i=1}^{n} D_t(i)\,[c_j(x_i) \ne y_i]$. The weak classifier $c_j$ is
$$c_j(x) = \begin{cases} 1, & P_j f_j(x) < P_j \theta_j, \\ -1, & \text{otherwise}. \end{cases} \quad (2.2)$$

(5) Choose the $c_j$ with the minimal error $\varepsilon_j$ over the whole feature space as the weak classifier $c_t$.

(6) Recalculate the weights of the samples:
$$D_{t+1}(i) = \frac{D_t(i)\,\exp\bigl(-\alpha_t y_i c_t(x_i)\bigr)}{Z_t}, \quad (2.3)$$
where $Z_t$ is a normalization factor which makes $\sum_{i=1}^{n} D_{t+1}(i) = 1$ and $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$ is the weight of $c_t$.

(7) Repeat the steps above $T$ times to obtain $T$ optimal weak classifiers with different weights.

(8) Combine the weak classifiers according to their weights to construct a strong classifier:
$$C(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t c_t(x)\right). \quad (2.4)$$

Training set utilization is enhanced by adjusting the weights of misclassified texts [12]. In addition, the performance of the strong classifier is improved because it is constructed in a weighted way [13]. In a word, AdaBoost has lower training consumption and higher accuracy than the original boosting algorithm. Researchers have proposed variants of AdaBoost focusing on different aspects such as precision, recall, robustness, computational overhead, and multiclass categorization [14]; we call these algorithms the AdaBoost family. The three most important indicators are precision, efficiency, and the ability of multiclass categorization. The performance of AdaBoost family members is shown in Figure 3 [15].
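The steps above can be sketched in code. The following minimal Python implementation is an illustration only (all function names are ours, not the paper's); it uses one-dimensional decision stumps of the form in step (4) as the weak classifiers.

```python
import math

def stump_predict(x, j, theta, p):
    """Decision stump: predict 1 if p * x[j] < p * theta, else -1."""
    return 1 if p * x[j] < p * theta else -1

def train_adaboost(X, y, T):
    """Train T weighted stumps following the listed steps (1)-(8)."""
    n, d = len(X), len(X[0])
    D = [1.0 / n] * n                      # step (3): uniform initial weights
    classifiers = []
    for _ in range(T):
        best = None
        for j in range(d):                 # step (4): search every feature
            for theta in sorted({x[j] for x in X}):
                for p in (1, -1):
                    err = sum(D[i] for i in range(n)
                              if stump_predict(X[i], j, theta, p) != y[i])
                    if best is None or err < best[0]:
                        best = (err, j, theta, p)
        err, j, theta, p = best            # step (5): minimal-error stump
        err = max(err, 1e-10)              # avoid log/division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        new_D = [D[i] * math.exp(-alpha * y[i] * stump_predict(X[i], j, theta, p))
                 for i in range(n)]        # step (6): reweight samples
        Z = sum(new_D)                     # normalization factor Z_t
        D = [w / Z for w in new_D]
        classifiers.append((alpha, j, theta, p))
    return classifiers

def strong_classify(classifiers, x):
    """Step (8): sign of the alpha-weighted committee vote."""
    s = sum(a * stump_predict(x, j, theta, p) for a, j, theta, p in classifiers)
    return 1 if s >= 0 else -1
```

Misclassified samples gain weight after each round, so later stumps concentrate on the "difficult" documents, exactly as the reweighting step intends.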

Problems of AdaBoost Family and Motivation of This Article
Figure 3 reveals that few algorithms in the AdaBoost family achieve high precision and high efficiency at the same time, especially in multiclass categorization problems. Unfortunately, multiclass is the main problem in classification tasks. Traditional methods, which translate the multiclass problem into multiple two-class problems, reduce accuracy and increase the complexity of the system [16].
To solve the problems above, we design weak classifiers with high accuracy and low complexity so as to limit the number of experts, keeping the precision while reducing the consumption. A more professional expert should play a more important role, and misclassified documents should attract greater attention, to further improve the system's performance. Therefore, more reasonable rules should be made to combine the weak classifiers into a strong classifier efficiently. In addition, this strong classifier should be directly usable in multiclass classification tasks. This is the motivation and purpose of this paper.

Weak Classifiers with AGkNN
Theoretically, once the weak classifiers are more accurate than random guessing (1/2 in two-class tasks or 1/n in n-class tasks), AdaBoost can integrate them into a strong classifier whose precision can get arbitrarily close to the true category distribution [17]. However, when the precision of the weak classifiers is lower, more weak classifiers are needed to construct a strong classifier. Too many weak classifiers sometimes increase the system's complexity and computational consumption to an intolerable level. Categorization systems that use naïve Bayes or C4.5 as their weak classifiers may face this problem.
Some researchers have tried to design weak classifiers based on more powerful algorithms such as neural networks [18] and support vector machines [19]. These algorithms can certainly achieve higher accuracy, but they lead to new problems because they are overly complex and thus contrary to the ideology of boosting.

k-Nearest Neighbor
Example-based classification algorithms keep a balance between performance and cost [20]. k-nearest neighbor is the most popular example-based algorithm, as it has higher precision and lower complexity.
To make the classification, kNN transforms the target document into a representational feature vector with the same form as the training samples. It then calculates the distances between the target document and the training samples to select the k nearest neighbors [21]. Finally, the category of the target document is determined according to its neighbors' classes. The schematic of two-class kNN is shown in Figure 4.
The distance between two documents is calculated by a distance function:
$$\operatorname{dist}(a, b) = \sqrt{\sum_{m=1}^{M} \bigl(f_m(a) - f_m(b)\bigr)^2}, \quad (3.1)$$
where $f_m(\cdot)$ is the $m$th feature weight of a document. As shown, the function computes the Euclidean distance between two documents in a linear space. After choosing the nearest $k$ neighbors as the reference documents, the category $C_j$ that includes the most neighbors can be found as
$$y(x, C_j) = \sum_{d_i \in kNN} \operatorname{Sim}(x, d_i)\, y(d_i, C_j), \quad (3.2)$$
where $d_i$ is the $i$th training document, $\operatorname{Sim}(a, b)$ is the similarity of documents $a$ and $b$, and $y(\alpha, \beta)$ represents the probability that document $\alpha$ belongs to category $\beta$.
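A minimal sketch of this procedure in Python, assuming Euclidean distance and simple majority voting among the neighbors (the helper names are ours):

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, target, k):
    """train is a list of (feature_vector, label) pairs.
    Pick the k nearest training documents and vote by category."""
    neighbors = sorted(train, key=lambda d: euclidean(d[0], target))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

A weighted variant would scale each vote by the similarity, as in the category score formula above; plain counting is the simplest special case.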

Adaptive Group-Based kNN Categorization
Traditional kNN has two main problems: experience dependence and sample category balance. Experience dependence means that k is an empirical value that must be preset [22]. Sample category balance notes that when the numbers of samples belonging to different categories differ greatly, the classification results tend to be inaccurate; in other words, the system expects the category distribution of the samples to be as even as possible. An adaptive group-based kNN (AGkNN) algorithm is proposed in this paper to solve these problems. The basic idea of AGkNN is to divide the training set into multiple groups, use kNN to categorize target documents in each subset in parallel with a random initial value of k, and then compare the classification results. If the results of different groups are broadly consistent with each other, the group size and the value of k are kept. When the results are highly similar to each other, the groups are merged and the value of k is reduced. If the results are totally different from each other, the value of k is increased. In particular, when two opposing views are counterbalanced, the system should increase the number of groups to make the final decision.
Define $N_g(i)$ as the number of training groups when processing the $i$th document. Give every category a number as its class name: for example, define a news report set {financial news, sports news, entertainment news, political news, weather report} as {1, 2, 3, 4, 5}. Let $C_j(i)$ be the categorization result of the $i$th document determined by the $j$th group (for example, $C_4(5) = 5$ means that, in the fourth group's opinion, the fifth document is a weather report), and let $\bar{C}$ be the average value of the categories assigned by the different groups. For instance, if a document is classified as "sports news, sports news, entertainment news, entertainment news, sports news" by 5 training groups, the average category value is $\bar{C} = (2 + 2 + 3 + 3 + 2)/5 = 2.4$. The number of groups can be adaptively determined as
$$N_g(i+1) = \begin{cases} N_g(i) + 1, & \operatorname{Var}\bigl(C_j(i)\bigr) > \bar{C}, \\ N_g(i), & 1/\bar{C} \le \operatorname{Var}\bigl(C_j(i)\bigr) \le \bar{C}, \\ N_g(i) - 1, & \operatorname{Var}\bigl(C_j(i)\bigr) < 1/\bar{C}. \end{cases} \quad (3.3)$$
According to (3.3), the system decides whether and how to adjust the grouping of the samples by referring to the variance of the classification results given by the different groups. When the variance is higher than the upper threshold, the categorization is not accurate enough because disagreement exists, and more groups are needed to make the final decision. On the other hand, when the variance is very low, there are almost no disputes in the classification, and the sample groups can be merged to save time. In this paper, we use $1/\bar{C}$ and $\bar{C}$ as the lower and upper bounds because the average category value is empirically suitable and convenient to use as a threshold.
The value of k can be calculated adaptively as
$$k(i+1) = \begin{cases} \gamma_i + 1, & \operatorname{Var}\bigl(C_j(i)\bigr) > \bar{C}, \\ \gamma_i, & 1/\bar{C} \le \operatorname{Var}\bigl(C_j(i)\bigr) \le \bar{C}, \\ \gamma_i - 1, & \operatorname{Var}\bigl(C_j(i)\bigr) < 1/\bar{C}, \end{cases} \quad (3.4)$$
where $\gamma_i$ is the random initial value of k. The system can thus test whether the random initial k is suitable: it judges from the variance whether the majority of the classifiers have reached agreement, infers whether the categorization result is precise enough, and adjusts the value of k adaptively to get a more accurate result.
The detailed work steps are shown in Figure 5.
In this way, the algorithm can set the value of k adaptively and make full use of the training set, because the training set is grouped automatically and the value of k is initialized randomly. Furthermore, the system can adjust the number of groups and reference neighbors adaptively by calculating and updating, in real time, the variance of the categorization results given by the different groups. No case is left uncovered in the algorithm, and its core is still the kNN algorithm, whose convergence was proved in [23]; therefore AGkNN converges. The runtime complexity of computing the variance of n elements is 3n − 1, so the computational complexity T of classifying one document with this algorithm is
$$T = N_g \bigl(k + 3 N_g - 1\bigr). \quad (3.5)$$
Therefore, the main problems that limited the effectiveness of the original kNN for a long time are eliminated in AGkNN. Solving the experience-dependence problem means the algorithm can achieve higher efficiency and robustness; overcoming the need for category balance means the system can improve its accuracy. In summary, AGkNN has better performance and lower complexity than classic kNN, which makes it well suited as the weak classifier in AdaBoost.
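The adaptive loop described above can be sketched as follows. The ±1 adjustment of k and of the group count is our illustrative assumption (the text specifies only the direction of the adjustment), and category labels are the numeric class names defined earlier.

```python
import math
import random
import statistics
from collections import Counter

def knn_vote(group, target, k):
    """Plain kNN inside one group: vote among the k nearest samples."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(group, key=lambda d: dist(d[0], target))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def agknn_round(train, target, k, n_groups, seed=0):
    """One AGkNN round with numeric category labels: group the training
    set, run kNN per group, then adjust k and the group count from the
    variance of the per-group results (illustrative reading of the rules)."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i::n_groups] for i in range(n_groups)]
    results = [knn_vote(g, target, min(k, len(g))) for g in groups]
    c_bar = statistics.mean(results)          # average category value
    var = statistics.pvariance(results)
    if var > c_bar:                           # strong disagreement: grow
        k, n_groups = k + 1, n_groups + 1
    elif var < 1 / c_bar:                     # near-unanimous: shrink/merge
        k, n_groups = max(1, k - 1), max(2, n_groups - 1)
    label = max(set(results), key=results.count)
    return label, k, n_groups
```

When the groups agree, the next round runs with fewer groups and a smaller k, which is where the runtime saving comes from.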

Generating Weak Classifiers
Weak classifier design is critical for differentiating the positive and negative samples in the training set. The precision of the weak classifiers must be better than 50% to ensure the convergence of the strong classifier. Therefore, a threshold θ needs to be set for the weak classifiers to guarantee the system performance obtained by combining them into a more powerful strong classifier [24]. Define the weight of a positive document x as $w_p$, the weight of a negative document as $w_n$, the positive threshold as $\theta_p$, and the negative threshold as $\theta_n$. The threshold is calculated as
$$\theta_p = \min\Bigl(\max\bigl(w_p(x) - w_p(x)\,w_n(x)\bigr)\Bigr), \qquad \theta_n = \min\Bigl(\max\bigl(w_n(x) - w_n(x)\,w_p(x)\bigr)\Bigr),$$
$$\theta = \begin{cases} \theta_p, & \theta_p \le \theta_n, \\ \theta_n, & \text{otherwise}. \end{cases} \quad (3.6)$$
The accuracy of the weak classifiers can be kept above 0.5 by introducing and updating the threshold θ. Weak classifiers based on AGkNN can therefore be generated by following these steps.
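As a minimal illustration of the accuracy requirement (not the full threshold computation above), a generator might keep only those candidate classifiers whose training accuracy exceeds 0.5; everything here, including the `generate_weak_classifiers` name, is our assumption.

```python
def generate_weak_classifiers(candidates, samples, labels):
    """Keep only candidate classifiers whose training accuracy exceeds
    0.5, the condition required for the boosted committee to converge."""
    kept = []
    for clf in candidates:
        correct = sum(1 for x, y in zip(samples, labels) if clf(x) == y)
        if correct / len(samples) > 0.5:
            kept.append(clf)
    return kept
```

A constant classifier on a balanced sample scores exactly 0.5 and is rejected, while any classifier that captures real structure passes the gate.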

DIWC Algorithm: A Tool for Constructing Strong Classifier
Whether the strong classifier performs well depends largely on how the weak classifiers are combined. To build a powerful strong classifier, the basis classifiers with higher precision must take more responsibility in the categorization process. Therefore, the categorization system should distinguish between the performances of the weak classifiers and give them different weights according to their capabilities. Using these weights, boosting algorithms can integrate the weak classifiers into the strong classifier more efficiently and achieve excellent performance [25].

Review of the Weighting Mechanism in Original AdaBoost
The original AdaBoost algorithm uses a linear weighting scheme to generate the strong classifier. In AdaBoost, the strong classifier is defined as
$$C(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t c_t(x)\right), \quad (4.1)$$
with the classifier weight
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t}, \quad (4.2)$$
and step (5) reweights the samples as
$$D_{t+1}(i) = \frac{D_t(i)\,\exp\bigl(-\alpha_t y_i c_t(x_i)\bigr)}{Z_t}. \quad (4.3)$$
The steps above show that AdaBoost automatically gives higher weights to the classifiers with better classification performance, especially through step (5). AdaBoost is therefore simple to implement, performs feature selection over a large set of features, and has good generalization ability.
However, this weighting algorithm does not check the precision of earlier classifiers on the later training documents. In other words, strong classifier generation is a single iterative process. Weak classifiers may well perform differently on different training samples: the weak classifiers that AdaBoost considers deserving of higher weights actually perform better on the earlier part of the training set, while basis classifiers that perform well on the later part may be ignored unreasonably. Therefore, the credibility of the weights decreases along the test sequence. We call this phenomenon weight bias. Weight bias can lead to suboptimal solutions and make the system oversensitive to noise; accuracy suffers from these problems, and the robustness of the system decreases.
To overcome these drawbacks, the boosting algorithm should use a double iterative process to allocate weights to the basis classifiers more reasonably.

Double Iterative Weighted Cascading Algorithm
In AdaBoost, weak classifiers with higher weights can certainly process correctly the documents that were misclassified by lower-weight classifiers. This is important but not enough to improve categorization performance; two crucial problems are ignored.
(1) Can basis classifiers with higher weights classify, with high accuracy, samples that are already correctly categorized by classifiers with lower weights?
(2) Do weak classifiers with lower weights really lack the power to process documents that are misclassified by high-weight classifiers?
The credibility of the weights is reduced when the answers to these two questions are not absolutely positive. It is therefore worth introducing these questions into the allocation of the weak classifiers' weights.
This paper proposes a double iterative weighted cascading (DIWC) algorithm to solve the two problems above and make the utilization of the basis classifiers more efficient. The kernel idea of DIWC is to add a second weighting pass in which the training samples are input in reverse order; compared with the original AdaBoost algorithm, this process can be called double iterative. The average weight of a basis classifier over the two weighting passes is used as its final weight. Replacing the weight used in traditional AdaBoost with the average weight of the two iterations avoids the weight bias problem because it takes the two questions above into account: it defines "powerful" for a basis classifier using not only the former part but the full training sample. The sketchy procedure chart of DIWC is shown in Figure 6.
DIWC achieves the weighting procedure shown in Figure 6 by the steps below. . .

Iterative steps n + 1 to 2n form the second loop, in which the training samples are input in reverse order. Then:
(7) Calculation: calculate the weights of the basis classifiers according to the second round of training loops.
(8) Calculate the final weights of the basis classifiers from the results of step (4) and step (7).
(9) Cascade: combine the basis classifiers according to their final weights to construct the strong classifier.
Three methods, with different accuracy and complexity, can be used to calculate the final weights.
The first method is quite simple: calculate the arithmetic mean of the weights from the two iterative loops and use it as the weak classifier's final weight. This method has a very low computational cost. In this paper, it is called DIWC-1.
Note that some basis classifiers may have a very high weight in both the first and the second rounds of loops. This means these classifiers have globally high categorization ability and should play a more important role in the classification process than the simple average weight would give them; in this case, an upper bound value is set as the final weight of such significantly powerful classifiers. On the other hand, some classifiers may have a very low weight in both iterative loops, and the utility of these classifiers must be limited by a lower bound value to enhance the system's accuracy. This method spends more time on computing but has higher precision. It is called DIWC-2.
The third method concerns the situation where a weak classifier has a very high weight in one round of loops but a very low weight in the other. One more iterative pass is then needed to determine the final weight. In particular, if the variance of the weights over the three rounds is significantly large, the system considers the weak classifier noise-oversensitive and reduces its weight. This method achieves the best precision and robustness, but its training consumption is also the highest. We call it DIWC-3 in this paper.
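The three weight-combination rules can be sketched as a single function. All threshold values here ($\sigma_l$, $\sigma_h$, the bounds, and the variance threshold) are illustrative assumptions, not values from the paper.

```python
import statistics

def combine_weights(w1, w2, w3=None, variant=1,
                    w_min=0.1, w_max=2.0, sigma_l=0.3, sigma_h=0.9, delta=0.5):
    """Final weight of one basis classifier from its per-loop weights.
    DIWC-1: arithmetic mean of the two loops.
    DIWC-2: mean, but capped (floored) when both loops agree on an extreme.
    DIWC-3: additionally penalizes a classifier whose three-loop weight
    variance is large, treating it as noise-sensitive."""
    mean = (w1 + w2) / 2
    if variant == 1:
        return mean
    if w1 > sigma_h and w2 > sigma_h:
        return w_max                        # globally strong: cap at W_MAX
    if w1 < sigma_l and w2 < sigma_l:
        return w_min                        # globally weak: floor at W_MIN
    if variant == 3 and w3 is not None:
        if statistics.pvariance([w1, w2, w3]) > delta:
            return min(w1, w2, w3)          # unstable expert: reduce weight
        return (w1 + w2 + w3) / 3
    return mean
```

DIWC-1 is a single averaging pass; DIWC-2 adds two comparisons per classifier; DIWC-3 may trigger a third weighting loop, which matches the cost ordering discussed below.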
The computational complexities of DIWC-1, DIWC-2, and DIWC-3 can be calculated by referring to (3.5). Let m be the number of documents to be classified. The runtime complexity $T_1$ of DIWC-1 is simply twice the single-pass cost:
$$T_1 = 2mT. \quad (4.4)$$
In DIWC-2, the weights of the two iterative passes are compared; an upper bound $\sigma_h$ is introduced when a classifier has a very high weight in both rounds, and a lower bound $\sigma_l$ is introduced when a classifier has a very low weight in both rounds. Because not every basis classifier needs an upper or lower bound, and introducing bounds causes extra computation, the runtime complexity $T_2$ ranges in
$$2mT \le T_2 \le 2mT + 2m. \quad (4.5)$$
DIWC-3 considers not only the upper and lower bounds but also the difference between the weights in the two iterative loops; when the two weights differ greatly, a third loop may be needed for the final decision. Similarly to DIWC-2, the runtime complexity $T_3$ ranges in
$$2mT \le T_3 \le 3mT + 2m. \quad (4.6)$$
As this analysis shows, the computational complexity is proportional to k, m, and $N_g^2$; when the number of classification objects increases, the time consumption increases linearly. The algorithms therefore avoid the index explosion problem and have an acceptable runtime complexity. In addition, the algorithms converge because no case is left uncovered and the values of the weights are finite.

Using Training Sets More Efficiently
As reviewed in Section 4.1, traditional AdaBoost gives higher importance to the documents misclassified by earlier weak classifiers, to improve the system's ability to categorize "difficult" documents. This idea helps AdaBoost achieve better precision than earlier boosting algorithms. However, AdaBoost still leaves room for improving the efficiency with which the training documents are used.
In fact, all the training documents that are categorized incorrectly should be gathered into an error set, which is then used to train every basis classifier. Accuracy further improves when the training documents are used in this way, and the implementation of this method is quite convenient. Integrating this method with DIWC-1, DIWC-2, and DIWC-3 yields the complete double iterative weighted cascading algorithm. The pseudocode of DIWC is shown in Algorithm 1, where $e_i$ is the error set of the $i$th basis classifier, $\omega_1(i)$ is the weight of the $i$th classifier in the first iterative loop, $\omega_2(i)$ is its weight in the second iterative loop, $\varepsilon$ is the lower threshold of the difference between $\omega_1(i)$ and $\omega_2(i)$, $\sigma_h$ is the upper threshold of the weight, $W_{\mathrm{MAX}}$ is the upper bound, $\sigma_l$ is the lower threshold of the weight, $W_{\mathrm{MIN}}$ is the lower bound, $\delta$ is the upper threshold of the difference between $\omega_1(i)$ and $\omega_2(i)$, and $\omega_i$ is the final weight of the $i$th classifier.
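The error-set idea can be sketched as follows. This is a simplified illustration of gathering misclassified documents for retraining, not the full Algorithm 1, which also interleaves the two weighting loops.

```python
def collect_error_set(classifiers, samples, labels):
    """Gather every training document misclassified by any basis
    classifier into one shared error set; the caller then reuses this
    set as extra training material for every basis classifier."""
    error_set = []
    for x, y in zip(samples, labels):
        if any(clf(x) != y for clf in classifiers):
            error_set.append((x, y))
    return error_set
```

Because the pooled error set is shared, each classifier also sees the documents that defeated its peers, which is where the extra accuracy comes from.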

Multiclass Classification
Most members of the AdaBoost family are oriented to two-class classification tasks. When solving a multiclass problem, they often transform it into multiple two-class problems. Such algorithms tend to fall short in accuracy or efficiency and have difficulty reaching perfection when faced with multiclass categorization tasks. However, multiclass is a main problem in classification tasks; on many occasions, a simple two-valued logic such as yes/no cannot satisfy the requirements of the categorization task. For instance, a news report may belong to politics, economics, sports, culture, scientific discovery, or entertainment. In other words, processing multiclass classification tasks with high performance should be the most important goal of boosting-based algorithm development.

kNN in Multiclass Classification
As the kernel algorithm of the weak classifiers, k-nearest neighbor has a natural advantage in solving multiclass problems. The mathematical expression of kNN is
$$C(x) = \arg\max_{C_j} \sum_{d_i \in kNN} \operatorname{Sim}(x, d_i)\, y(d_i, C_j). \quad (5.1)$$
This function reveals that the kNN algorithm can easily be used in multiclass classification problems: unlike other categorization algorithms, kNN does not divide the problem into two subspaces or subparts but records the class label $C_j$ directly. It therefore needs little modification to fit the multiclass categorization problem.
Traditional text categorization research often uses the Euclidean distance or the Manhattan distance to measure the similarity between samples. However, for multiclass categorization problems, these distance definitions cannot distinguish the importance of different feature weights effectively [27]. To solve this problem, the Mahalanobis distance is used in this paper. Let $S$ be the covariance matrix of the training feature vectors,
$$S = \operatorname{E}\bigl[(X - \mu)(X - \mu)^{\mathrm T}\bigr], \quad (5.2)$$
and the distance between vectors $X_i$ and $X_j$ is defined as
$$d(X_i, X_j) = \sqrt{(X_i - X_j)^{\mathrm T} S^{-1} (X_i - X_j)}. \quad (5.3)$$
In this way, the importance of different feature weights can be distinguished effectively [28]. Because kNN is easily used in multiclass situations, we can construct the strong classifier without much modification of the basis classifier itself.
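A sketch of the Mahalanobis computation, assuming the inverse covariance matrix has already been estimated from the training features (a pure-Python quadratic form, to stay self-contained):

```python
def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance sqrt((x-y)^T S^{-1} (x-y)), where cov_inv
    is the inverse covariance matrix S^{-1} of the training features."""
    d = [a - b for a, b in zip(x, y)]
    # Quadratic form d^T S^{-1} d.
    q = sum(d[i] * cov_inv[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return q ** 0.5
```

With the identity matrix as $S^{-1}$ this reduces to the Euclidean distance; a high-variance feature axis gets a small inverse-covariance entry and therefore contributes less, which is exactly the weighting effect the paper wants.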

Integrating Strong Classifiers
According to the analysis in the former subsection, the weak classifiers in this paper are easily used in multiclass classification problems. However, the performance can be further improved by changing the way the strong classifiers are used.
In the AdaBoost family, a strong classifier tends to be used directly to solve a two-class problem, or independently to divide a multiclass problem into several two-class problems. This is probably the simplest way but certainly not the best, because the accuracy of two-class categorization cannot be further enhanced in the strong classification step and the complexity of the multiclass categorization problem cannot be constrained efficiently.
Several strong classifiers can work together to solve the problems above. In this paper, we propose a method of cascading strong classifiers to further improve the precision and limit the consumption in multiclass classification tasks.
The method of integrating strong classifiers is best explained with an example. Suppose we use four strong classifiers in series to determine which category a document belongs to. When they make the same judgment, it is used as the final result. When they get different results, the principle that the minority is subordinate to the majority is applied. In particular, when two different determinations are exactly balanced, a reclassification process is needed to get the final result. The working logic of this method is shown in Figure 7.
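The series voting logic can be sketched as follows; the tie-breaking callback stands in for the reclassification process, and the function name is ours.

```python
from collections import Counter

def cascade_vote(strong_classifiers, doc, reclassify=None):
    """Run the strong classifiers in series and vote on the results:
    unanimous -> that label; clear majority -> the majority label;
    an exact tie triggers the reclassification step."""
    results = [clf(doc) for clf in strong_classifiers]
    counts = Counter(results).most_common()
    if len(counts) == 1:
        return counts[0][0]                 # unanimous judgment
    if counts[0][1] > counts[1][1]:
        return counts[0][0]                 # minority subordinate to majority
    return reclassify(doc) if reclassify else counts[0][0]
```

Running the classifiers in parallel instead only changes where the list comprehension executes; the voting rule itself stays the same.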
Integrating strong classifiers in series can improve the classification accuracy because the Cramer-Rao bound is lower in this situation [29]. The derivation and definition of the original Cramer-Rao bound contain many integrals and are thus very complex, so we use the modified Cramer-Rao bound (MCRB) in this paper [30]:
$$\operatorname{MCRB}(\tau) = \left( \operatorname{E}_{X,\mu}\!\left[ \left( \frac{\partial \ln p(X \mid \mu, \tau)}{\partial \tau} \right)^{2} \right] \right)^{-1}, \quad (5.4)$$
where $p(X \mid \mu, \tau)$ is the conditional probability of X given the variables μ and τ. Reference [31] proved that, in text categorization, the averaged result of multiple classifiers has a lower MCRB than the result of a single classifier; therefore the system's precision can be improved by this method. However, feeding documents to strong classifiers in series significantly extends the categorization time. To save processing time, the strong classifiers can be combined in parallel, but then the computational consumption increases. To keep the balance between time and computational consumption when implementing the integration method in real systems, users should decide whether to combine the classifiers in series or in parallel according to the size of the document collection, the mutual information (MI) of the different categories, the hardware capability, and the time consumption tolerance of the system and the project.

Application and Analysis
The novel text categorization tool of this paper, the adaptive group-based k-nearest-neighbor AdaBoost (AGkNN-AdaBoost) algorithm, was fully presented in the former sections. To evaluate its performance, we tested its training time in Matlab with different submodules and parameters. We also measured the time consumption of other algorithms and made comparisons to analyze whether and why AGkNN-AdaBoost is better than many other tools, which parts contribute to its efficiency advantage over some algorithms, and which mechanisms make it spend more training time than others. A text categorization system based on AGkNN-AdaBoost was implemented, and plenty of standard corpus texts were used to measure its precision, recall, F1, and so forth with different submodules and different initial parameters. We compared AGkNN-AdaBoost's performance not only with the AdaBoost family algorithms but also with other classic classification algorithms such as SVM, decision trees, neural networks, and naïve Bayes. We analyzed all the data carefully and took the time consumption into account to reach our final conclusion about the novel tool's performance.

Experiment for Time Consumption Analysis
In the training step, we use a document set in which each document is a bit more or less than 2 KB, selected from standard corpora, to test the time needed for modeling with AGkNN-AdaBoost. The relationship between the number of documents and the time consumption under different parameters and the random model is shown in Figure 8.
In Figure 8, we test the modeling time with training sets containing 10, 50, 200, 1000, 5000, and 20000 documents. We select k = 3, k = 4, k = 5, and random k as the number of reference neighbors. In each situation, we set the number of groups to 4, 5, and 6 to evaluate the novel tool's performance with different parameters. In this test, the stochastic strategy is used for strong classifier generation; that is, the system uses DIWC-1, DIWC-2, or DIWC-3 at random. For ease of viewing, the logarithms of the document numbers are used to draw the curves.
As shown in the chart above, the time consumption increases as the number of neighbors or groups increases. Note that logarithmic coordinates are used in the figure, so time consumption grows significantly with changes of N_g and k. Therefore, adaptively adjusting the number of neighbors and groups is of great significance for improving the system's efficiency while guaranteeing performance. In random-k mode, the system's training time is longer than in k = 3 mode and shorter than in k = 5 mode. Whether its efficiency is higher than that of k = 4 mode depends mostly on the number of groups, the number of documents, the size of each document, and the document types.
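The interplay of k and the number of groups can be illustrated with a minimal sketch of a group-based kNN weak classifier. This is our own illustrative assumption, not the paper's implementation: training vectors are partitioned once around randomly seeded centroids, and a query searches neighbors only inside its nearest group, so k and the group layout together determine the work per query.

```python
import math
import random
from collections import Counter

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def group_knn_fit(X, y, n_groups):
    """One-pass grouping (illustrative only): pick n_groups random training
    points as centroids and assign every (vector, label) pair to its nearest."""
    centroids = [X[i] for i in random.sample(range(len(X)), n_groups)]
    groups = [[] for _ in range(n_groups)]
    for xi, yi in zip(X, y):
        g = min(range(n_groups), key=lambda j: dist(xi, centroids[j]))
        groups[g].append((xi, yi))
    return centroids, groups

def group_knn_predict(x, centroids, groups, k):
    """Vote among the k nearest members of the query's nearest group --
    this locality is what keeps the weak classifier cheaper than a full scan."""
    g = min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))
    members = groups[g] or [p for grp in groups for p in grp]  # fallback if empty
    nearest = sorted(members, key=lambda p: dist(x, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A query touches n_groups centroid distances plus one group's members, instead of the whole training set, which is why the group count and k both move the timing curves.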
To compare AGkNN-AdaBoost, whose strong classifier is based on DIWC-1, DIWC-2, or DIWC-3, with other classic categorization algorithms, we designed experimental control groups including our novel algorithm, SVM, neural networks, and naïve Bayes. Similar to the former part, 5000 documents of about 2 KB each, downloaded from standard corpora, are used for the comparison. The result is shown in Figure 9.
Time consumption changes with the parameters and with the way the strong classifiers are combined. The training times of DIWC-1, DIWC-2, and DIWC-3 with different sizes of training set are shown in Figure 9. The red dashed line represents DIWC-1, the blue dash-dotted line DIWC-2, and the brown dotted line DIWC-3.
Figure 9 reveals that time consumption increases quickly as the training set becomes larger. In addition, more training time is needed when a more complex way of integrating the weak classifiers is used. AdaBoost is a big algorithm family; we chose its most classic and most efficient members, original AdaBoost, AdaBoost.M1, AdaBoost.MR, and AdaBoost.ECC, to evaluate the runtime complexity level of the novel algorithms proposed in this paper. We used the same training set as in the former experiment, and the result is shown in Figure 10.
Figure 10 clearly shows that AGkNN-AdaBoost has higher efficiency than original AdaBoost and AdaBoost.M1. Moreover, with DIWC-1, DIWC-2, or DIWC-3 as its strong classifier construction strategy, AGkNN-AdaBoost spends training time equal to or less than AdaBoost.MR and AdaBoost.ECC, the efficiency leaders of the AdaBoost family [32], because the adaptive grouping mechanism can better fit the data. It should be noted that the efficiency difference between AGkNN-AdaBoost and AdaBoost.ECC is not obvious; the main reason is that the advanced reweighting process of DIWC-3 takes a lot of time. However, AGkNN-AdaBoost still has advantages over AdaBoost.ECC, because AGkNN-AdaBoost with DIWC-1 and DIWC-2 has significantly lower runtime complexity. What is more, the precision of AGkNN-AdaBoost will be shown to be better than that of AdaBoost.ECC regardless of whether DIWC-1, DIWC-2, or DIWC-3 is used.

Performance Comparison and Analysis
An experiment is made to evaluate the performance of the system. The Chinese news corpus provided by Sogou Labs [33] is used as the training set and test set. Kernel conditional random fields (KCRFs) [34] are used for preprocessing: word segmentation, feature extraction, and representation of the documents. The corpus is divided into six categories: economics, politics, sports, weather reports, entertainment, and culture. In each category, 20000 documents are randomly selected as training samples and 10000 documents as test texts. Experimental results of the system's precision, recall, and F1-measure, together with comparative data [35-37], are shown in Tables 1, 2, 3, and 4.
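For reference, the per-class precision, recall, and F1 figures of the kind reported in the tables can be computed as follows; this is a generic metric sketch, and the labels in the usage example are illustrative, not the experimental data.

```python
def per_class_prf(y_true, y_pred, labels):
    """Per-class (precision, recall, F1) from parallel label lists,
    plus the macro averages over all classes."""
    stats = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats[c] = (prec, rec, f1)
    macro = tuple(sum(s[i] for s in stats.values()) / len(labels) for i in range(3))
    return stats, macro
```

For example, with `y_true = ["sports", "sports", "politics"]` and `y_pred = ["sports", "politics", "politics"]`, the "sports" class gets precision 1.0 and recall 0.5.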
As shown in the above-mentioned tables, AGkNN-AdaBoost has better performance than the other text categorization algorithms. Its performance is far beyond naïve Bayes, neural networks, and decision tree. In addition, the novel classification tool outperforms the other AdaBoost family members, especially on topics whose formal phrases lead the documents to have more complex features and a possibly extremely sparse feature space.

Conclusion
An improved boosting algorithm based on k-nearest neighbors is proposed in this paper.
It uses an adaptive group-based kNN categorization algorithm for the basis classifiers and combines them in a double iterative weighted cascading method that offers three alternative modes. The strong classifiers are also modified to better satisfy multiclass classification tasks. The AGkNN-AdaBoost algorithm is implemented in a text categorization system, and several experiments are made. Experimental results show that the algorithm proposed in this paper has higher precision, recall, and robustness than traditional AdaBoost. Furthermore, the time and computational consumption of AGkNN-AdaBoost are lower than those of many other categorization tools, not limited to the AdaBoost family. Therefore the algorithm proposed in the former sections is quite a useful tool in text categorization, including Chinese TC problems. However, the support vector machine is one of the best classification algorithms, and its use as a weak classifier combined by ideas similar to DIWC remains unexplored in text categorization. Moreover, there is room for improving the accuracy and efficiency of AGkNN-AdaBoost, and its performance in other classification tasks such as image processing, speech categorization, and writer identification should be evaluated. These will be undertaken as future work on this topic.

Figure 2: An instance of a TC system.

(1) Start: initialize the document weights w_{d_i} and the weak classifier weights w_{c_j}.
(2) Training: train the first classifier c_1 with the first sample document subset s_1; mark the set of documents misclassified by c_1 in s_1 as e_1.
(3) Loop: train c_i with s_i and e_{i-1}.
(4) Calculation: calculate the weights of the basis classifiers according to the first round of loop trainings.
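The outline above might be sketched as follows. This is a hedged illustration: the subset sampling and the AdaBoost-style reweighting are standard choices of ours, not the paper's exact AGkNN-AdaBoost procedure, and `train_weak` stands in for the group-based kNN learner.

```python
import math
import random

def agknn_boost_train(docs, labels, n_rounds, train_weak):
    """Train a committee of weak classifiers following the outline above.
    train_weak(docs, labels) must return a callable doc -> predicted label."""
    doc_w = [1.0 / len(docs)] * len(docs)   # document weights w_d
    classifiers, clf_w = [], []             # weak classifiers and weights w_c
    errors = []                             # misclassified docs of the last round
    for t in range(n_rounds):
        # sample subset s_t by current document weights, plus last errors e_{t-1}
        subset = random.choices(range(len(docs)), weights=doc_w, k=len(docs) // 2)
        subset = list(dict.fromkeys(subset + errors))  # dedup, keep order
        clf = train_weak([docs[i] for i in subset], [labels[i] for i in subset])
        preds = [clf(d) for d in docs]
        eps = sum(w for w, p, y in zip(doc_w, preds, labels) if p != y)
        eps = min(max(eps, 1e-10), 1 - 1e-10)          # avoid log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)        # AdaBoost-style weight
        classifiers.append(clf)
        clf_w.append(alpha)
        errors = [i for i, (p, y) in enumerate(zip(preds, labels)) if p != y]
        # reweight documents: up-weight mistakes, then renormalize
        doc_w = [w * math.exp(alpha if p != y else -alpha)
                 for w, p, y in zip(doc_w, preds, labels)]
        s = sum(doc_w)
        doc_w = [w / s for w in doc_w]
    return classifiers, clf_w
```

Carrying the previous round's error set e_{t-1} into the next subset is what forces later weak classifiers to concentrate on the hard documents.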

Figure 9: Training time of different algorithms.

Figure 10: Time consumption of the AdaBoost family.
h_t(x) is a basis classifier, α_t is a coefficient, and H(x) is the final strong classifier. Given the training documents and category labels (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), with x_i ∈ X and y_i ∈ {±1}, the strong classifier can be constructed as [26]: initialize the weights D_1(i) = 1/m; then, for t = 1, 2, ..., T:
(1) Select a weak classifier with the smallest weighted error ε_t: h_t = arg min ε_t.
(2) Prerequisite: ε_t < 1/2; otherwise stop.
(3) The training error is upper bounded by ε(H) ≤ ∏_{t=1}^{T} Z_t, where Z_t is a normalization factor.
(4) Select α_t to greedily minimize Z_t(α) in each step.
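The construction can be made concrete with a minimal binary AdaBoost over one-dimensional threshold stumps; this is a textbook-style sketch under our own naming, not the paper's code. It follows the steps above, with α_t = (1/2) ln((1 − ε_t)/ε_t), the choice that greedily minimizes Z_t(α).

```python
import math

def adaboost(X, y, T):
    """Binary AdaBoost over threshold stumps on scalar inputs.
    Returns the strong classifier as a list of (alpha_t, threshold, sign)."""
    m = len(X)
    D = [1.0 / m] * m                       # D_1(i) = 1/m
    H = []
    for _ in range(T):
        best = None
        for thr in sorted(set(X)):          # candidate stumps
            for sign in (1, -1):            # h(x) = sign if x >= thr else -sign
                preds = [sign if x >= thr else -sign for x in X]
                eps = sum(d for d, p, yy in zip(D, preds, y) if p != yy)
                if best is None or eps < best[0]:
                    best = (eps, thr, sign, preds)
        eps, thr, sign, preds = best
        if eps >= 0.5:                      # prerequisite eps_t < 1/2
            break
        eps = max(eps, 1e-10)               # avoid log(0) on a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        H.append((alpha, thr, sign))
        D = [d * math.exp(-alpha * yy * p) for d, p, yy in zip(D, preds, y)]
        Z = sum(D)                          # normalization factor Z_t
        D = [d / Z for d in D]
    return H

def strong_classify(H, x):
    """H(x) = sign(sum_t alpha_t * h_t(x))."""
    s = sum(a * (sg if x >= thr else -sg) for a, thr, sg in H)
    return 1 if s >= 0 else -1
```

Each round reweights by exp(−α_t y_i h_t(x_i)) and divides by Z_t, so the product of the Z_t bounds the training error as in step (3).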

Table 4: Average performance of algorithms.