A Semisupervised Cascade Classification Algorithm

Classification is one of the most important tasks of data mining techniques, which have been adopted by several modern applications. The shortage of labeled data in the majority of these applications has shifted interest towards semisupervised methods. Under such schemes, collected unlabeled data combined with a clearly smaller set of labeled examples can lead to similar or even better classification accuracy than supervised algorithms, which use labeled examples exclusively during the training phase. A novel approach for improving semisupervised classification using the Cascade Classifier technique is presented in this paper. The main characteristic of the Cascade Classifier strategy is the use of a base classifier for enlarging the feature space by adding either the predicted class or the class probability distribution of the initial data. The classifier of the second level is supplied with the new dataset and extracts the decision for each instance. In this work, a self-trained NB∇C4.5 classifier algorithm is presented, which combines the characteristics of Naive Bayes as a base classifier and the speed of C4.5 for final classification. We performed an in-depth comparison with other well-known semisupervised classification methods on standard benchmark datasets and found that the presented technique has better accuracy in most cases.


Introduction
Pattern recognition and data mining from massively collected datasets have attracted extensive research over the last decades. More and more recommendation systems from many scientific domains demand high quality information for providing accurate predictions. The most common procedure applied in such cases is the classification task. The aim of classification is to build a model, according to the base classifier that has been chosen, for assigning each tested instance to one of a set of predefined and distinct classes with the highest possible accuracy under any time restrictions. The nature of the classes depends on the scientific domain to which each examined problem belongs. Moreover, the number of classes affects the expected classification accuracy and also determines how explanatory the prediction can be. For example, a simplified problem could include some data for deciding whether a patient suffers from a disease or not. This can be characterized as a binary problem, since the only two classes are "Yes" and "No." On the other hand, in a more in-depth analysis, the classes could represent all the different diseases from which a patient may suffer, and the prediction would in this case be based on more extensive and clearly more time-consuming procedures. Two important questions arise from examples like the previous one: first, how the classification task can be accurate when little data is available and, second, how the processing time can be reduced to a level that is acceptable in relation to the available computational resources.
The traditional division of classification theory contains the supervised and the unsupervised approaches. According to the former, all the available categorized data are used for building an appropriate classification model. Consequently, provided that an initial dataset ($D$) is given along with a well-defined vector of classes $C$, which can be described as $C = \{\text{class}_1, \text{class}_2, \ldots, \text{class}_k\}$ in the generalized case of $k$ distinct classes, supervised methods build a mapping $f(D)$ for assigning the most probable class to each instance of $D$. Another variant of the same algorithms is the computation of conditional probability distribution vectors, whose dimension is $(k \times 1)$ and in which each value gives the probability that the $i$th instance belongs to the $j$th class, $P(\text{target class} = \text{class}_j \mid \text{instance}_i)$, with $1 \le j \le k$. The latter approach is characterized by a lack of knowledge of both the categorization of the provided data and the number of classes. Instead, these algorithms try to extract useful information from the available data, usually relying on some stochastic assumptions for acquiring some initial information. The conditions under which such algorithms operate explain their inferior performance with respect to supervised learning algorithms. However, a new group of methods has been suggested recently [1]; they are commonly called semisupervised learning (SSL) algorithms and are considered a mixture of supervised and unsupervised algorithms. Merging the properties of the original approaches and generating new semisupervised algorithms is still a field that attracts the interest of many researchers. The common property of all these approaches is the search for an effective combination of few labeled data with many more unlabeled ones for increasing both their accuracy rate and their learning ability. Moreover, the improved accuracy of these algorithms against their ancestors, which has been observed across many scientific fields, has highlighted semisupervised techniques as a great tool for the machine learning community.
The attribute which defines the quantitative relation between labeled and unlabeled data is called the labeled ratio and is computed as the number of labeled examples divided by the number of all examples. The labeled ratio is abbreviated as $R$ in the following text. The dependence between parameter $R$ and the performance of different algorithms is examined in depth in Section 4.
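As a small illustration of this definition, the following sketch splits a set of examples according to a chosen labeled ratio and recomputes the ratio afterwards. The function names `split_by_labeled_ratio` and `labeled_ratio`, the random shuffling, and the rounding rule are assumptions made for this example and are not taken from the implementation used in this work.

```python
import random


def split_by_labeled_ratio(examples, ratio, seed=0):
    """Split examples into a labeled part and an unlabeled part.

    ratio is the labeled ratio: labeled examples / all examples.
    At least one example is always kept in the labeled part.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must lie in (0, 1]")
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_labeled = max(1, round(ratio * len(shuffled)))
    return shuffled[:n_labeled], shuffled[n_labeled:]


def labeled_ratio(labeled, unlabeled):
    """Fraction of labeled examples among all examples."""
    return len(labeled) / (len(labeled) + len(unlabeled))
```

For instance, splitting 100 examples with a ratio of 0.1 under these assumptions yields 10 labeled and 90 unlabeled examples.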
Many practical applications and machine learning concepts suffer from practical difficulties in collecting enough reliable labeled data, whereas unlabeled examples are plentiful in domains like Web mining [2], Speaker Recognition, and Object Detection [3]; the values of the labeled ratio $R$ therefore tend to be small. Since in most of these domains either there are no automated procedures for correctly labeling new examples or the cost of such procedures is not always affordable, the semisupervised concept seems to meet the current needs of real-life scenarios [4].
Triguero et al. [5] introduced a taxonomy of self-labeled techniques for semisupervised learning, having made an extensive study of the applicability of these methods to classification concepts. Some of the properties that were used for establishing this taxonomy are the influence of the number of labeled examples and the time requirements of each algorithm. Both of these play a cardinal role in real-life problems. Another generic review of semisupervised learning algorithms has been made recently by Schwenker and Trentin [6]. Concerning the classification concept, they refer to four different categories of algorithms whose behavior is based on one of the following methods: (i) incremental techniques, (ii) generative models, (iii) support vector machines, (iv) graphs.
Schwenker and Trentin [6] introduced the terms of partially or weakly supervised learning and indicated some promising challenges of this scientific domain.
The largest family of algorithms for learning tasks such as classification is that of ensemble learning schemes. The basic idea behind this category is the combination of multiple classifiers for improving on the average behavior and the robustness of any of the involved single classifiers. Another benefit is that lengthy tuning or retuning of a single classifier for each different dataset is avoided. The two architectures that ensemble classifiers can follow are the sequential (or cascade) architecture and the parallel one. Voting, which uses Bayesian learning theory, and Stacking, which learns from the predictions of level-0 classifiers via a single metaclassifier, are the most common schemes. Cascade Generalization was proposed in 1997 by Gama and Brazdil [7] and follows the theory of Stacked Generalization. According to this, provided a dataset that is described by a raw feature vector of dimensions $(d \times 1)$, we can define $n$ levels with $n \ge 2$. In each level, one or more chosen classifiers examine the dataset independently and output their predictions. These predictions are added to the initial dataset and are passed to the next level. This chain of classifiers is written with the ∇ operator, as in NB∇C4.5 for the two-level chain studied in this work. As a result, the classifier of the last level is supplied with an enlarged dataset and its output is clearly guided and affected by the decisions of the lower levels. This strategy is called incremental batch learning. Two different schemes were suggested in [7] and compared with many different algorithms. The authors also present some explanatory examples and discuss the syntactic formalism of the new feature spaces that are generated. It is important to mention that Cascade Classifier theory has been used only for supervised concepts [8][9][10] until now.
The aim of our work was to present a self-trained Cascade Classifier algorithm and compare it with other well-known semisupervised classification methods on standard benchmark datasets. Since time requirements for real-life applications must be considered, and given the increased complexity of long cascade chains, we selected the minimum number of levels needed for a Cascade Classifier. At the same time we combined a Bayesian classifier (NB) [11] and a decision tree classifier (C4.5) [12] for obtaining fast response under an ensemble scheme. Besides examining the classification accuracy rate, we performed statistical comparisons of the proposed method (self-trained (NB∇C4.5)) with other algorithms and provided an illustrative visualization of the average accuracy of each algorithm against the others for different labeled ratios. Our technique presented better accuracy in most cases and a better overall performance in different scenarios, which makes this algorithm a robust tool. The specific chain of the proposed classifier is depicted in Figure 1.
The rest of this work is organized as follows. In Section 2, a brief description of the semisupervised classification techniques is provided. In Section 3, the proposed algorithm is presented. Section 4 contains the results of the comparison of the proposed algorithm with other well-known semisupervised classification methods on standard benchmark datasets. Finally, some concluding remarks and future research directions are presented in Section 5. The Appendix provides the link to a tool that implements our proposed algorithm.

Semisupervised Techniques
Many variants of self-labeled techniques have been proposed over the last years, because many practical problems can be modeled under the scenario that both labeled and unlabeled data exist [13]. Any choice of the labeled ratio is allowed, which leads to a large number of possible experiments or simulations. The only restriction is the limited number of labeled examples that are available in some scientific fields.
Even worse, the available data may not describe some or all the views of a problem sufficiently. For example, in a noise retrieval problem, the detection of the noisy segments should be described both in the time and in the frequency domain. The absence of labeled data in one or both domains would affect the performance of the self-labeled method. Some methods, such as boosting, tackle these problems but do not always manage to affect the learned hypotheses, especially in extreme situations. The most promising strategy for eliminating such phenomena has been proposed by Triguero et al. [14]. According to this work, the generation of new synthetic labeled data seems capable of filling in the labeled data distribution.
Provided that a few labeled examples are available, self-labeled methods have shown promising results. In particular, the information that is extracted from the unlabeled data when they are combined with the labeled subset has proved adequate for modifying the learned hypothesis in comparison with the supervised scenario. Once the labeled ratio has been chosen, all the available data ($D$) are split into two distinct subsets: the Labeled ($L$) and the Unlabeled ($U$) set, with $D = L \cup U$. The generic representation of the examples included in each of these subsets is $x_s = \{(\text{Feature set}) \mid \text{class}\}$, where $s$ denotes the subset. When $s$ equals $L$, the class is known and these examples contribute to the training phase. When $s$ equals $U$, on the other hand, the class is totally unknown. Depending on the theory of each self-labeled method, these subsets interact in various ways. There could be a movement of examples from $U$ to $L$, an iterative reweighting of the $L$ examples for classifying the examples of $U$ with higher accuracy, or even the generation of multiple subsets of $L$ for classifying the examples of $U$.
The most well-known semisupervised technique is the self-training method. It is usually called a wrapper method and its simplicity has attracted many researchers from different domains. Its main asset is the lack of restrictions on the labeled data that are given. This is because it acts on the assumption that in each iteration the information that is extracted is correct and sufficient to lead to better results. The whole function of this scheme, as far as the semisupervised classification (SSC) task is concerned, can be separated into five different steps [3]. First, there is an initial step during which all the stopping criteria are described and set for the rest of the procedure. The accuracy threshold (AccT) for the accepted examples is also set. After this, the original dataset ($D$) is split into the $L$ (labeled) and $U$ (unlabeled) subsets, as mentioned previously. Thirdly, a classifier of the user's choice is trained with the examples of the $L$ subset, which have been chosen randomly from $D$. During the fourth step, the classification of the unlabeled examples takes place and an assessment procedure follows. More specifically, each point that scores a probability value greater than or equal to the specific threshold ($P(x_i) \ge \text{AccT}$) is considered informative enough for enhancing the learning ability of the algorithm in the next training phases. At the final step, all these high-valued examples are removed from subset $U$ and inserted into the initial training set ($L$), increasing in this way its cardinality under the assumption that the new examples are also correctly labeled. These five steps compose one complete iteration of the simplified self-training scheme. The classifier is retrained using the new enlarged training set until the stopping criteria are satisfied. Although self-training has met great success in many practical problems, fluctuations in its performance compared with similar supervised algorithms are still often observed. A good explanation is that, during the training phase of self-training, some of the unlabeled examples will not get labeled, because the algorithm terminates first [6]. This means that part of the total information provided through the dataset is not exploited under this scheme.
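The five steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the implementation evaluated in this work: `fit_centroids` is a toy nearest-centroid learner standing in for the user's chosen classifier, and the names `self_train`, `acc_t`, and `max_iter` are hypothetical. The default values 0.9 and 40 are the threshold and the iteration limit reported for the proposed algorithm later in the paper.

```python
import math


def fit_centroids(labeled):
    """Toy stand-in classifier (assumption): nearest class centroid.

    Returns predict_proba(x) -> {label: probability}; probabilities are a
    softmax over negative squared distances to the class centroids.
    """
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict_proba(x):
        scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c))
                  for y, c in centroids.items()}
        top = max(scores.values())
        exps = {y: math.exp(s - top) for y, s in scores.items()}
        total = sum(exps.values())
        return {y: e / total for y, e in exps.items()}

    return predict_proba


def self_train(labeled, unlabeled, fit, acc_t=0.9, max_iter=40):
    """Generic self-training loop following the five steps in the text."""
    # Steps 1-2: stopping criteria and threshold are the arguments;
    # the labeled and unlabeled subsets are given by the caller.
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_iter):
        model = fit(labeled)                      # step 3: train on labeled set
        confident, remaining = [], []
        for x in unlabeled:                       # step 4: score unlabeled set
            proba = model(x)
            label = max(proba, key=proba.get)
            if proba[label] >= acc_t:
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:                         # nothing passed the threshold
            break
        labeled.extend(confident)                 # step 5: move to labeled set
        unlabeled = remaining
    return fit(labeled), labeled, unlabeled
```

Examples that never reach the threshold stay in the unlabeled set when the loop ends, which is the unexploited information mentioned at the end of the paragraph above.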
Another critical point on which the self-training scheme has to be assessed is the principle of high confidence that governs it and may lead to the opposite results [15]. If the quality of the initial labeled data provided by a human expert or by a specialized mining tool is poor, the final prediction will also be distorted and will not provide a totally correct set of labeled data. For this reason, statistical tests or other criteria have been added to ensure that the self-training scheme continues to perform well, forming in this way new algorithms. An algorithm that derives from the primary self-training scheme is Self-Training with Editing (SETRED) [16]. Its prominent modification is a restriction on the acceptance or rejection of the examples that the algorithm evaluates as trustworthy. This restriction stems from graph theory. A neighborhood graph is built in the $d$-dimensional feature space, where $d$ is the dimension of the feature row vector. All the candidate examples for being appended to the initial training set are filtered through a hypothesis test, and only those that pass the test are finally added to the $L$ subset before the end of each iteration. The general performance of SETRED over many datasets has confirmed the need for intermediate filtering stages in self-labeled methods.
The cotraining method is another popular generalized scheme under which a family of semisupervised algorithms has been formed, utilizing more than one view of any problem that needs to be mined for any learning task. Sun [17] reviews theories that describe the characteristics of multiview learning. The basic idea that is examined is the chance of enhancing the classification accuracy by obtaining examples from multiple sources that are characterized by different feature vectors. Two of the theories scrutinized by Sun are Canonical Correlation Analysis (CCA), first proposed by Hotelling et al. [18,19], for the case of two views, and the effectiveness of the cotraining method. The integration of multiple views and their editing is also reviewed by Xu et al. [19], who try to organize the different strategies that have been proposed for multiview learning. Another equally important issue that is discussed is the possibility of constructing multiple views and their evaluation.
Cotraining was introduced by Blum and Mitchell [20]. In contrast with the self-training method, which follows single-view learning and demands one feature vector independently of its physical consistency, the cotraining method requires two distinct feature vectors, each of which represents a different view [13,14,17,21]. Similar to self-training theory, for each different view a weak classifier is trained over the corresponding training set. The aim is again to select the examples that can be considered reliable enough for the tested case and enlarge the initial training set with them. This procedure is executed alternately for maximizing the mutual agreement on the two feature vectors. The stopping criteria are also slightly modified to account for the existence of the two classifiers. Even though this assumption seems to be more effective when the two subsets of features are obtained from different views with some clear natural meaning, Nigam and Ghani [22] showed experimentally that the cotraining scheme can perform well enough even if the different views come from random splits of an original feature vector. Moreover, the concatenation of many feature sets and their application to single-view methods can cause overfitting, especially in cases in which the initial dataset is characterized by low labeled ratios.
The power of the cotraining method as a function of the size of the initial training set has been examined by Didaci et al. [23]. Their study showed that cotraining managed to achieve high quality results even in cases where the algorithm was provided with very few examples per class. This does not mean that the cotraining scheme has no weaknesses. Du et al. [24], after an extensive series of experiments, concluded that running semisupervised algorithms on small labeled training sets cannot ensure that the assumptions of the multiview setting hold. Many variants of the basic cotraining scheme have been developed in an attempt to improve its performance. Furthermore, since the hazard of false acceptance of unlabeled examples remains and may deteriorate the classification accuracy, Multiple Kernel Learning (MKL) and Subspace learning-based theory have been used to filter the examples for higher reliability [19]. One approach according to the latter theory was proposed by Sun and Jin [25], who trust only the examples that satisfy both the high confidence of the used classifier and the CCA restrictions for being finally appended to the initial training set. Another algorithm with great success over many domains is Democratic-Co [26], which also follows the multiview theory but from another angle. Instead of asking for more than one view of the data, it uses multiple algorithms for producing the necessary information and adopts a majority voting process for the final decision.
Ensemble classifiers or committees of classifiers can also be used under a semisupervised scheme for exploiting the power of more than one weak learner [27]. The key point in this situation is the diversity of the included classifiers. Many artificial tactics have been presented for injecting diversity into a group of classifiers when the original diversity does not reach the expected levels. Bagging is perhaps the most popular strategy for achieving such results. In this case, each base classifier is produced from a random sample of the initial training set. Extensive research on the properties of ensemble classifiers has been carried out by Kuncheva and her colleagues over the last fifteen years [28,29]. Jiang et al. [30] introduced a hybrid method which weights the participation of two different classifiers for raising the total accuracy above the individual behavior of each participating classifier. It is worth mentioning that the combined classifiers were Naive Bayes (NB) and Support Vector Machine (SVM), which are a generative classifier and a discriminative classifier, respectively. A representative approach that uses ensemble theory is the TriTraining scheme, which does not require any redundant view for being applied [31]. It is based on the decisions of three classifiers that classify each tested instance according to majority voting. Each classifier is trained over a different sampled subset of $L$. An enhanced variant of the TriTraining scenario is the improved TriTraining algorithm (im-tri-training) [32]. Specific weaknesses of the default scheme, such as unsuitable error estimation, are tackled for achieving a more robust behavior.
Following similar theories, Li and Zhou [15] developed the CoForest algorithm, in which a number of Random Trees are trained on bootstrap data from the dataset. This technique has inherited the robustness of ensemble methods even if the number of available labeled examples is reduced. One important reason why this behavior generalizes over multiple datasets is the use of the Random Tree classifier on random samples of the collected labeled data. Majority voting extracts the final prediction. ADE-CoForest [33] originated from the previous algorithm. Its main asset is an embedded editing technique for preventing misclassified examples from impairing its learning ability. Distance metrics have also been applied with the cotraining scheme, and one of the most well-defined products was cotraining by committee, which has been proposed by Hady and Schwenker [34]. Three ensemble methods were used (Co-Bag, CoAdaBoost, and CoRSM) for testing these metrics. This method does not rely on the multiview concept and remains a single-view method.
Wang et al. [35] developed the Rasco algorithm (Random Subspace Method for co-training), which follows the ensemble theory and tries to produce diversity among its base classifiers. According to this algorithm, a random split of the feature vector is suitable for training a group of different learners. After the training of each base classifier on a randomly chosen subset of the initial feature vector has been completed, the enlargement of the training set by the unlabeled examples along with their class assignment is filtered through a number of decisions of the base learners. An extension of the original Rasco theory is Rel-Rasco [36]. The aim of eliminating the inaccurate learners produced by the Rasco method motivated Yaslan and Cataltepe to suggest an algorithm which produces relevant random subspaces and then performs semisupervised ensemble learning using those subspaces together with unlabeled data.

Proposed Algorithm
Cascade Generalization may be regarded as a special case of Stacked Generalization, mainly due to the layered learning structure. Some aspects that distinguish Cascade Generalization are the following:
(i) All classifiers have access to the original attributes. Any new attribute built at lower layers is treated in exactly the same way as any of the original attributes. The new attributes are either categorical, when the predicted class is used, or continuous, when they take the form of a class probability distribution.
(ii) The goal of Cascade Generalization is to obtain a model that can use terms in the representation language of lower level classifiers.
Cascading classifiers is particularly useful for models that have highly combinatorial or counting rules (e.g., class 1 if exactly two features are negative, class 2 otherwise), which cannot be fitted without looking at all the interaction terms. Cascading enables the successive stages to gradually approximate the combinatorial nature of the classification or to add interaction terms in classification algorithms that cannot express them in one stage.
In our case of Cascade Generalization, the new attributes are first derived from the class prediction given by the Naive Bayes (NB) learner. This constructive step extends the representational language for the high-level learner, C4.5. The presented ensemble can be symbolized as NB∇C4.5 and is described by pseudocode in Algorithm 1.
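The constructive step described above can be illustrated with a minimal sketch. It is not the WEKA-based implementation used in this work: the Naive Bayes learner below handles categorical attributes only, C4.5 itself is not implemented, and the names `fit_naive_bayes`, `extend_with_prediction`, `fit_cascade`, and the `fit_level2` argument (a placeholder for any high-level learner such as C4.5) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels):
    """Categorical Naive Bayes with Laplace smoothing (level-1 learner)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    domains = defaultdict(set)            # feature index -> observed values
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            domains[i].add(v)
    n = len(labels)

    def predict_proba(row):
        scores = {}
        for y, cy in class_counts.items():
            p = cy / n
            for i, v in enumerate(row):
                p *= (value_counts[(i, y)][v] + 1) / (cy + len(domains[i]))
            scores[y] = p
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}

    return predict_proba


def extend_with_prediction(rows, predict_proba):
    """Constructive step: append the level-1 predicted class to each row."""
    extended = []
    for row in rows:
        proba = predict_proba(row)
        extended.append(tuple(row) + (max(proba, key=proba.get),))
    return extended


def fit_cascade(rows, labels, fit_level2):
    """Two-level cascade: NB extends the data, fit_level2 learns on it.

    fit_level2(extended_rows, labels) must return a function that maps a
    list of extended rows to a list of predicted labels.
    """
    nb = fit_naive_bayes(rows, labels)
    level2 = fit_level2(extend_with_prediction(rows, nb), labels)
    return lambda new_rows: level2(extend_with_prediction(new_rows, nb))
```

Appending the class probability distribution instead of the predicted class would only change `extend_with_prediction`; the rest of the chain stays the same.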
In this work, we propose a self-training method that uses the power of a cascade ensemble for semisupervised tasks. The proposed algorithm (self-trained (NB∇C4.5)) is presented in Algorithm 2. The self-training process produces good results by using the more accurate class probabilities of the NB∇C4.5 model for the unlabeled examples.
As for which examples are removed from the unlabeled set $U$ and added to the labeled set $L$: if the probability of the most probable class exceeds the predefined threshold $T$, the instance is assigned that label. Our experiments showed that a good option for the threshold parameter is the value 0.9, which gave decent classification accuracy irrespective of the dataset. We noticed that only a small number of examples per class meet this restriction in each iteration. For the implementation, we used the open-source environments of WEKA [37] and KEEL [5].

Experiments
The experiments are based on standard classification datasets taken from the KEEL-dataset repository [38], covering a wide range of scientific fields. These datasets have been partitioned using the 10-fold cross-validation procedure. For each generated fold, a given algorithm is trained with the examples contained in the rest of the folds (training partition) and then tested with the current fold. Each training partition is divided into two parts: labeled and unlabeled examples. In order to study the influence of the amount of labeled data, we examined four different ratios for dividing the training set: 10%, 20%, 30%, and 40%. Subsequently, we compared the proposed method with other state-of-the-art algorithms available in the KEEL tool [38], such as self-training (C45) [3], self-training (NB) [5], SETRED [16], co-training (C45) [13], Democratic-Co [26], TriTraining (C45) [31], TriTraining (NB) [5], DE-TriTraining (C45) [39], DE-TriTraining (NB) [5], CoForest [15], Rasco (C45) [35], Rasco (NB) [35], Rel-Rasco (C45) [36], Rel-Rasco (NB) [36], Co-Bagging (C45) [34], Co-Bagging (NB) [5], and ADE-CoForest [33]. For all tested algorithms, the default parameters of KEEL were used. The classification accuracy of each tested algorithm using 10%, 20%, 30%, and 40% as labeled ratio is presented in Tables 1, 2, 3, and 4, respectively. The best accuracy value among the different algorithms tested in each experiment is shown in bold. Here, we present only the best 10 of these algorithms, according to their classification accuracy. We also provide a more representative visualization of the average accuracy of the proposed algorithm in comparison with the remaining 18 algorithms, presented in Figure 2. In this figure, each ratio of labeled examples is mapped to a different color and line format across a radar plot.
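The partitioning protocol described above can be sketched as follows. This is only an illustration under assumptions: the experiments in this work rely on the partitions handled by KEEL, whereas the generator below (the hypothetical `semi_supervised_folds`) uses a plain, non-stratified random assignment of examples to folds.

```python
import random


def semi_supervised_folds(examples, n_folds=10, ratio=0.1, seed=0):
    """Yield (labeled, unlabeled, test) for each cross-validation fold.

    Each training partition (all folds except the current one) is divided
    into a labeled part of size ratio * |train| and an unlabeled part.
    The class values of the unlabeled part must be hidden from the learner.
    """
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        n_labeled = max(1, round(ratio * len(train)))
        yield train[:n_labeled], train[n_labeled:], test
```

Calling the generator once per labeled ratio (0.1, 0.2, 0.3, and 0.4) with the same seed keeps the test folds identical across ratios, so only the amount of labeled data changes.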
Figure 2 depicts the association between the accuracy of the involved algorithms and the different values of the labeled ratio $R$. A noteworthy conclusion is that an increase of $R$ (%) does not necessarily mean that the average accuracy of an algorithm will increase. For instance, both versions of Rasco and Rel-Rasco did not manage to score a better accuracy when the provided labeled examples rose from 10% to 20%. Similar irregularities are found in TriTraining (NB) and Co-Bagging (NB), where the choice of $R$ = 30% gave better results than 40%. Such saturation phenomena can be detected, and thus avoided, with visualizations like that in Figure 2. The full tables of comparisons can be found in the supplemental Excel file (see Supplementary Material available online at http://dx.doi.org/10.1155/2016/5919717).
A statistical comparison of the tested algorithms has also been applied for all the selected values of the labeled ratio $R$. To perform this task, the Friedman test together with two similar post hoc statistical tests (Holm/Hochberg) described in [40] was chosen. The Friedman test is a nonparametric equivalent of the repeated-measures ANOVA. It ranks the algorithms for each dataset separately and compares the average ranks of the algorithms. The null hypothesis states that all the algorithms are equivalent and therefore have the same average rank over the datasets. The Holm procedure (1979) [41] was used in Demšar (2006) [42] for comparisons of multiple classifiers involving a control method. The Hochberg test (1988) [43] has more power than Holm's procedure, but the difference between them is not very notable when considering all pairwise comparisons. The results are presented in Tables 5 and 6.
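To make the ranking step concrete, the sketch below computes the average Friedman ranks and the standard Friedman chi-square statistic, $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j \bar{r}_j^2 - \frac{k(k+1)^2}{4}\right]$, for $N$ datasets and $k$ algorithms. The function names and the input layout are assumptions for illustration, no tie correction is applied to the statistic, and the values in Tables 5 and 6 were not produced with this code.

```python
def average_ranks(scores):
    """scores[d][a] = accuracy of algorithm a on dataset d (higher is better).

    Returns the average rank per algorithm; rank 1 is best and tied
    algorithms share the mean of the ranks they span.
    """
    n_datasets, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1          # mean of ranks pos+1 .. end+1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / n_datasets for t in totals]


def friedman_statistic(scores):
    """Friedman chi-square statistic (k - 1 degrees of freedom)."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
```

The statistic is compared with the chi-square distribution with $k - 1$ degrees of freedom; a large value rejects the null hypothesis that all algorithms share the same average rank.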
Regarding the post hoc tests, if the sorted $p$-values increase rapidly enough, the procedures by Holm and Hochberg give identical answers. Judging by the produced results and, more specifically, by the average accuracy on benchmark datasets, as Figure 2 depicts, and by the ranking of the proposed algorithm in the Friedman tests, as presented in Tables 5 and 6, the proposed algorithm gives better results than all the other tested algorithms. This is due to the better probability-based ranking and the higher classification accuracy that are induced by the series combination of the Naive Bayes and C4.5 classifiers. At this point we have to mention that the same algorithm was also examined with a modified Level-1 classifier. More specifically, we changed the output stage of the Naive Bayes classifier so as to append the class distribution vector to the feature vector of the original dataset. The results show a small deterioration in the average performance of the final algorithm.
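The relation between the two post hoc procedures can be seen in a short sketch. Both functions take the unadjusted $p$-values of the comparisons against the control method; the function names are illustrative assumptions, and the code follows the textbook definitions of the Holm step-down and Hochberg step-up procedures rather than the specific tool used for Table 6.

```python
def holm(p_values, alpha=0.05):
    """Step-down Holm procedure; returns a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):              # smallest p-value first
        if p_values[i] > alpha / (m - step):
            break                                 # first failure stops the test
        reject[i] = True
    return reject


def hochberg(p_values, alpha=0.05):
    """Step-up Hochberg procedure; returns a reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step in range(m - 1, -1, -1):             # largest p-value first
        if p_values[order[step]] <= alpha / (m - step):
            for i in order[:step + 1]:            # reject it and all smaller ones
                reject[i] = True
            break
    return reject
```

With $p$-values 0.01, 0.04, and 0.03 at alpha = 0.05, Holm rejects only the first hypothesis while Hochberg rejects all three, which illustrates its higher power. With 0.001, 0.02, and 0.2, where the sorted values grow quickly, both procedures reject the same two hypotheses.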

Conclusions
Techniques that use both labeled and unlabeled examples in classification tasks are promising. The shortage of available labeled data impairs the accuracy of the learning process, since supervised learning methods cannot then produce a learner with adequate accuracy. The strategy of combining classifiers using a cascade methodology seems to improve the average behavior of the final ensemble classifier under the self-training scheme. In this work, a self-trained (NB∇C4.5) algorithm has been proposed. We performed a comparison with other well-known semisupervised learning methods on standard benchmark datasets, and the presented technique had better accuracy on most of the tested datasets. Given the encouraging results obtained from these experiments, one can expect that the proposed technique can be applied to real classification tasks, giving slightly better accuracy than the traditional semisupervised approaches.
The fact that there is no previous use of Cascade Classifiers under self-labeled techniques means that a new family of algorithms can be developed for achieving enhanced learning effectiveness. The choice of the classifiers in each step as well as the number of levels ($n$) should be made with respect to the time restrictions that are usually imposed by real-life problems.
In spite of these results, no general method will always work. The main drawback of semisupervised schemes is the time needed in the training phase. Feature selection algorithms, which search for a subset of relevant features by removing the less informative features of the initial set [44], could mitigate this drawback by saving both valuable operation time and computational resources.

Loop for a number of iterations (MaxIter is equal to 40 for our implementation)
    Use NB∇C4.5 classifier to select the examples with Most Confident Predictions per iteration (x_MCP)
    Remove x_MCP from U and add them to L
    (In each iteration a few examples per class are removed from U and added to L)
    Re-train NB∇C4.5 as base model on new enlarged L
Output: Use NB∇C4.5 trained on L to predict class labels of the test cases.
Algorithm 2: Self-trained (NB∇C4.5).

    p_j = probability that x belongs to class_j according to A(D)
    x' = (x, predicted class of x: class_index(max(p_j)))
    D'' = D'' ∪ {x'}
Return D''
end
where:
(i) A is a learning algorithm, k is the number of classes
(ii) D is the dataset from which A builds a model, D' is a dataset to which the model learnt from D may be applied
Algorithm 1: Cascading NB and C4.5 (NB∇C4.5).

Table 5: Friedman ranking of the algorithms in all tested labeled ratios.

Table 6: Rankings of the algorithms in all tested labeled ratios according to Holm/Hochberg (alpha = 0.05).