Self-Trained LMT for Semisupervised Learning

The main asset of semisupervised classification methods is that they combine the available unlabeled data with a clearly smaller set of labeled examples in order to increase classification accuracy over supervised methods, which use only the labeled data during the training phase. Both the absence of automated mechanisms that produce labeled data and the high cost of the human effort needed to label data in several scientific domains raise the need for semisupervised methods that counterbalance this shortage. In this work, a self-trained Logistic Model Trees (LMT) algorithm is presented, which exploits the characteristics of Logistic Model Trees when few labeled data are available. We performed an in-depth comparison with other well-known semisupervised classification methods on standard benchmark datasets and found that the presented technique had better accuracy in most cases.


Introduction
Classification is an integral part of machine learning: the task is to assign each tested pattern or object to one of several distinct categories or classes. The classes vary according to the application domain of each problem. For example, the classes could represent the different origins of the tested speakers in a Speech Identification problem or different objects in pictures with various backgrounds in a Pattern Recognition problem.
The default scenario of classification is the supervised one, in which all the available labeled data are used to build a classification model. Using the information in the labeled data, the trained supervised model assigns a class label to each new instance. Unsupervised techniques can also be used for the same problems. Their main characteristic is that they do not need labeled data [1]. However, the lack of class information degrades the performance of unsupervised algorithms. The most recently proposed family of methods, commonly called semisupervised learning (SSL) algorithms, arises from a direct combination of the two previous strategies [2]. Friedhelm and Edmondo [3] proposed in 2014 a categorization of semisupervised learning algorithms, which they refer to as Partially Supervised Learning (PSL). They also describe the training phase of semisupervised algorithms as weak supervision, since only a part of the whole information is provided. To explain the new issues that have arisen from the PSL task, Friedhelm and Edmondo [3] review the most prominent directions of research related to this domain.
Sun [4] reviews theories that describe the characteristics of multiview learning. Under this concept, any set of features or, more generally, any available information related to the dataset can potentially improve the classification accuracy. Moreover, Triguero et al. [5] made an in-depth study of self-labeled techniques, focused mainly on classification. Based on some specific properties, which seem to be quite representative of and objective for the majority of real applications, they proposed a taxonomy for semisupervised classification (SSC) methods. One of the findings of that work is the shortage of multilearning approaches built on the self-training method.

Computational Intelligence and Neuroscience
In many application domains, labeling the training instances is costly in labor and/or time [6]. The major asset of semisupervised algorithms is that they remove the need to collect and label large amounts of data in fields like text mining, speech recognition, and object detection from images [7], which allows such methods to be applied in a variety of contexts. Moreover, the increased accuracy of these methods, along with the automated learning of most possible patterns from datasets, makes semisupervised techniques a great tool for the machine learning community [8]. With SSC methods, the effort required from human experts to label instances tends to be reduced dramatically, especially in real-life scenarios [9].
In particular, SSC methods require only a small proportion of the data to be labeled in order to accomplish their task. This proportion is widely known as the labeled ratio and is usually given as a percentage:
$$\text{labeled ratio} = \frac{\text{Number of labeled instances}}{\text{Number of all the instances}}. \tag{1}$$
Once the labeled ratio has been chosen, all the available data are split into two different subsets: the labeled set ($L$) and the unlabeled set ($U$). In terms of the instances included in each subset, $\text{Dataset's instances} = L \cup U$ with $L \cap U = \emptyset$.
Tanha et al. [10] suggested that using decision tree classifiers as base classifiers in the self-training algorithm is not very effective for semisupervised learning, mainly because decision tree classifiers produce poor probability estimates for their predictions. However, decision trees are not demanding in training time and produce easily comprehensible models. A series of modifications have been proposed to avoid using the simplistic proportion distribution at the leaves of a pruned decision tree [11]; the Laplacian correction and grafted decision trees are some of them [10]. Torgo [12] also made a thorough study of tree-based regression models, focusing on the generation of tree models and on pruning by tree selection.
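To illustrate why the plain proportion at a leaf can be a poor probability estimate, the following minimal sketch compares it with the Laplace-corrected estimate $(n_c + 1)/(n + C)$, where $n_c$ is the number of leaf instances of class $c$, $n$ is the leaf size, and $C$ is the number of classes. The function names and the leaf counts are hypothetical and serve only as an illustration; they are not taken from [10] or [11].

```python
def leaf_frequency(class_counts):
    """Plain proportion of each class among the instances at a leaf."""
    total = sum(class_counts)
    return [count / total for count in class_counts]


def leaf_laplace(class_counts):
    """Laplace-corrected estimate: (n_c + 1) / (n + C)."""
    total = sum(class_counts)
    num_classes = len(class_counts)
    return [(count + 1) / (total + num_classes) for count in class_counts]


# Hypothetical leaf holding 3 instances of class 0 and none of class 1.
counts = [3, 0]
raw = leaf_frequency(counts)      # the pure leaf claims full certainty
smoothed = leaf_laplace(counts)   # the correction pulls it toward uniform
print(raw, smoothed)
```

With only three instances, the plain proportion reports a probability of 1 for class 0, whereas the corrected estimate is more cautious. This matters for self-training, which selects instances by their predicted probability.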
The aim of our work was to present a self-trained Logistic Model Tree (LMT) algorithm and compare it with other well-known semisupervised classification methods on standard benchmark datasets. To this end, we performed statistical comparisons of the proposed method with other algorithms and provide an illustrative visualization that records the behavior of each algorithm against the others. Our proposed technique achieved higher accuracy in most cases and better overall performance in different scenarios, which makes it a robust tool.
In Section 2, a brief description of semisupervised classification techniques is provided. In Section 3, the proposed algorithm is presented. Section 4 reports the results of the comparison of the proposed algorithm with other well-known semisupervised classification methods on standard benchmark datasets. Finally, some concluding remarks and directions for future research are presented in Section 5.

Semisupervised Techniques
Self-training is usually described as a wrapper method and is a great tool for semisupervised learning tasks. It is a simple scheme based on four stages [7]. In the first, a classifier of our choice is trained with a small amount of labeled data, chosen randomly from the initial dataset. In the second, the unlabeled instances are classified. In the third, the predictions are assessed: each instance whose predicted class probability exceeds a defined threshold is considered reliable enough to be added to the training set for the following training phases. Finally, these instances are added to the initial training set, which increases its robustness. Together, these phases constitute one complete step of the algorithm. The classifier is retrained on the enlarged training set until the stopping criteria are satisfied. Self-training has been proven to perform with great success in many real-life scenarios, even though misclassified instances can occur due to the lack of specific assumptions. An important reason why the performance of PSL techniques may fluctuate relative to that of supervised algorithms is that, during the training phase of the former, some of the unlabeled examples are never labeled because the algorithm terminates first [3]. As a result, part of the information provided by the dataset is not exploited under this scheme.
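The four stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the base learner is a toy nearest-centroid classifier (a stand-in for any classifier that outputs class probabilities), and the names `CentroidClassifier` and `self_train`, the default threshold, `max_iter`, and the data are all hypothetical.

```python
import math


class CentroidClassifier:
    """Toy stand-in base learner: softmax over negative squared
    distances to the class centroids (an assumption, not LMT)."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = {}
        for c in self.classes_:
            rows = [x for x, label in zip(X, y) if label == c]
            self.centroids_[c] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, x):
        scores = [-sum((a - b) ** 2 for a, b in zip(x, self.centroids_[c]))
                  for c in self.classes_]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]


def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, max_iter=10):
    X_l, y_l = list(X_labeled), list(y_labeled)
    pool = list(X_unlabeled)
    # Stage 1: train on the small labeled set.
    model = CentroidClassifier().fit(X_l, y_l)
    for _ in range(max_iter):
        confident, remaining = [], []
        # Stages 2 and 3: classify the unlabeled pool and assess confidence.
        for x in pool:
            proba = model.predict_proba(x)
            best = max(range(len(proba)), key=proba.__getitem__)
            if proba[best] > threshold:
                confident.append((x, model.classes_[best]))
            else:
                remaining.append(x)
        if not confident:  # stopping criterion: nothing reliable enough
            break
        # Stage 4: enlarge the training set, then retrain.
        for x, label in confident:
            X_l.append(x)
            y_l.append(label)
        pool = remaining
        model = CentroidClassifier().fit(X_l, y_l)
    return model, X_l, y_l, pool


# Hypothetical one-dimensional data: two labeled points, three unlabeled.
model, X_l, y_l, pool = self_train([[0.0], [4.0]], [0, 1],
                                   [[0.5], [3.5], [2.0]])
print(y_l, pool)
```

In this toy run the two unlabeled points near the labeled ones are absorbed, while the point halfway between the classes never exceeds the threshold and stays unlabeled. This is the situation noted above, in which part of the unlabeled data is not exploited.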
Self-Training with Editing (SETRED) is a modified approach to self-training proposed by Li and Zhou [13]. Its principal improvement over the basic self-training scheme is the way it handles misclassified examples, which come from the unlabeled set and may be incorrectly merged with the original training set, thereby degrading the performance of the algorithm. To reduce such cases, they build a neighborhood graph in the $d$-dimensional feature space, where $d$ is the dimension of the feature vector ($d \times 1$). Using a hypothesis test, they discard any example for which the outcome of the test is negative.
Cotraining is an equally important scheme that can be considered a variant of the self-training technique [14]. Its main idea is that the feature space can be exploited in a way other than combining all its elements. Under this assumption, which is in line with multiview learning, the cotraining algorithm assumes that dividing the feature space into two separate categories makes the prediction of the unlabeled instances more effective [15]. This assumption is more realistic when the newly formed categories represent different views of the dataset. Since the cotraining algorithm belongs to the family of self-training schemes, its algorithmic phases are similar to those described above, with the restriction that two independent feature vectors must exist for each instance. Didaci et al. [16] examined the relation between the performance of cotraining and the size of the labeled training set, and their results showed that high performance was achieved even when the algorithm was provided with very few instances per class. However, Du et al. [17], based on an adequate number of experiments, concluded that relying on small labeled training sets cannot ensure that the multiview assumptions hold. Several approaches have been proposed to prevent misclassified instances from being inserted into the training set at the end of each iteration. Sun and Jin [18] filtered the predictions of the cotraining classifiers with Canonical Correlation Analysis (CCA) [4]. By applying CCA on paired datasets, the similarities between the unlabeled examples of the test set and the initial training set were calculated effectively, and only those instances that satisfied CCA's restrictions were inserted into the initial training set.
Wang et al. [19] proposed the use of a distance metric that examines the class membership probabilities of labeled and unlabeled examples. If two examples have the same class probability value, the metric favors the example with the smaller distance, so that it is selected with higher probability. Another technique for separating the predictions of a semisupervised scheme more accurately is to combine more than one classifier. Jiang et al. [20] introduced a hybrid method that combines the predictions of two different types of classifiers in order to exploit their different characteristics. The first is Naive Bayes (NB), a generative classifier, and the second is the Support Vector Machine (SVM), a discriminative classifier. The final prediction is governed by a parameter that sets the weights of the two classifiers. A review of other similar hybrid methods is also presented in [20]. Moreover, Li and Zhou [6] suggested the Co-Forest algorithm, in which a number of Random Trees are trained on bootstrap data from the dataset. As an ensemble method, its behavior is robust even when the number of available labeled examples is reduced. The principal idea of this algorithm is the assignment of a few unlabeled examples to each Random Tree during the training period, and the final decision is produced by majority voting. An extension of this algorithm is ADE-Co-Forest, which uses a data editing technique to find and reject possibly problematic instances at the end of each iteration [21]. Within the same framework, cotraining by committee has been proposed by Hady and Schwenker [22]. A starting committee was built from the labeled instances of the dataset. The ensemble methods used under this semisupervised scheme were named CoBag (Bagging), CoAdaBoost (AdaBoost), and CoRSM (random subspace).
RASCO [23] does not use any specific criterion for splitting the feature vectors; instead, it splits them randomly in order to train different learners. Following this strategy, the unlabeled data are labeled and added to the training set based on the combined decisions of the learners trained on the different attribute splits. Instead of random feature subspaces, the Rel-RASCO [24] algorithm generates relevant random subspaces using relevance scores of the features, which are obtained from the mutual information between features and class.
The tri-training scheme uses three classifiers, each trained on a different bootstrap sample of the same dataset, to label each unlabeled instance. If two of the three classifiers agree on the categorization of an instance, the instance is considered labeled and is added to the training set [25]. An improved approach is the improved tri-training algorithm (imtri-training) [26], which eliminates some drawbacks of the original model, such as unsuitable error estimation, an excessively confined restriction, and the lack of weights for labeled and unlabeled examples. The idea of ensemble methods and majority voting has also been endorsed by Zhou and Goldman [27], who proposed democratic colearning. One really interesting asset of this algorithm is that it enlarges the training set of the classifier whose prediction differed from the final one after the voting phase. Sun and Zhang [28] suggested training an ensemble of classifiers from multiple views. Subsequently, only the instances whose classification stems from the consensus prediction of multiple classifiers are selected as the most confident, in order to teach the ensemble of the other view.
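The two-out-of-three agreement rule of tri-training, as summarized above, can be sketched as follows. This is a simplified illustration: the predictions are hard-coded hypothetical values rather than outputs of trained classifiers, the function names are assumptions, and any further conditions of the published algorithm [25] are omitted.

```python
from collections import Counter


def agreed_label(votes):
    """Return the label chosen by at least two of the three classifiers,
    or None when all three disagree."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None


def label_by_agreement(per_classifier_predictions):
    """Take three lists of predicted labels (one list per classifier, same
    instance order) and return (instance index, label) pairs for the
    instances on which at least two classifiers agree."""
    newly_labeled = []
    for index, votes in enumerate(zip(*per_classifier_predictions)):
        label = agreed_label(votes)
        if label is not None:
            newly_labeled.append((index, label))
    return newly_labeled


# Hypothetical predictions of three classifiers on three unlabeled instances.
predictions = [
    ["a", "b", "a"],  # classifier 1
    ["a", "c", "b"],  # classifier 2
    ["b", "a", "c"],  # classifier 3
]
print(label_by_agreement(predictions))
```

Only the first instance is labeled here, because it is the only one on which two classifiers agree.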
Huang et al. [29] proposed a classification method based on Local Cluster Centers (CLCC). This algorithm tries to resolve problems that occur when the provided datasets contain only a few labeled training data, and it handles situations in which the labeling process may lead to misclassified instances. Another algorithm that uses the self-training scheme is the aggregation pheromone density based semisupervised classification (APSSC) algorithm [30]. As its name suggests, it uses the aggregation pheromone property found in the natural behavior of real ants. It performed well and offered promising results for real-world classification problems. A combination of classifiers under the self-training scheme has been proposed by Wang et al. [31]. Their learning approach is named Self-Training Nearest Neighbor Rule using Cut Edges (SNNRCE), and its main advantage is that it uses graph-based methods to prevent problematic examples from being added to the initial labeled set in each iteration.

Proposed Algorithm
Our proposed algorithm combines the self-training scheme with the Logistic Model Tree (LMT) algorithm. An LMT is a decision tree that has logistic regression models at its leaves, which yields a piecewise logistic model [34]. As in ordinary decision trees, a test on one of the features is associated with every inner node. For a nominal feature with $k$ values, the node has $k$ child nodes, and examples are sorted down one of the branches depending on their value of that feature. For numerical features, the node has two child nodes and the test compares the feature value with a threshold. The LogitBoost algorithm is used to produce a logistic regression model at every node in the tree [35]. Because the subsets encountered at lower levels of the tree become smaller and smaller, it can be preferable at some point to build a linear logistic model instead of calling the tree-growing procedure recursively. There is strong evidence that building trees for very small datasets is usually not a good idea; it is better to use simpler models (like logistic regression) [36].

Algorithm 2: LMT classifier.
Definitions:
root: root of the decision tree
minNumInst: minimum number of instances at which a node is considered for splitting
numBoostIter: fixed number of iterations of LogitBoost [32]
CART: pruning algorithm [33]
Steps:
(1) Build a logistic model at the root
(2) Split the data at the root according to the splitting criterion
(3) Terminate the splitting phase when any of the stopping criteria (minNumInst or numBoostIter) is met
(4) Prune the tree using the CART-based algorithm

As for simple decision trees, pruning is an essential part of the LMT algorithm. For LMT, sometimes a single leaf (a tree pruned back to the root) leads to the best generalization performance, which is seldom the case for simple decision trees [11]. Decision trees can generate estimates for the class membership probabilities: the probability for a particular class is just the fraction of the instances in the region which are labeled with that class. In terms of probability estimates, LMT outperforms all other simple decision trees and related algorithms included in the experiments [34]. In this work, we propose a self-training method that uses the power of LMT for semisupervised tasks. The proposed algorithm (self-trained LMT) is presented in Algorithm 1. The self-training process produces good results by using the more accurate class probabilities of the LMT model for the unlabeled instances. When fitting the logistic regression functions at a node, LMT has to determine the number of LogitBoost iterations to run. Originally, this number was cross-validated at every node in the tree [34]. To save time, our implementation uses a heuristic that cross-validates the number only once and then uses it at every node in the tree. A similar process was used in [37].
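To make the structure described above concrete (feature tests at the inner nodes, logistic models at the leaves), the following minimal sketch routes an instance down a tiny hand-built tree and applies the leaf's logistic model to obtain class probabilities. It is only an illustration under stated assumptions: the classes `Split` and `Leaf`, the tree, and all coefficients are hypothetical and hand-set rather than learned with LogitBoost, and only numeric (binary) splits are shown.

```python
import math
from dataclasses import dataclass


@dataclass
class Leaf:
    # One (bias, coefficients) pair per class: a linear score for each class.
    models: list


@dataclass
class Split:
    feature: int      # index of the numeric feature tested at this node
    threshold: float  # instances with x[feature] <= threshold go left
    left: object
    right: object


def predict_proba(node, x):
    """Sort x down the tree, then turn the leaf's linear scores into
    class probabilities with the multiclass logistic (softmax) link."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    scores = [bias + sum(w * v for w, v in zip(coefs, x))
              for bias, coefs in node.models]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical two-class tree with a single numeric split on feature 0.
tree = Split(
    feature=0,
    threshold=1.0,
    left=Leaf(models=[(0.0, [1.0, 0.0]), (0.0, [-1.0, 0.0])]),
    right=Leaf(models=[(0.0, [0.0, 0.0]), (0.0, [0.0, 0.0])]),
)

left_proba = predict_proba(tree, [0.5, 2.0])   # reaches the left leaf
right_proba = predict_proba(tree, [2.0, 0.0])  # reaches the right leaf
print(left_proba, right_proba)
```

Unlike the plain class fractions of an ordinary leaf, the probabilities produced here vary smoothly with the feature values inside each region, which is the property that self-training relies on when it ranks unlabeled instances by confidence.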
The transfer of data points from the unlabeled set $U$ to the labeled set $L$ is based on the estimated class probabilities. If the probability of the most probable class exceeds the predefined threshold, the instance is assigned that label. Our experiments showed that a good choice for the threshold parameter is 0.9, which gave decent results irrespective of the dataset. We noticed that only a small number of instances per class meets this restriction in each iteration.
Algorithm 2 briefly describes the main characteristics of the LMT classifier and focuses on the points that distinguish it from common decision tree algorithms.
For the implementation, we used the open-source environments of Weka [38] and KEEL [5]. In our implementation, minNumInst was set to 15 and numBoostIter was set to 10.
The classification accuracy of each tested algorithm using 10%, 20%, 30%, and 40% as labeled ratio is presented in Tables 1, 2, 3, and 4, respectively. The best accuracy value among the algorithms tested in each experiment is shown in bold. For our experiments, we used 52 datasets and all 22 algorithms, including Self-LMT. The full tables of comparisons can be found at http://www.math.upatras.gr/ ∼sotos/Self LMT Results.xlsx.
Here, we present only the best 10 of these algorithms, according to their classification accuracy. A short comment follows each experiment about the general behavior of the proposed algorithm in comparison with the most effective of the rest. We also provide a more representative visualization of the average accuracy of the proposed algorithm in comparison with the remaining 21 algorithms in Figure 1. In this figure, each ratio of labeled instances is mapped to a different color on a radar plot.
In the first experiment (10% labeled ratio), self-trained LMT and Co-Forest each achieved 8 wins out of 52 datasets, followed by self-training (C45), cotraining (C45), and APSSC with 5 victories. Despite the low labeled ratio, self-trained LMT managed to achieve the best average accuracy, which confirms its robust behavior.
In the experiment with a 20% labeled ratio, the self-trained LMT algorithm achieved 15 wins, while the next in rank were the Co-Forest algorithm with 5 and cotraining (SMO) with 4.
Similarly, in the experiment with a 30% labeled ratio, self-trained LMT achieved 17 wins out of 52 datasets, while cotraining (SMO) and Rel-Rasco (NB) achieved 7 and 6 best accuracy values, respectively.
Finally, in the experiment with a 40% labeled ratio, the self-trained LMT algorithm outperformed the rest of the algorithms, scoring the best accuracy value in 19 datasets, while democratic-co achieved 5 wins.
An interesting point that emerges from Figure 1 is that an increase in the labeled ratio does not necessarily mean that the average accuracy of every algorithm will also improve. Cobagging (C45) illustrates this phenomenon, since its accuracy was lower with a 40% labeled ratio than with a 30% labeled ratio. Furthermore, many other algorithms, such as Rel-Rasco (NB), APSSC, and de-tri-training (SMO), did not achieve a noteworthy improvement between the 30% and 40% labeled ratios. Consequently, by plotting the average accuracy of the tested algorithms on radar plots like the one in Figure 1, we can extract useful information for comparing any subset of these algorithms with respect not only to their accuracy but also to their response to an increase in the labeled ratio, and to spot any saturation phenomena.
In order to compare the proposed algorithm with all the other algorithms considered in the study for all the different labeled ratios, the results of the Friedman test together with a post hoc statistical test described in [45] are presented in Tables 5, 6, 7, and 8. According to these tests, the proposed algorithm gives statistically better results than all the other tested algorithms. This is due to its better probability-based ranking and higher classification accuracy, which allow the selection of high-confidence predictions in the selection step of self-training.
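For readers unfamiliar with the Friedman test, the following minimal sketch shows how its ranks and statistic are computed from a table of accuracies, using $\chi^2_F = \frac{12N}{k(k+1)}\left[\sum_j R_j^2 - \frac{k(k+1)^2}{4}\right]$, where $N$ is the number of datasets, $k$ the number of algorithms, and $R_j$ the average rank of algorithm $j$. The accuracy values below are invented purely for illustration and are not taken from the tables of this paper; the function names are assumptions, and the post hoc test of [45] is not covered.

```python
def average_ranks(accuracy_table):
    """Rows are datasets, columns are algorithms. Rank 1 goes to the
    highest accuracy in a row; tied values share the mean of their ranks."""
    num_algorithms = len(accuracy_table[0])
    rank_sums = [0.0] * num_algorithms
    for row in accuracy_table:
        order = sorted(range(num_algorithms), key=lambda j: -row[j])
        position = 0
        while position < num_algorithms:
            end = position
            while (end + 1 < num_algorithms
                   and row[order[end + 1]] == row[order[position]]):
                end += 1
            shared_rank = (position + end) / 2 + 1  # mean of 1-based ranks
            for j in order[position:end + 1]:
                rank_sums[j] += shared_rank
            position = end + 1
    return [total / len(accuracy_table) for total in rank_sums]


def friedman_statistic(accuracy_table):
    """Friedman chi-square statistic (without tie correction)."""
    n = len(accuracy_table)
    k = len(accuracy_table[0])
    ranks = average_ranks(accuracy_table)
    return 12 * n / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4
    )


# Invented accuracies: 4 datasets (rows) x 3 algorithms (columns).
toy_table = [
    [0.90, 0.85, 0.80],
    [0.88, 0.80, 0.82],
    [0.75, 0.70, 0.65],
    [0.92, 0.90, 0.91],
]
print(average_ranks(toy_table))       # [1.0, 2.5, 2.5]
print(friedman_statistic(toy_table))  # 6.0
```

In this toy table the first algorithm is ranked first on every dataset, so it receives the lowest average rank. The statistic is then compared with a chi-square distribution with $k - 1$ degrees of freedom to decide whether the differences between the algorithms are significant.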

Conclusions
It is promising to implement techniques that use both labeled and unlabeled instances in classification tasks. The limited availability of labeled instances makes the learning process difficult, as supervised learning methods cannot produce a learner with adequate accuracy. LMT produces a single tree containing binary splits on numeric features, multiway splits on categorical ones, and logistic regression models at the leaves, and the algorithm ensures that only relevant features are included in the latter. The produced classifier is not as easy to interpret as a standard decision tree, but it is much more legible than an ensemble of classifiers or kernel-based estimators.
In this work, a self-trained LMT algorithm has been proposed. We performed a comparison with other well-known semisupervised learning methods on standard benchmark datasets, and the presented technique had better accuracy in most of the tested datasets. Given the encouraging results obtained from these experiments, one can expect that the proposed technique can be applied to real classification tasks, giving slightly better accuracy than traditional semisupervised approaches.
In spite of these results, no single method works best in every case. The main drawback of semisupervised schemes is the time needed in the training phase. Feature selection algorithms, which search for a subset of relevant features by removing the less informative of the initial features [46], could mitigate this drawback by saving both operation time and computational resources. Building Logistic Model Trees with the LMT algorithm is orders of magnitude slower than simple tree induction or using model trees for classification. Improving the computational efficiency of the method using feature selection could be an interesting field for further research.