Semi-supervised Predictive Clustering Trees for (Hierarchical) Multi-label Classification



Introduction
Over the past decade, there has been growing interest in machine learning methods that can use both labeled and unlabeled data for learning a classification model. This interest is motivated by two important factors: (i) the high cost of assigning labels for large datasets and domains where labeling requires complex procedures and/or tedious manual effort and (ii) the opportunity to achieve greater predictive performance by better estimation of the distribution of data in the descriptive space, given the large amount of freely available unlabeled data. While the former factor is only of practical relevance, the latter stems from the theoretical observation that the underlying marginal data distribution p(X) over the descriptive space X might contain information about the posterior distribution p(Y | X) for the prediction of the Y values in the target space. The machine learning setting that takes into account both motivating factors is semi-supervised learning (SSL) [1]. It accommodates the second factor by leveraging three (not independent) theoretical assumptions [2]: the smoothness assumption (if two samples x and x′ are close in the input space, their labels y and y′ should be the same); the low-density separation assumption (the decision boundary should not cut through high-density areas of the input space); and the manifold assumption (data points on the same low-dimensional manifold should have the same label).
Nowadays, many semi-supervised learning approaches are available that tackle classification in multiple domains, including object recognition in images [3], human speech recognition [4], protein 3D structure prediction [5], IoT data analysis [6], and spam filtering [7]. However, only a few approaches are suited for the more complex tasks of multi-label classification (MLC) or hierarchical multi-label classification (HMLC), even though many applications (including the ones listed above) have an inherent complexity suitable for MLC and HMLC. Multi-label classification is a predictive machine learning task where the examples can be labeled with more than one of the labels from a predefined set of labels C. In this case, the output variable y takes as its value a subset of the label set C (i.e., y ∈ 2^C). Hierarchical multi-label classification is a particular case of MLC, where the output space is structured so that it accommodates dependencies between labels. In particular, labels are organized in a hierarchy: an example labeled with label c is also labeled with all parent/superlabels of c. MLC and HMLC problems are encountered in various domains, such as text categorization, image classification, object/scene classification, gene function prediction, and prediction of compound toxicity [8]. A common property of MLC and HMLC domains is that obtaining labeled examples is harder and more expensive than in the classical (i.e., single-label) classification context. This contributes greatly to the need for developing SSL methods tailored for the MLC and HMLC tasks.
In the literature, only a few existing approaches tackle the problem of semi-supervised multi-label classification and hierarchical multi-label classification. Some examples include the work presented in [9-11] for the task of SSL MLC and that presented in [12] for SSL HMLC. However, all these methods adopt generative or optimization-based approaches, yielding complex and time-demanding learning processes, which produce noninterpretable models. In contrast, in this paper, we propose an approach to SSL MLC/HMLC based on predictive clustering trees (PCTs) [13]. The advantages of predictive clustering trees are manifold: (1) the learning phase is time efficient; (2) the SSL models are interpretable for both MLC and HMLC tasks; (3) the SSL models can take both quantitative and categorical variables into account; (4) PCTs can be combined into ensembles, such as random forests, to further improve their predictive performance; and (5) the hierarchical structure of tree-based models can naturally model the hierarchical structure of the output space in the HMLC task.
In this paper, we propose a method for MLC and HMLC that works in the SSL setting. It defines a novel algorithm for learning predictive clustering trees by exploiting both the labeled and unlabeled data for MLC and HMLC tasks. In a nutshell, this is achieved by defining new heuristic and prototype functions that take these specifics into account. Moreover, the proposed method has a parameter that balances the contribution of the descriptive and the target/label part of the data (i.e., controlling the degree of supervision in the model learning process). This mechanism safeguards against performance degradation, compared to learning only from the labeled data. Furthermore, we propose learning ensembles of the semi-supervised predictive clustering trees to further boost their predictive performance. The extensive experiments across 24 datasets from a variety of domains reveal that the proposed methods have better predictive performance compared to their supervised counterparts.
One of the explanations of the inner workings of the proposed methods is related to the interaction of semi-supervised learning with the label dependency of MLC and HMLC tasks. More specifically, we investigate whether the smoothness assumption (and, indirectly, since they are not independent, the low-density and the manifold assumptions) holds in the MLC and HMLC contexts. Intuitively, better identification of the distribution of examples in the descriptive space (as performed in the semi-supervised learning setting) can lead to better exploitation of label dependency and/or label correlation in the output space, leading to improved predictive accuracy. To better explain this concept, let us consider the example reported in Figure 1. We can see that unlabeled examples provide useful information to better classify the examples in the classes "a" ("a" vs. not "a," see the vertical dashed lines) and "d" ("d" vs. not "d," see the horizontal dashed lines), especially in low-density regions. From Figure 1(b), we can also see that unlabeled examples reveal a higher correlation between the class labels "a" and "e" than between "a" and other class labels, such as "d." This is because "a" and "e" appear together in a region that is much denser than the region where "a" and "d" appear together. Such information, if exploited by the predictive model, can be used to better classify MLC examples.
In summary, the main contributions of this paper are as follows: (i) Novel semi-supervised methods based on predictive clustering trees and random forest ensembles able to deal with both MLC and HMLC tasks. (ii) A semi-supervised method able to produce interpretable MLC and HMLC models. (iii) A mechanism safeguarding the proposed method from the degradation of predictive performance. (iv) An extensive evaluation and analysis of the proposed method across 12 MLC and 12 HMLC datasets.
The rest of the paper is structured as follows. In Section 2, we briefly describe the work in the literature that is related to the present paper. In Section 4, we describe the proposed solution, while in Section 5, we evaluate its performance on publicly available datasets and discuss the results. Finally, in Section 7, we present the conclusions of this work and outline possible directions for future research.

Related Work and Motivations
SSL MLC is a relatively recent topic in machine learning and data mining. One of the most prominent works in this research area is [9], where the main idea is to combine large-margin MLC with unsupervised subspace learning. This is done by jointly solving two problems: (1) learning a subspace representation of the labeled and unlabeled inputs and (2) learning a large-margin supervised multi-label classifier on the labeled part of the data. The proposed algorithm works in a single optimization step, which results in a process with high time complexity. To alleviate the problem, the authors proposed a learning procedure based on subgradient search and coordinate descent.
In [10], the authors proposed an SSL MLC algorithm based on the optimization problem of estimating label concept compositions (label co-occurrence). Specifically, the algorithm derives a closed-form solution to this optimization problem and then assigns label sets to the unlabeled instances in a transductive setting.

International Journal of Intelligent Systems
In [11], the authors proposed a deep generative model to describe the label generation process for the SSL MLC task. For this purpose, the generative model incorporates latent variables to describe the labeled/unlabeled data as well as the labeling process. A sequential inference model is then used to approximate the model posterior and infer the ground truth labels. The same inference model is then used to predict the labels of unlabeled instances.
More recently, in [14], the authors proposed a Dual Relation Semi-Supervised Multi-Label Learning (DRML) approach, which jointly explores the feature distribution and the label relation. In that work, a dual-classifier domain adaptation strategy is proposed to exploit both the feature distribution and the label relation between examples. The optimization therefore simultaneously takes into account instance-level relations across labeled and unlabeled samples in feature space and the relations across labels. This approach has only been applied to the multi-label image classification task.
In [15], the authors addressed the task of multi-label learning with incomplete labels by combining the label imputation function and the multi-label prediction function in a mutually beneficial manner. Specifically, the proposed method conducts automatic label imputation within a low-rank and sparse matrix recovery framework while simultaneously performing vector-valued multi-label learning and exploiting unlabeled data with vector-valued manifold regularization.
The semi-supervised multi-label learning task has also been investigated in the context of graph-structured data by incorporating the idea of label embedding to capture both network topology and higher-order multi-label correlations [16]. In this work, the label embedding is generated along with the node embedding based on the topological structure to serve as the prototype center for each class. Moreover, the similarity of the label embedding and node embedding is used as a confidence vector to guide the label smoothing process, obtained by margin ranking optimization to learn the second-order relations between labels.
In [17], the authors derived an extension of the Manifold Regularization algorithm to multi-label classification in graph data. They then augmented the algorithm with a weighting strategy to allow differential influence on a model between instances having ground truth vs. induced labels. Therefore, the proposed approach includes three components: the graph construction, the manifold regularization with multiple labels, and the exploitation of a reliance weighting strategy.
All the previously mentioned works, although they tackle the SSL MLC problem, suffer from the common problem of not generating interpretable models. This is not the case with the method proposed in this paper, where the adoption of the PCT framework allows us to produce multi-label decision trees, which are directly interpretable and fast to learn (a preprint of this work has previously been published in [18]). Moreover, contrary to existing approaches, the approach we propose builds models by exploiting clustering. This allows us to take into account the smoothness assumption, both for the descriptive space and for the output space. Finally, the mentioned existing approaches cannot be directly used to impose constraints on the labels and, therefore, cannot be directly used for the more complex task of HMLC.
As for the SSL HMLC task, the existing work in the literature is relatively limited. In [12], the authors extended the RAkEL system, initially developed for (supervised) MLC, to the SSL HMLC task, leading to three new methods, called HMC-SSBR, HMC-SSLP, and HMC-SSRAkEL. RAkEL is an ensemble-based wrapper method for solving MLC tasks using existing algorithms for multiclass classification. The idea is to build the ensemble by providing a small random subset of k labels (organized as a label powerset) to each base model, learned by a multiclass classifier. This approach is also used in HMC-SSBR, HMC-SSLP, and HMC-SSRAkEL, which, therefore, are not based on clustering and cannot directly take into account the smoothness, the low-density, and the manifold assumptions.
In the more general context of semi-supervised structured output prediction, some approaches for multitarget regression also use predictive clustering trees. This is the case of the works in [19,20], where the idea is to learn predictive clustering trees by using both labeled and unlabeled examples. The authors of [19] proposed a semi-supervised multitarget regression method based on the self-training approach with a random forest of predictive clustering trees. In self-training, a model is trained iteratively with its own most reliable predictions. The authors of [20] extended multitarget regression PCTs by adapting the heuristics used for the construction of the trees, in order to consider both labeled and unlabeled examples. Neither method, however, tackles classification tasks.

Background: Predictive Clustering Trees
The predictive clustering trees (PCTs) presented in this paper for MLC and HMLC are inspired by the work in [13]. In that work, the splits in the tree are evaluated by considering both descriptive and target attributes. The semi-supervised decision trees proposed here are similar to the ones in [13], but differ in several respects. First, Blockeel et al. [13] considered unlabeled examples only in tasks with primitive outputs, whereas we designed semi-supervised trees for structured outputs. Second, we established a parameter that allows varying degrees of supervision in the trees (i.e., how much the descriptive attributes influence the evaluation of the splits). In this way, we can build supervised, semi-supervised, or unsupervised trees, as dictated by the demands of the specific dataset we are dealing with.
The PCT framework (PCTs are implemented in the CLUS system [21] available for download at https://github.com/knowledge-technologies/clus) treats a decision tree as a hierarchically organized set of clusters, where the topmost cluster contains all the data. This cluster is recursively divided into smaller clusters as one moves from the root to the leaves, generating PCTs. PCTs represent a generalization of standard decision trees (e.g., C4.5 [22]) where the outputs are more complex structures than in conventional classification and regression tasks. Classical PCTs can predict several types of structured outputs, including nominal/real value tuples, class hierarchies, and short time series [8]. For each type, two functions must be defined: the prototype function and the variance function. The prototype function associates a class label to each leaf in the tree and returns a representative structured value (i.e., a prototype). The variance function evaluates the homogeneity of a set of such structured values and is used to find the best splits while constructing the tree.
In this study, we propose semi-supervised PCTs and ensembles of semi-supervised PCTs for the tasks of MLC and HMLC. Thus, in Sections 3.1 and 3.2, we present supervised PCTs for these tasks in more detail.
To build an ensemble model for predicting structured output, an appropriate type of PCT is used as a base model. For example, to build an ensemble for the HMLC task, PCTs for HMLC are used as base models. An ensemble makes a prediction for a new example by considering the predictions of all the ensemble's base models. For regression tasks, predictions of the base models are averaged, while for classification tasks, various strategies can be used, such as the probability-based majority voting, which we used as suggested in [23]. According to this strategy, each base tree provides the probability of an example belonging to each of the possible classes. The class with the highest sum of probabilities, considering all of the base trees, is predicted.
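As an illustration of probability-based voting, the following minimal Python sketch sums the per-class probabilities returned by each base tree and predicts the class with the highest total. The function name `vote` and the dictionary representation of tree outputs are assumptions made for this example; they are not taken from the CLUS system.

```python
def vote(prob_per_tree):
    """Probability-based voting.

    prob_per_tree: list of dicts (one per base tree) mapping class -> probability.
    Returns the class with the highest summed probability.
    """
    totals = {}
    for probs in prob_per_tree:
        for cls, p in probs.items():
            totals[cls] = totals.get(cls, 0.0) + p
    return max(totals, key=totals.get)


# Hypothetical outputs of three base trees for one example.
trees = [{"a": 0.6, "b": 0.4}, {"a": 0.2, "b": 0.8}, {"a": 0.55, "b": 0.45}]
prediction = vote(trees)
```

In this toy case, two of the three trees favour class a, yet class b has the larger summed probability (1.65 vs. 1.35) and is predicted, which is where this strategy differs from plain majority voting.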
For the MLC task, the variance function of a PCT is computed as the sum of the Gini scores of the target variables:

$$\mathrm{Var}_f(E, Y) = \sum_{i=1}^{T} \mathrm{Gini}(E, Y_i), \tag{1}$$

where $\mathrm{Gini}(E, Y_i)$ is the Gini score of the $i$th target variable $Y_i$ for a set of examples $E$. The Gini score of the $i$th target variable is calculated as follows:

$$\mathrm{Gini}(E, Y_i) = 1 - \sum_{j=1}^{C_i} p_j^2, \tag{2}$$

where $C_i$ is the number of classes for the target variable $Y_i$ (e.g., if $Y_i$ is binary, then $C_i = 2$) and $p_j$ is the a priori empirical probability of a class $c_j$ (i.e., the relative frequency of instances in $E$ that belong to the class $c_j$). The sum of the entropies of class variables can also be used as a variance indicator, i.e., $\mathrm{Var}_f(E, Y) = \sum_{i=1}^{T} \mathrm{Entropy}(E, Y_i)$ (this was considered previously for MLC [24]). The CLUS framework includes other variance functions as well, such as reduced error, gain ratio, and the m-estimate.
The prototype function returns a vector denoting probabilities of an instance belonging to a given class for each target variable. To determine the predicted classes, the user can specify a threshold on probabilities, or the majority class (i.e., the most probable one) for each target can be calculated. In this study, we use the majority class.
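The variance and prototype functions for MLC can be sketched in a few lines of Python. The sketch assumes binary label vectors, takes the variance as the sum of per-label Gini scores (with the Gini score as one minus the sum of squared class frequencies), and interprets the majority class of a binary label as a probability above 0.5. All function names are hypothetical.

```python
from collections import Counter


def gini(values):
    """Gini score of one target variable: 1 - sum of squared class frequencies."""
    n = len(values)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(values).values())


def mlc_variance(label_vectors):
    """Sum of the per-label Gini scores over a list of equal-length 0/1 tuples."""
    return sum(gini(column) for column in zip(*label_vectors))


def mlc_prototype(label_vectors):
    """Per-label relative frequency of 1, read as the probability of each label."""
    n = len(label_vectors)
    return [sum(column) / n for column in zip(*label_vectors)]


def predict_labels(prototype, threshold=0.5):
    """Majority class per label: predict 1 when the probability exceeds the threshold."""
    return [int(p > threshold) for p in prototype]


E = [(1, 0), (1, 1), (0, 1), (1, 1)]
variance = mlc_variance(E)        # 0.375 for each of the two labels
prototype = mlc_prototype(E)      # [0.75, 0.75]
labels = predict_labels(prototype)
```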

PCTs for Hierarchical Multi-Label Classification.
In HMLC, the target space Y is associated to a hierarchy of classes $(C, \le_h)$, where $\forall c_l, c_j \in C$: $c_l \le_h c_j \Leftrightarrow c_l$ is a superclass of $c_j$. The set of labels of example $e_i$ is represented as a binary vector $L_i$, whose $j$th component is 1 if the example is labeled with the class $c_j$, and 0 otherwise. The $j$th component of the arithmetic mean of such vectors contains the relative frequency of examples of the set belonging to class $c_j$. Then, the variance indicator over a set of examples $E$ is calculated as the average squared distance between each vector ($L_i$) and the set's mean class vector ($\bar{L}$):

$$\mathrm{Var}_f(E) = \frac{1}{|E|} \sum_{e_i \in E} d(L_i, \bar{L})^2. \tag{3}$$

In the HMLC context, the similarities at higher levels of the hierarchy are considered to be more important than the similarities at lower levels. The distance measure in the above formula (weighted Euclidean distance) is therefore defined as follows:

$$d(L_1, L_2) = \sqrt{\sum_{l=1}^{|C|} \omega(c_l)\,(L_{1,l} - L_{2,l})^2}, \tag{4}$$

where $L_{i,l}$ is the $l$th component of the class vector $L_i$ of an instance $e_i$, $|C|$ is the number of classes in the hierarchy (i.e., the size of the class vector), and the class weights $\omega(c)$ decrease with the depth of the class in the hierarchy. More precisely, $\omega(c) = \omega_0^{\mathrm{depth}(c)}$, where $\mathrm{depth}(c)$ denotes the depth of the class $c$ in the hierarchy, and $0 < \omega_0 < 1$. Note that class weights can be calculated recursively, i.e., $\omega(c) = \omega_0 \cdot \omega(\mathrm{par}(c))$, where $\mathrm{par}(c)$ denotes the parent of class $c$. In this work, we use $\omega_0 = 0.75$, as recommended in [25].
The definition of ω(c) is general enough to represent classes that are organized as a directed acyclic graph (DAG). Generally, a DAG-like hierarchy can be interpreted in two ways: an example belonging to a class c either (i) belongs to all superclasses of c, or (ii) belongs to one or more superclasses of c. In this work, we consider the former.
The variance indicator for tree-structured hierarchies uses the weighted Euclidean distance between the class vectors (as defined in equation (4)), where the weight of a class changes depending on its level in the hierarchy. Note that in DAG-shaped hierarchies, the classes do not have a unique level number. To resolve this issue, we follow the recommendation in [25]: the weight of a given class is calculated as an average of all the weights according to possible paths from the root to that class.
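The class weights, the weighted Euclidean distance, and the resulting variance can be illustrated with the Python sketch below. It rests on stated assumptions: top-level classes are given depth 1 (the text does not fix the depth of the root), the distance is taken as the square root of the weighted sum of squared component differences, and the DAG weight averages $\omega_0^{\mathrm{depth}}$ over all paths from a top-level class. The function names and the `parents` dictionary are hypothetical.

```python
import math


def path_depths(c, parents):
    """Depth of class c along every path from a top-level class (assumed depth 1)."""
    if not parents[c]:
        return [1]
    return [d + 1 for p in parents[c] for d in path_depths(p, parents)]


def class_weight(c, parents, w0=0.75):
    """Average of w0 ** depth over all paths; equals w0 ** depth(c) in a tree."""
    depths = path_depths(c, parents)
    return sum(w0 ** d for d in depths) / len(depths)


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two class vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))


def hmlc_variance(vectors, weights):
    """Average squared weighted distance of each class vector to the mean vector."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return sum(weighted_distance(v, mean, weights) ** 2 for v in vectors) / n


# Toy DAG: class "x" has two parents, so it is reachable at depth 3 and depth 2.
parents = {"1": [], "2": [], "1.1": ["1"], "x": ["1.1", "2"]}
weights = [class_weight(c, parents) for c in ("1", "1.1")]
variance = hmlc_variance([(1, 0), (0, 0)], weights)
```

Because the weights shrink with depth, a disagreement on class "1" contributes more to the variance than the same disagreement on class "1.1".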
In classification trees, a leaf holds the majority class of its examples, which the tree predicts for examples arriving in that leaf. In the HMLC task, an example can have multiple classes, so the notion of a majority class is not straightforward. The prediction, in this case, is the mean $\bar{L}$ of the class vectors of the examples in the leaf. The $i$th component of the vector $\bar{L}$ can be considered as the probability that an example in the leaf belongs to class $c_i$. The final classification for an example that arrives in the leaf can be made using a threshold $\tau$ for the probabilities; if $\bar{L}_i > \tau$, then class $c_i$ is predicted for the example. When making predictions, the parent-child relationships from the class hierarchy are preserved if the values for the thresholds are defined as follows: $\tau_i \le \tau_j$ whenever $c_i \le_h c_j$ ($c_i$ is an ancestor of $c_j$). The selection of the threshold $\tau$ depends on the use scenario, e.g., trading off higher precision against lower recall. Here, we use a threshold-independent metric based on precision-recall curves to evaluate the predictive performance of the models.
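A small Python sketch of the leaf prototype and thresholded prediction follows. It assumes class vectors that already satisfy the hierarchy constraint and a single shared threshold (the simplest choice that meets the threshold condition); the helper names and the `parent_index` encoding are hypothetical.

```python
def leaf_prototype(vectors):
    """Mean class vector of the examples in a leaf."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict(prototype, thresholds):
    """Predict class i when its estimated probability exceeds its threshold."""
    return [int(p > t) for p, t in zip(prototype, thresholds)]


def respects_hierarchy(prediction, parent_index):
    """True if every predicted class also has its parent predicted."""
    return all(parent is None or prediction[child] <= prediction[parent]
               for child, parent in enumerate(parent_index))


# Classes in order: "1", "1.1" (child of "1"), "2".
parent_index = [None, 0, None]
leaf = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (1, 1, 0)]
prototype = leaf_prototype(leaf)              # [0.75, 0.5, 0.25]
prediction = predict(prototype, [0.4] * 3)    # [1, 1, 0]
consistent = respects_hierarchy(prediction, parent_index)
```

Since every example labeled with a class is also labeled with its superclasses, the mean for a parent is never below the mean for its child, so a threshold that does not increase towards the root cannot predict a child without its parent.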

Semi-Supervised PCT Learning for MLC and HMLC
4.1. Task Definition. Here, we formally define the semi-supervised learning tasks for the types of structured outputs considered in this study: predicting multiple targets and hierarchical multi-label classification.
4.1.1. Semi-Supervised Multi-Label Classification. In MLC, the task is to predict several binary values (i.e., labels) for each example. This is formalized as follows:
Given:
(i) A descriptive space X.
(ii) A target space Y.
(iii) A set of labeled examples E_l, where each example e_i ∈ E_l is described according to both the descriptive space and the target space, and N_l is the number of labeled examples.
(iv) A set of unlabeled examples E_u, where each example e_i ∈ E_u takes its values from the descriptive space only, and N_u denotes the number of unlabeled examples.
(v) A quality metric q, which favours models with high predictive accuracy (or low predictive error).
Find: A function f: X ⟶ Y that maximizes q.

Semi-Supervised Hierarchical Multi-Label Classification. In HMLC, each example can have more than one class (multiple labels), and the classes are organized in a hierarchical structure, i.e., an example belonging to a class also belongs to all its superclasses. This is formalized as follows:
Given:
(i) A descriptive space X.
(ii) A target space defined by a class hierarchy $(C, \le_h)$.
(iii) A set of labeled examples E_l, where each example e_i ∈ E_l is a pair of a tuple x_i from the descriptive space and a set S_i from the target space, and each set satisfies the hierarchy constraint, i.e., $c \in S_i \Rightarrow \forall c' \le_h c: c' \in S_i$.
(iv) A set of unlabeled examples E_u, where each example e_i ∈ E_u takes its values from the descriptive space only, and N_u is the number of unlabeled examples.
(v) A quality metric q, which favours models with high predictive accuracy (or low predictive error).
Find: a function f: X ⟶ 2^C (where 2^C is the power set of C) such that f maximizes q and the predictions made by f satisfy the hierarchy constraint, i.e., $c \in f(x) \Rightarrow \forall c' \le_h c: c' \in f(x)$.

4.2. Tree Learning. The proposed semi-supervised algorithm (see Table 1) is based on an extension of the standard top-down induction of decision trees (TDIDT) algorithm used to build supervised PCTs [26]. The input to the TDIDT algorithm is a set of examples E. The heuristic (h) selects the best test (t*) based on the reduction of the variance resulting from partitioning (P) the examples (BestTest function in Table 1). As the variance reduction is maximized, the homogeneity of the clusters is also maximized. If no suitable test is found, i.e., if none of the candidate tests results in a significant reduction of the variance or if there are fewer examples in a node than the specified limit, then a leaf is created and the prototype of the examples in that leaf is computed.
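The top-down induction loop can be sketched generically in Python, with the variance and prototype functions passed in as parameters. This is a simplified illustration rather than the CLUS implementation: it handles only numeric features with binary threshold tests, and it replaces the significance check on the variance reduction with the weaker requirement that the reduction be positive. The names `best_test` and `induce` and the `(features, target)` pair representation are assumptions.

```python
from collections import Counter


def best_test(examples, variance, min_leaf=2):
    """Return (reduction, feature, threshold) of the best binary test, or None."""
    n, base = len(examples), variance(examples)
    best = None
    for f in range(len(examples[0][0])):
        for t in sorted({x[f] for x, _ in examples})[:-1]:
            left = [e for e in examples if e[0][f] <= t]
            right = [e for e in examples if e[0][f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            # Variance reduction: parent variance minus size-weighted child variance.
            h = base - (len(left) * variance(left) + len(right) * variance(right)) / n
            if h > 0 and (best is None or h > best[0]):
                best = (h, f, t)
    return best


def induce(examples, variance, prototype):
    """Recursively build a tree as nested dicts; leaves store a prototype."""
    test = best_test(examples, variance)
    if test is None:
        return {"leaf": prototype(examples)}
    _, f, t = test
    left = [e for e in examples if e[0][f] <= t]
    right = [e for e in examples if e[0][f] > t]
    return {"feature": f, "threshold": t,
            "left": induce(left, variance, prototype),
            "right": induce(right, variance, prototype)}


def gini(examples):
    n = len(examples)
    return 1.0 - sum((c / n) ** 2 for c in Counter(y for _, y in examples).values())


def majority(examples):
    return Counter(y for _, y in examples).most_common(1)[0][0]


data = [((0.1,), 0), ((0.2,), 0), ((0.8,), 1), ((0.9,), 1)]
tree = induce(data, gini, majority)
```

Passing the variance and prototype as functions mirrors the way the PCT framework is instantiated for different output types: only these two functions change between tasks.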
As input, the SSL-PCT algorithm uses a set of labeled examples (E_l), a set of unlabeled examples (E_u), and a parameter w ∈ [0, 1]. The w parameter is optimized using a procedure that relies on internal cross-validation.
The supervised TDIDT algorithm for PCTs is extended towards semi-supervised learning as follows. First, the input dataset of the SSL algorithm comprises both labeled and unlabeled examples: $E = E_l \cup E_u$, where $E_l$ are labeled examples and $E_u$ are unlabeled examples. Second, the variance function in the SSL algorithm considers both the target and the descriptive attributes in the evaluation of splits. It is calculated as a weighted sum of the variance over the target space Y and the variance over the descriptive space X:

$$\mathrm{Var}_f(E) = w \cdot \mathrm{Var}_f(E, Y) + (1 - w) \cdot \mathrm{Var}_f(E, X), \tag{5}$$

where $w \in [0, 1]$ is the parameter that controls the trade-off between the contribution of the target space and the descriptive space to the variance function. During the learning of semi-supervised PCTs, the w parameter is automatically optimized by an internal cross-validation procedure (OptimizeParamW function in Table 1).
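The role of w can be made concrete with a short Python sketch. It assumes the weighted sum takes the form w times the target variance plus (1 − w) times the descriptive variance, which is consistent with w = 1 being fully supervised. The selection of w is shown only schematically: `cv_error` is a hypothetical callable standing in for the internal cross-validation procedure, whose details are not reproduced here.

```python
def ssl_variance(var_target, var_descriptive, w):
    """Weighted sum of target-space and descriptive-space variance."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * var_target + (1.0 - w) * var_descriptive


def optimize_w(candidates, cv_error):
    """Pick the candidate w with the lowest (hypothetical) cross-validated error."""
    return min(candidates, key=cv_error)


supervised = ssl_variance(0.4, 0.2, w=1.0)      # only the target space counts
unsupervised = ssl_variance(0.4, 0.2, w=0.0)    # only the descriptive space counts
# Toy error curve with its minimum at w = 0.5, for illustration only.
chosen = optimize_w([0.0, 0.25, 0.5, 0.75, 1.0], lambda w: (w - 0.5) ** 2)
```

Including w = 1 among the candidates is what lets the procedure fall back to a purely supervised tree when unlabeled data do not help.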
This extension relies on the semi-supervised cluster assumption [1]: if examples are in the same cluster, then they are likely to be of the same class. We recall that the variance function of supervised PCTs uses only the target attributes (equations (1) and (3)). Consequently, (a) unlabeled examples cannot contribute to the tree construction (since only their descriptive attributes are known), and (b) the clusters produced by supervised PCTs are only homogeneous regarding the class label. Enforcing the similarity of examples in both the descriptive and the target space during the construction of SSL-PCTs results in clusters that are homogeneous regarding both the descriptive and the target space. This allows us to exploit both labeled and unlabeled examples. Finally, following the cluster assumption, labeled and unlabeled examples that end up in the same leaf of the tree are likely to be of the same class.
Parameter w controls the magnitude of the contribution that unlabeled examples have on the learning of semi-supervised PCTs. In other words, parameter w enables learned models to range from fully supervised (w = 1) to completely unsupervised (w = 0). The control of the contribution of unlabeled examples enabled by the w parameter allows us to set the amount of supervision for different datasets appropriately. This aspect is discussed in more detail in the experimental analysis (Section 6.3).
The variance of a set of examples E on the target space Y is calculated differently depending on the type of structured output at hand (equation (6)). Since the descriptive variables can be either numeric or nominal, the variance on the descriptive space of a set of examples E (equation (7)) combines the variance of the numeric attributes and the Gini score of the nominal attributes over the D descriptive attributes, where the variance or the Gini score of the individual descriptive attributes is calculated following equations (8) and (9).
Let N be the number of examples (both labeled and unlabeled), and let $K_i$ be the number of examples with nonmissing values of the $i$th attribute $Y_i$. Then, the variance for the continuous attributes (equation (8)) and the Gini index for the nominal attributes (equation (9)) are calculated over the examples with nonmissing values, where $C_i$ is the number of class values of $Y_i$, and $\hat{p}_j$ is the a priori probability of class value $c_j$, estimated by using only examples for which the value for variable $Y_i$ is known. Note that for the HMLC task, the variance for the output space is calculated only on the labeled data (see equation (6)).
The variances of descriptive and target attributes are normalized, similarly to supervised PCTs, to ensure the equal contribution of attributes to the final variance. Normalization is performed by dividing the variance estimates of individual attributes in equations (8) and (9) (which consider the set of examples in the current node of the tree) by the variance of the corresponding attribute over the entire training set.
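The handling of missing values and the normalization step can be sketched as follows in Python. The sketch is an assumption-laden illustration: it uses the population variance over the known values of a numeric attribute (the exact estimator is not reproduced here), encodes missing values as `None`, and returns `None` when the variance cannot be computed. The function names are hypothetical.

```python
def variance_known(values):
    """Variance over non-missing values; None when fewer than two are known."""
    known = [v for v in values if v is not None]
    k = len(known)
    if k <= 1:
        return None
    mean = sum(known) / k
    return sum((v - mean) ** 2 for v in known) / k


def normalized_variance(node_values, all_values):
    """Node-level variance divided by the variance over the whole training set."""
    node_var = variance_known(node_values)
    total_var = variance_known(all_values)
    if node_var is None or not total_var:
        return None
    return node_var / total_var


training_column = [1.0, 2.0, 3.0, 4.0, None]   # one attribute, entire training set
node_column = [1.0, 2.0, None]                 # the same attribute in one node
ratio = normalized_variance(node_column, training_column)   # 0.25 / 1.25
```

After this normalization, attributes measured on very different scales contribute comparably to the combined variance.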
During the semi-supervised tree construction phase, two extreme cases can occur: (1) only unlabeled examples can end up in a leaf of the tree; therefore, the prototype function cannot be calculated, or (2) variance needs to be calculated for attributes where none of the examples (or only one) have nonmissing values (e.g., $K_i \le 1$ in equation (8)). For the first extreme case, we calculate the prototype function of such a leaf by returning the prototype of its first ancestor node that contains labeled examples. In other words, from a leaf that contains only unlabeled examples, we move up the tree until we encounter a node containing labeled examples and we return the prototype of such a node. The prototype is calculated using only labeled examples as described in Sections 3.1 and 3.2. Nodes with only unlabeled examples are not split further, while in leaf nodes containing labeled examples, we require a minimum of 2 labeled examples. Both criteria can be considered as "stopping criteria" that end the tree construction phase. Note that these criteria are coherent with the stopping criteria implemented in supervised PCTs, where at least two labeled examples in a leaf node are required.
The second extreme case can occur when the examples in a node are split in a way that only unlabeled examples go into a single branch of the tree. In such a case, a split needs to be evaluated with one of the branches containing only missing values for the target attribute(s); therefore, variance for such attributes cannot be calculated. As in the first extreme case, we handle this situation by using the variance of the parent node (for the attributes containing only missing values in the original node). Note that, since we do not split nodes with only unlabeled examples, the parent node is guaranteed to contain labeled examples.
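The fallback for leaves without labeled examples can be sketched in Python as a walk up the tree. The `Node` class and its fields are hypothetical and hold only what the example needs: the label vectors of the labeled examples in a node and a link to its parent.

```python
class Node:
    def __init__(self, labeled, parent=None):
        self.labeled = labeled   # 0/1 label vectors of the labeled examples in the node
        self.parent = parent     # None for the root


def prototype(node):
    """Mean label vector of the nearest node (self or ancestor) with labeled examples."""
    while node is not None and not node.labeled:
        node = node.parent
    if node is None:
        raise ValueError("no labeled examples on the path to the root")
    n = len(node.labeled)
    return [sum(column) / n for column in zip(*node.labeled)]


root = Node([(1, 0), (1, 1)])
unlabeled_leaf = Node([], parent=root)      # received only unlabeled examples
labeled_leaf = Node([(1, 1), (1, 1)], parent=root)
fallback = prototype(unlabeled_leaf)        # taken from the parent
own = prototype(labeled_leaf)               # computed from the leaf itself
```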

Semi-Supervised PCTs with Feature Weighting.
PCTs (and decision trees in general) are robust to irrelevant features since the learning algorithm chooses only the most informative features when building (supervised) trees. Thus, irrelevant features will be ignored. However, in semi-supervised PCTs, this robustness may be compromised, since the evaluation of the splits depends on both target and descriptive attributes. To deal with this issue, we propose feature-weighted SSL-PCTs.
Methods for feature weighting can be used to identify the most informative features by determining an importance score (weight), where a higher score denotes more informative features, while a lower score denotes less informative ones. Feature weighting with importance scores was shown to help the k-nearest neighbors algorithm deal with irrelevant features [27]. Similarly, we adapt the SSL-PCTs and use importance scores to assign weights to features.
More specifically, we use a feature ranking method based on a random forest of PCTs [8] to obtain the importance score $\hat{\sigma}_i$ for each descriptive attribute $X_i$. To calculate feature importance, this method uses the internal out-of-bag (OOB) error as an estimate of the noise in the descriptive space. The rationale is that if noise is introduced to a descriptive variable which is important, then the error of the model will increase (as measured by OOB error estimates).
The feature ranking is performed on the labeled examples $E_l$ prior to building SSL-PCTs or SSL-RFs. The importance scores are then normalized, and the function for the calculation of the variance of the descriptive attributes of SSL-PCTs is adapted to include the normalized feature importance scores $\sigma_i$ as weights of the descriptive attributes $X_i$. This results in irrelevant features contributing less to the variance score. Henceforth, semi-supervised PCTs and random forests with feature weighting are denoted as SSL-PCT-FR and SSL-RF-FR, respectively.
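The effect of feature weighting on the descriptive variance can be illustrated with the Python sketch below. It assumes the weighted variance is the average over the D descriptive attributes of each attribute's normalized variance (or Gini score) multiplied by its normalized importance score; the importance scores are taken as given inputs, and the function name is hypothetical.

```python
def weighted_descriptive_variance(attr_variances, importances):
    """Average of per-attribute variances, each scaled by its importance score."""
    if len(attr_variances) != len(importances):
        raise ValueError("one importance score is needed per attribute")
    d = len(attr_variances)
    return sum(s * v for s, v in zip(importances, attr_variances)) / d


# Two attributes with similar spread; the second is assumed to be irrelevant.
variances = [0.8, 0.9]
unweighted = weighted_descriptive_variance(variances, [1.0, 1.0])   # 0.85
weighted = weighted_descriptive_variance(variances, [1.0, 0.0])     # 0.4
```

With an importance score of zero, the second attribute no longer influences the evaluation of splits, which is the intended safeguard against irrelevant features.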

4.3. Semi-Supervised Random Forests. SSL-PCTs can be easily extended to their random forest version [28]. This is done by using SSL-PCTs as the members of the random forest ensemble, instead of using classical supervised trees. The notable difference is, however, in the presence of both labeled and unlabeled examples in the bootstrap samples, which does not conform with the classical random forest algorithm [28]. That is, the trees can be built with only [...]

[...] Second, SSL-PCTs use both D descriptive variables and T target variables to determine the best split; therefore, this step has the complexity of O((T + D)ND). Therefore, the computational complexity of building an SSL-PCT tree is [...]. The computational complexity of random forests of semi-supervised PCTs is bounded by [...], where N′ is the number of bootstrap samples, D′ is the number of features considered at each tree node, and k is the number of trees. The added computational complexity of feature ranking is that of randomly permuting the values of the out-of-bag samples (N″ = N − N′) and sorting the samples through the tree. Both operations are done for each descriptive attribute and their cost is O(DN″ + D log N). This added computational cost is, however, negligible compared to the overall cost of building the random forest ensemble. Note that the number of examples used in feature ranking is that of E_l, because the feature weights are calculated considering only the labeled examples.

Experimental Design
In this section, we first describe the datasets used in the experimental evaluation. Next, we present the evaluation procedure, the specific parameter settings of the algorithms, and the performance measures.

Data Description.
To evaluate the proposed methods, we use 24 datasets of the two structured output prediction tasks considered: MLC and HMLC. The datasets are from various domains and have different sizes and numbers of descriptive and target variables. The characteristics of the datasets are summarized in Tables 2 and 3 for the MLC and HMLC tasks, respectively.

Experimental Setup.
We introduce semi-supervised PCTs (SSL-PCT) and their feature-weighted variant (SSL-PCT-FR). We compare these methods across different structured output prediction tasks with supervised PCT algorithms for MLC and HMLC, denoted as SL-PCT, in order to estimate the contribution of unlabeled data to the predictive performance of the methods under the same conditions. Through this comparison, we can answer our main question: Are SSL-PCTs able to outperform supervised PCTs? In the experiments with single trees, we use the pruning procedure as implemented in M5 regression trees [22]. We also compare the predictive performance of semi-supervised random forests (SSL-RF) and their feature-weighted variant (SSL-RF-FR) to supervised random forests for structured output prediction (CLUS-RF). We use 100 unpruned trees to construct random forests. The number of features randomly selected at each node was set to log_2(D) + 1, where D is the total number of features [28].
To assess the influence of different proportions of labeled/unlabeled data on the semi-supervised methods, we vary the number of labeled examples across the following set of values: {50, 100, 200, 350, 500}. The labeled examples are randomly sampled from the training set, while the rest of the examples are used both as unlabeled examples and as testing data. We temporarily ignore their labels and use them in the semi-supervised methods as unlabeled training samples. The test set used to evaluate the models comprises the same examples with their original labels restored. The evaluation scenario is thus that of transductive learning. The supervised methods are trained using the selected labeled samples and evaluated on the same test set as the semi-supervised methods. This is repeated 10 times using different random initializations, and the predictive performances are averaged over the 10 runs.
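The transductive protocol described above can be sketched as follows. This is an illustration only: the function name `transductive_split`, the use of integer indices, and the per-run seeding are assumptions, not details given in the text.

```python
import random


def transductive_split(n_examples, n_labeled, seed):
    """Randomly pick n_labeled examples to keep their labels.

    The remaining examples play two roles: they are given to the
    semi-supervised learner without labels, and they form the test set
    once their original labels are restored.
    """
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    labeled = sorted(indices[:n_labeled])
    unlabeled_and_test = sorted(indices[n_labeled:])
    return labeled, unlabeled_and_test


# Ten repetitions for each amount of labeled data, as in the text;
# the dataset size of 1000 is a made-up value.
splits = {
    n: [transductive_split(1000, n, seed) for seed in range(10)]
    for n in (50, 100, 200, 350, 500)
}
```

Scores obtained on the ten splits for a given amount of labeled data would then be averaged.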
For each of the 10 runs, we optimize the parameter w (weight) by an internal 3-fold cross-validation procedure performed on the labeled portion of the training set. The semi-supervised methods also use the available unlabeled examples. The values of the parameter w vary from 0 to 1 with a step of 0.1.
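A minimal sketch of this selection loop is given below. The callback `evaluate(w, train_idx, valid_idx)` is hypothetical: it stands for training a model with supervision level w on the training folds (plus the unlabeled examples, for semi-supervised methods) and returning a validation score where higher is better. The fold assignment and the tie-breaking rule (the smallest w wins) are also assumptions.

```python
def select_w(labeled_idx, evaluate, k=3):
    """Pick w from {0.0, 0.1, ..., 1.0} by k-fold cross-validation."""
    candidates = [round(0.1 * i, 1) for i in range(11)]
    items = list(labeled_idx)
    folds = [items[i::k] for i in range(k)]  # simple interleaved folds
    best_w, best_score = None, float("-inf")
    for w in candidates:
        fold_scores = []
        for i, valid in enumerate(folds):
            # Train on all folds except the i-th, validate on the i-th.
            train = [x for j, fold in enumerate(folds) if j != i for x in fold]
            fold_scores.append(evaluate(w, train, valid))
        mean_score = sum(fold_scores) / k
        if mean_score > best_score:  # strict: ties keep the smaller w
            best_w, best_score = w, mean_score
    return best_w
```

Because w = 1 is among the candidates, the fully supervised model is always one of the options considered by the search.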
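The algorithms are evaluated by means of the area under the Precision-Recall curve (AUPRC). Since the considered tasks are MLC and HMLC, we use a variant of the AUPRC, namely the area under the micro-averaged (average) Precision-Recall curve, as suggested in [25]. Specifically, the precision and recall values are computed as follows:
$$\overline{\mathrm{Prec}} = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FP_i}, \qquad \overline{\mathrm{Rec}} = \frac{\sum_i TP_i}{\sum_i TP_i + \sum_i FN_i},$$
where i ranges over all the classes, and TP_i, FP_i, and FN_i denote the numbers of true positives, false positives, and false negatives for class i.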
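As an illustration of micro-averaging, the sketch below pools true positives, false positives, and false negatives over all classes at a single decision threshold; sweeping the threshold would trace out the averaged Precision-Recall curve. The function name and the conventions for empty denominators are assumptions made for this example.

```python
def micro_precision_recall(y_true, y_score, threshold):
    """Micro-averaged precision and recall at one threshold.

    y_true:  list of rows, each a list of 0/1 labels (one per class)
    y_score: list of rows, each a list of predicted scores (one per class)
    """
    tp = fp = fn = 0
    for true_row, score_row in zip(y_true, y_score):
        for truth, score in zip(true_row, score_row):
            predicted = score >= threshold
            if predicted and truth:
                tp += 1
            elif predicted and not truth:
                fp += 1
            elif truth:
                fn += 1
    # Assumed conventions: precision is 1.0 when nothing is predicted,
    # recall is 0.0 when there are no positive labels.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Pooling the counts before dividing means that frequent classes influence the result more than rare ones, which is the defining property of micro-averaging.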
We statistically analyze the results following the recommendations of Demšar [45]. We use the nonparametric Wilcoxon paired signed-rank test [46] to compare the predictive performance of two methods over multiple datasets. We set the significance level to 0.05 in all the experiments.
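The rank sums on which the Wilcoxon signed-rank test is based can be computed as in the following sketch. It only illustrates the statistic: zero differences are discarded and tied absolute differences receive average ranks, which are common conventions assumed here, and the P value computation is omitted. The function name `signed_rank_sums` is hypothetical.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (R+, R-): rank sums where method A, respectively B, is better.

    scores_a, scores_b: paired performance values, one pair per dataset.
    """
    # Drop pairs with no difference, then order by absolute difference.
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1, ..., j
        for position in range(i, j):
            ranks[position] = average_rank
        i = j
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus
```

The smaller of the two sums is the test statistic; in practice a statistics library (for instance `scipy.stats.wilcoxon`) would be used to obtain the P value.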
We can clearly observe that semi-supervised PCTs are superior to SL-PCTs on most of the datasets. That is, on 8 out of 12 datasets, either SSL-PCTs or SSL-PCT-FRs (or both) outperform SL-PCTs by a good margin. On the other four datasets, namely, Corel5k, Emotions, Mediana, and SIGMEA real, the performance of semi-supervised PCTs is mostly the same as or similar to that of SL-PCTs.
Intuitively, the improvement of semi-supervised over supervised methods should diminish as the number of labeled examples increases, and eventually, semi-supervised and supervised methods are expected to converge to the same or similar performance.However, the "convergence point" changes from dataset to dataset.For instance, for Genbase, at 500 labeled examples, we already see this convergence.For the other datasets, the improvement of semi-supervised over supervised methods decreases less quickly.
The feature-weighted semi-supervised method (SSL-PCT-FR) and the non-feature-weighted one (SSL-PCT) have similar trends in predictive performance. However, on some datasets, there are notable differences. Namely, on the Birds and Scene datasets, feature weighting is beneficial for the predictive performance of SSL-PCTs and even necessary for improvement over SL-PCTs on the Birds dataset with ≥350 labeled examples. On the other hand, feature weighting clearly damages the predictive performance of the SSL-PCT method on the Bibtex dataset. Thus, feature weighting helps in most cases, but the empirical results cannot support its use by default when building SSL-PCTs for MLC.
We next compare semi-supervised random forests (SSL-RF) with supervised random forests (CLUS-RF). From the results, we can observe that SSL-RF improves over CLUS-RF on several datasets: Bibtex, Corel5k, Genbase, Medical, SIGMEA real, and marginally on the Emotions and Enron datasets. However, compared to single trees, the improvements of the semi-supervised approach over the supervised one are observed on fewer datasets and are smaller in magnitude. In other words, the improvement of SSL-PCTs over SL-PCTs does not guarantee the improvement of SSL-RF over CLUS-RF (e.g., Mediana and Yeast datasets), and vice versa, SSL-RF can improve over CLUS-RF even if SSL-PCTs do not improve over SL-PCTs (e.g., SSL-RF-FR on the Emotions dataset for 200 and 350 labeled examples). As observed for the single trees, there is no clear advantage to using feature weighting when semi-supervised random forests are built, even though it is somewhat helpful on the Emotions and Enron datasets.
Feature-weighted PCTs and RFs could possibly be improved by considering a semi-supervised feature selection [47], instead of the supervised method based on random forests, as used here.

Hierarchical Multi-Label Classification.
Figure 3 presents the learning curves in terms of the predictive performance (AUPRC) of semi-supervised (SSL-PCT, SSL-PCT-FR, SSL-RF, and SSL-RF-FR) and supervised methods (SL-PCT and CLUS-RF) on the 12 hierarchical multi-label classification datasets.
We observe different behaviours on the 6 functional genomics datasets and the 6 datasets from the other domains. That is, on all 6 datasets of the latter group, semi-supervised PCTs improve over SL-PCTs, albeit not necessarily for all available amounts of labeled data. This is the case on the Enron and ImCLEF07A datasets, where both SSL-PCTs and SSL-PCT-FR dominate the performance of SL-PCTs, while it seems that on other datasets, at least 100 (Slovenian rivers and ImCLEF07D) or 500 (Danish farms) labeled examples are needed to improve over SL-PCTs.
On the other hand, semi-supervised PCTs are not so successful on functional genomics datasets. Analysis of the tree sizes (see Section 6.4) reveals an explanation for such results. That is, on all of the 6 functional genomics datasets, and for almost all different amounts of labeled data, both supervised and semi-supervised trees are composed of only one node. Note that these datasets have extremely large label hierarchies which are very sparsely populated. It seems that for such datasets the amount of labeled data we considered (i.e., up to 500 labeled examples) is not sufficient to build trees, neither supervised nor semi-supervised. In fact, semi-supervised trees have more than one node on the Expr-GO (≥350 labeled examples) and Eisen (500 labeled examples) datasets, and those are exactly the occasions where they improve over supervised trees. We thus hypothesize that for larger amounts of labeled data, SSL-PCTs could outperform supervised PCTs also on functional genomics datasets.
In the HMLC task, the feature-weighted semi-supervised method (SSL-PCT-FR) and the non-feature-weighted one (SSL-PCT) mostly have a very similar performance. Again, as for the MLC task, there is no clear benefit of feature weighting.
Finally, the semi-supervised random forests (SSL-RF and SSL-RF-FR) outperform supervised random forests (CLUS-RF) on some datasets, namely, in the initial part of the learning curve for the Enron dataset, and for the Church-GO and Derisi-GO datasets, meaning that unlabeled data improve the predictive performance of random forests of PCTs for HMLC in these cases. On the remaining datasets, it seems that unlabeled data are not beneficial for the performance of random forests of PCTs for HMLC.

Statistical Analysis of Predictive Performance.
The results of the statistical analysis (Table 4) show that SSL-PCTs and SSL-PCT-FR are statistically significantly better than the SL-PCTs for most of the different amounts of labeled data considered, for both structured output prediction tasks. More specifically, for the HMLC task, usually, at least 200 labeled examples are needed to achieve statistical significance. In the MLC task, on the other hand, SSL-PCT achieves statistically significantly better results than SL-PCT up to 200 labeled examples. In this task, the feature-weighted SSL-PCTs are more successful: they statistically significantly outperform SL-PCT across all different amounts of labeled examples.
Considering the feature-weighted and non-feature-weighted semi-supervised methods (both single trees and ensembles), there is no statistically significant difference between them in most cases, except for the HMLC task with 200 labeled examples, where SSL-PCT-FR statistically significantly outperforms SSL-PCT.
As discussed previously, semi-supervised random forests improve over supervised ones in fewer cases as compared to single trees. A statistically significant improvement over CLUS-RF is observed only for the MLC task with 200 labeled examples and the HMLC task with 350 labeled examples. However, in none of the cases did the proposed semi-supervised methods perform statistically significantly worse than their supervised counterparts.
The statistical test is applied to the predictive performances (AUPRC) of the supervised and semi-supervised single trees (SL-PCT, SSL-PCT, and SSL-PCT-FR) on the datasets considered in this study: 12 for multi-label classification and 12 for hierarchical multi-label classification. In bold we report significant P values (<0.05). In a comparison of the two algorithms, i.e., Algorithm 1 vs. Algorithm 2, the "−" sign indicates that the sum of ranks where the first algorithm outperformed the second is higher than the sum of ranks where the second algorithm outperformed the first. The "+" sign indicates the opposite.

Influence of the Amount of Supervision.
As previously mentioned, the amount of supervision in the SSL-PCTs is controlled by the w parameter, where w = 0 results in unsupervised PCTs, 0 < w < 1 in semi-supervised PCTs, and w = 1 in supervised PCTs. This ability to tune the degree of supervision in SSL-PCTs for the predictive problem at hand is of great practical importance. That is, semi-supervised methods can, in general, degrade the performance of their supervised counterparts [48][49][50][51]. In this respect, some studies noted that the success of semi-supervised methods is domain-dependent [52]. How to choose a suitable SSL method for the dataset at hand is an unresolved issue; therefore, even if the primary task of SSL methods is to achieve improved performance in comparison to supervised methods, it is also a high priority to make semi-supervised methods safe, i.e., to make sure that they do not perform worse than their fully supervised counterparts.
In SSL-PCTs, such a safety mechanism is provided by the w parameter. Theoretically, given the optimal value of w, SSL-PCTs and SSL-RF would always perform at least as well as their supervised counterparts both for MLC and HMLC. The reason is that SL-PCTs and CLUS-RF are special cases of SSL-PCT and SSL-RF when w = 1. In practice, however, the w parameter is chosen via internal cross-validation on labeled examples in the training set. Thus, it is possible to select a value of w that is suboptimal for the test set considered.

International Journal of Intelligent Systems
Our empirical evaluation showed that SSL-PCT and SSL-RF rarely degrade the performance of SL-PCT and CLUS-RF (Figures 2 and 3). Across all the experiments we performed, SSL-PCTs outperformed their supervised counterpart (SL-PCT) in 52% of the experiments, performed worse in 9% of the experiments, and performed equally in 39% of the experiments. Moreover, the occasional degradation of the predictive performance was small compared to the improvement of SSL-PCT over SL-PCT. For example, the average relative improvement of SSL-PCTs over SL-PCTs (across all the experiments) was 40%, while the average degradation was 7%.
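Summary figures of this kind can be derived from paired scores as in the sketch below. The exact definition of relative improvement used in the study is not spelled out in the text; the sketch assumes it is (SSL − SL) / SL, averaged over the experiments where SSL wins, and analogously for degradation. The function name and the example scores are made up.

```python
def win_tie_loss_summary(ssl_scores, sl_scores):
    """Compare paired scores (one pair per experiment, higher is better).

    Assumes every supervised score is strictly positive, so that the
    relative differences are well defined.
    """
    gains, drops, ties = [], [], 0
    for ssl, sl in zip(ssl_scores, sl_scores):
        if ssl > sl:
            gains.append((ssl - sl) / sl)
        elif ssl < sl:
            drops.append((sl - ssl) / sl)
        else:
            ties += 1
    n = len(ssl_scores)
    return {
        "win_rate": len(gains) / n,
        "tie_rate": ties / n,
        "loss_rate": len(drops) / n,
        "mean_relative_gain": sum(gains) / len(gains) if gains else 0.0,
        "mean_relative_drop": sum(drops) / len(drops) if drops else 0.0,
    }


# Made-up scores for four experiments.
summary = win_tie_loss_summary([0.6, 0.5, 0.45, 0.5], [0.5, 0.5, 0.5, 0.25])
```

Averaging gains and drops separately, as assumed here, is what allows an "average improvement" and an "average degradation" to be reported side by side.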
Figure 4 clearly shows the role of parameter w on the predictive performance for 4 datasets with different types of structured output. The Emotions dataset (Figure 4(a)) requires no supervision because w = 0 provides better predictive performance of the SSL-PCT method, whereas the Genbase dataset (Figure 4(b)) requires a small amount of supervision (i.e., w close to 0) for SSL-PCT and a high amount of supervision (i.e., w close to 1) for SSL-RF. For the HMLC dataset Danish farms (Figure 4(c)), more supervision (i.e., higher w) provides better predictive performance of SSL-PCT. However, for up to 500 labeled examples, the SSL-PCT method is unable to improve over its supervised counterpart; therefore, w = 1 is selected to prevent performance degradation. For the other HMLC dataset, ImCLEF07A (Figure 4(d)), on the other hand, the performance drops for high levels of supervision (i.e., w > 0.5).
In conclusion, our results show that the optimal value of w depends on the dataset and on the amount of labeled data, as exemplified in Figure 4. This confirms our initial intuition that a different amount of supervision is suitable for different datasets. Therefore, it is difficult to provide a general recommendation for the value of w, and it is advisable to optimize this parameter by internal cross-validation for each dataset, as is done in our study.

Interpretability of the Models.
Interpretability of the predictive models is often a desirable property of machine learning algorithms.Since the models produced by the SSL-PCTs are in the form of a decision tree, they are readily interpretable.To the best of our knowledge, in the literature, no other semi-supervised method for MLC and HMLC produces interpretable models.
The degree of interpretability of the tree-based models is typically expressed in terms of their size. A large tree can be more difficult to interpret, and vice versa, a small tree can be easier to interpret. The tree size is often a trade-off between accuracy and interpretability. Small trees are easy to interpret but, due to their simplicity, may fail to capture interactions in the data and therefore fail to provide a satisfactory accuracy. On the other hand, larger trees may mitigate such issues, but at the cost of lower interpretability. Note that increased size does not necessarily mean improved predictive power of tree models, due to possible overfitting. In general, it is not easy to identify (a priori) the best size of a tree in order to balance between overfitting and underfitting.
In Table 5, we compare tree sizes of supervised and semi-supervised PCTs. We observe that, on average, the semi-supervised trees are somewhat larger than the supervised trees. This is intuitive since semi-supervised algorithms use much more data to grow the trees, i.e., both labeled and unlabeled examples. If we focus on individual datasets, we can observe that the size of both the supervised and semi-supervised trees is mainly in the range of a few tens of nodes. This is still a reasonable size for manual inspection. However, there are a few exceptions. Semi-supervised trees are sometimes much larger than the corresponding supervised trees, reaching a few hundred nodes. In particular, this can be observed in the following datasets: Mediana (≥350 labeled), Danish farms (500 labeled), ImCLEF07A, and ImCLEF07D (≥200 labeled). These cases, generally characterized by a large number of classes, can be infeasible for manual analysis.
To exemplify the interpretability and to highlight the possible differences between SL-PCTs and SSL-PCTs, we provide an example of supervised and semi-supervised predictive clustering trees obtained for the Emotions dataset with 100 labeled examples (Figure 5).
A closer analysis of the results is shown in Figure 6, where it is possible to evaluate the influence of parameter w on the tree size. The analysis reveals that unsupervised trees (w = 0) are much bigger than semi-supervised (0 < w < 1) or supervised (w = 1) trees. Unsupervised trees do not rely on the output space at all; therefore, it is understandable that, in the presence of a very large amount of unlabeled data, big trees are grown. We recall that the w parameter is optimized for predictive performance, but by increasing the value of w (i.e., increasing the degree of supervision), a trade-off between tree size and model performance can be achieved.

Training times.
In Table 6, we present the training times of supervised and semi-supervised algorithms. For simplicity, we present times for experiments with 500 labeled examples, since conclusions for other amounts of labeled data are similar. We can observe that semi-supervised PCTs and random forests can take considerably more time to train than their supervised counterparts. This is expected, since they use more data (i.e., additional unlabeled examples) and they also calculate the heuristic score to determine the best splits across all descriptive and target attributes, as opposed to the supervised algorithms that use only the target attributes. The increased learning time is hence most pronounced on datasets with many attributes, such as the Expr-GO and Enron datasets for the HMLC task and the Bibtex and Medical datasets for the MLC task. Note that in some cases, the learning times of supervised and semi-supervised algorithms are the same. This is because in such cases w = 1 was chosen, i.e., the semi-supervised model is equal to the supervised one. Note that in Table 6, the time used to optimize the w parameter is not included.
The training times are in seconds, obtained for experiments with 500 labeled examples.

6.6. The Influence of Unlabeled Data.
SSL-PCTs differ from supervised PCTs in two aspects: (i) they use both the descriptive attributes and target variables for the candidate split evaluation and (ii) they use unlabeled examples in the training process. We have shown that SSL-PCTs have highly competitive predictive performance with respect to supervised PCTs, but we can still question the source of this improvement. Is this improvement due to the combination of (i) and (ii)? Or is (i) sufficient to yield improvements over supervised PCTs? To answer this question, we compare SSL-PCTs with a supervised modification of PCTs which uses both the descriptive attributes and target variables for split evaluation in the same way as SSL-PCTs, but does not use unlabeled data (henceforth, this variant is denoted as SL-PCT D+T). By using this modification, we can evaluate the effect of the unlabeled examples on the predictive performance, since both SSL-PCTs and SL-PCT D+T are trained using the same algorithm; the only difference is the usage of unlabeled data. In these experiments, we optimize parameter w for SL-PCT D+T via internal 3-fold cross-validation, analogously to SSL-PCTs.
Considering all the datasets and the various percentages of labeled data, the SL-PCT D+T algorithm performs better than SL-PCT in 36% of the cases, the same in 54% of the cases, and worse in 11% of the cases. We recall that the corresponding figures for the SSL-PCT algorithm are 52%, 39%, and 9%. Thus, even without the help of unlabeled data, the SSL-PCTs proposed in this work can improve over SL-PCTs, but they have a better chance to do so if they are supplied with unlabeled data. The following result shows that the unlabeled data are indeed the principal component for the success of the SSL-PCTs: the average relative improvement of SL-PCT D+T over SL-PCT is a mere 4%, while for SSL-PCT, this figure is 40% (considering only the cases where SL-PCT D+T and SSL-PCTs improve over SL-PCTs, respectively). This observation, i.e., the importance of unlabeled data, is in line with the findings of Ženko [53], where a rule learning process that considers both the descriptive and target spaces is adopted. The results reported in [53] show that including the descriptive space in the heuristic was not beneficial for the predictive performance of predictive clustering rules. However, the study was performed in a supervised learning context, i.e., unlabeled examples were not used.
Finally, Figure 7 allows a detailed evaluation of the improvement/degradation of SL-PCT D+T over SL-PCT and of SSL-PCT over SL-PCT. As stated previously, the SSL-PCT method outperforms SL-PCT more often than SL-PCT D+T does (this happens when the points are above the diagonal). Furthermore, SSL-PCT yields much larger improvements over SL-PCT than SL-PCT D+T (for most of the points in the figure, the improvement along the y-axis is much larger than the improvement along the x-axis). However, there is some complementarity between the two methods. That is, SL-PCT D+T sometimes improves over SL-PCT even when this is not the case with SSL-PCT (Figure 7, the values on the positive side of the x-axis, below the dashed line).

Conclusions
In this study, we propose an algorithm for multi-label classification and for hierarchical multi-label classification that works in a semi-supervised learning setting. The method is based on predictive clustering trees and uses both the target and the descriptive space for the evaluation of candidate splits. We conducted an extensive empirical study using 24 datasets and we summarize the main findings as follows:
(i) The proposed semi-supervised predictive clustering trees achieve good predictive performance on both structured output tasks. On many of the datasets considered, their predictive performance was superior to that of supervised predictive clustering trees.
(ii) The control on the amount of supervision to be used when learning the proposed semi-supervised predictive clustering trees makes them safe to use: they do not degrade the performance with respect to their supervised counterparts, i.e., they either outperform them or have the same performance.
(iii) The degree of superiority of semi-supervised over supervised predictive clustering trees does not translate entirely to the tree ensembles, even though semi-supervised random forests often outperform supervised random forests.
(iv) Weighting descriptive attributes by their importance may help the predictive performance of semi-supervised predictive clustering trees in some cases, but the advantages are not great enough to advocate the use of feature weighting by default. Thus, by the principle of Occam's razor, the simpler solution should be preferred, that is, the one without feature weighting.
(v) The semi-supervised trees produce readily interpretable models and are marginally larger than the supervised trees, though the sizes of the trees are reasonable for manual inspection in most cases. Also, this comes with an increase in the computational cost, as evidenced by the theoretical and empirical (runtime) analysis of the computational complexity.
In future work, we intend to extend the proposed semi-supervised (hierarchical) multi-label classification algorithm to the case where examples are not independent and are accommodated in a network data structure. This would allow us to exploit the semi-supervised learning setting in network data, where the smoothness assumption naturally holds.

Figure 1: Semi-supervised learning in multi-label classification. Filled circles represent labeled examples, while empty circles represent unlabeled examples. Letters represent class labels. (a) Labeled examples only. (b) Labeled and unlabeled examples.

3.1. PCTs for Multi-Label Classification. The variance function for learning PCTs for the MLC task computes the average of the Gini indices across all the target variables. For a set of examples E with target space Y, consisting of T nominal target variables Y_1, Y_2, ..., Y_T, the variance function is defined as follows:
$$\mathrm{Var}(E, Y) = \frac{1}{T} \sum_{i=1}^{T} \mathrm{Gini}(E, Y_i)$$

a small set of labeled examples and a large set of unlabeled examples, and thus bootstrap samples may end up containing only unlabeled examples. In order to overcome this problem, in the semi-supervised setting, we perform stratified bootstrap sampling where the proportions of labeled and unlabeled examples are preserved in each bootstrap sample. For example, if the training data contain 10% of labeled and 90% of unlabeled examples, such a ratio is maintained in bootstrap samples. This is achieved by separately sampling labeled and unlabeled examples and later joining them to form a bootstrap sample for the random forest algorithm.

4.4. Computational Complexity. To assess the complexity of the algorithm for learning SSL-PCTs, we first introduce the computational complexity of learning supervised PCTs: sorting of D descriptive variables (O(DN log N)), used to determine the best split for T target variables (O(TDN)), for N labeled training examples (O(N)). If we assume that the expected depth of the tree is O(log N) [29], the computational complexity of building a single PCT is O(DN log² N) + O(TND log N) + O(N log N). Now we discuss the changes introduced in SSL-PCTs. First, the number of training examples N in the semi-supervised case equals the combined number of unlabeled and labeled examples (i.e., N = N_l + N_u, instead of N = N_l).
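The stratified bootstrap sampling used for semi-supervised random forests (labeled and unlabeled examples sampled separately, then joined) can be sketched as follows. The function name and the representation of examples as list items are assumptions made for illustration.

```python
import random


def stratified_bootstrap(labeled, unlabeled, rng):
    """Draw one bootstrap sample that preserves the labeled/unlabeled ratio.

    Each group is sampled with replacement to its own original size,
    so a sample can never consist of unlabeled examples only
    (as long as at least one labeled example exists).
    """
    sample_labeled = [rng.choice(labeled) for _ in labeled]
    sample_unlabeled = [rng.choice(unlabeled) for _ in unlabeled]
    return sample_labeled + sample_unlabeled


# Example with 10% labeled and 90% unlabeled examples (made-up identifiers).
rng = random.Random(42)
labeled_ids = [f"l{i}" for i in range(10)]
unlabeled_ids = [f"u{i}" for i in range(90)]
sample = stratified_bootstrap(labeled_ids, unlabeled_ids, rng)
```

Plain bootstrap sampling over the pooled data would preserve the ratio only in expectation; sampling the two groups separately fixes it exactly in every sample.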

Figure 4: Influence of parameter w on SSL-PCT (red line) and SSL-RF (orange line) methods. The results refer to 4 datasets with different types of structured outputs: (a) Emotions (MLC), (b) Genbase (MLC), (c) Danish farms (HMLC), and (d) ImCLEF07A (HMLC). The w values selected by the internal cross-validation algorithm and used in the experiments are marked with colored dots.

Figure 5: Supervised and semi-supervised predictive clustering trees obtained for the Emotions dataset with 100 labeled examples. (a) Supervised PCT. (b) Semi-supervised PCT.

Figure 6: Average tree size per value of parameter w across all datasets and amounts of labeled data.

Figure 7: The graph depicts the magnitude of improvement in the predictive performance over supervised PCTs enabled by (i) the variance function that considers both the descriptive and target spaces (x-axis) and (ii) unlabeled data and the variance function that considers both the descriptive and target spaces (y-axis). This is measured by the difference in the predictive performance of SL-PCT D+T and SL-PCT (ΔSL-PCT D+T; x-axis) and of SSL-PCT and SL-PCT (ΔSSL-PCT; y-axis). The positive values along the x and y axes denote that SL-PCT D+T or SSL-PCT improves over SL-PCTs, respectively. Clearly, the magnitude of improvement over SL-PCTs along the y-axis is much larger than along the x-axis, showing that the unlabeled examples are crucial for the performance of the SSL-PCTs. Each dot represents the AUPRC of one experiment (one dataset and one percentage of labeled data; all experiments are considered). (a) Multi-label classification. (b) Hierarchical multi-label classification.

Table 1: The proposed algorithm for learning semi-supervised predictive clustering trees.

Table 2: MLC datasets and their characteristics. N is the number of examples, D/C is the number of descriptive variables (nominal/continuous), L is the number of labels, and L_L is the average number of labels per example.

Table 3: HMLC datasets and their characteristics. N is the number of examples, D/C is the number of descriptive variables (nominal/continuous), H is the type of the label hierarchy, |H| is the number of nodes in the hierarchy, H_d is the maximal depth of the hierarchy, and L_L is the average number of labels per example.

Table 4: P values of the Wilcoxon signed-rank test.

Table 5: Model sizes expressed as the number of nodes in trees.

Table 6: Training times for SL-PCTs and SSL-PCTs.