Optimism in Active Learning

Active learning is the problem of interactively constructing the training set used in classification in order to reduce its size. Ideally, it would successively add the instance-label pair that decreases the classification error most. However, the effect of adding a pair is not known in advance. It can still be estimated from the pairs already in the training set. The online minimization of the classification error therefore involves a tradeoff between exploration and exploitation. This is a common problem in machine learning, for which multiarmed bandit methods using the approach of Optimism in the Face of Uncertainty have proven very efficient in recent years. This paper introduces three algorithms for the active learning problem in classification using Optimism in the Face of Uncertainty. Experiments conducted on built-in problems and real world datasets demonstrate that they compare positively to state-of-the-art methods.


Introduction
Traditional classification is a supervised learning framework in which the goal is to find the best mapping between an instance space and a label set. It is based only on the knowledge of a set of instances and their corresponding labels, called the training set. To obtain it, an expert or oracle is required to manually label each of the examples, which is expensive. Indeed, this task is time consuming and may also involve other kinds of resources. The aim of active learning [1] is to reduce the number of requests to the expert without losing performance, which is equivalent to maximizing the performance with a given number of labeled instances. This can be done by dynamically constructing the training set. Each new instance presented to the expert is thus carefully chosen to generate the best gain in performance. The selection is guided by all the previously received labels. This is a sequential decision process [2].
However, the gain in performance due to a particular instance is not known in advance. This is for two reasons: first, the label given by the expert is not known before querying, and second, the true mapping is unknown. Nevertheless, those values can be estimated more and more precisely as the training set grows, because the goal of classification is precisely to get a good estimate of them. Still, low confidence must be placed in the first estimates, while later estimates may be trusted more. An instance may thus be presented to the expert because it is believed to increase the performance of the classifier, resulting in a short term gain, or because it will improve the estimates and help to select better instances in the future, resulting in a long term gain. This is a very common problem in the literature, known as the exploration versus exploitation dilemma. It has been successfully addressed under the multiarmed bandit problem, as introduced in [3] and surveyed in [4]. In this problem, a set of arms (choices) is considered, where each provides a stochastic reward when pulled (selected). The distribution of rewards for an arm is initially unknown. The goal is to define a strategy to successively pull arms, which maximizes the expected reward under a finite budget of pulls. Several methods have been introduced to solve this dilemma. One of them is the Upper Confidence Bound algorithm, introduced in [5]. It uses the approach of Optimism in the Face of Uncertainty, which selects the arm for which the unknown expected reward is possibly the highest. One notable advantage of those algorithms is that they come with finite-sample analysis and theoretical bounds.
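To make the bandit setting described above concrete, the following is a minimal sketch of an optimistic strategy on Bernoulli arms. It uses the textbook UCB1 index (empirical mean plus the bonus sqrt(2 ln t / T_k)); the function name `ucb1`, the arm means, and the budget are assumptions chosen for illustration and are not taken from this paper.

```python
import math
import random

def ucb1(arm_means, budget, seed=0):
    """Pull Bernoulli arms for `budget` rounds with a UCB1-style index.

    arm_means are the hidden success probabilities (illustrative values).
    Returns the number of pulls given to each arm.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    pulls = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using the index
        else:
            # Optimistic index: empirical mean plus an exploration bonus
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / pulls[k]
                + math.sqrt(2.0 * math.log(t) / pulls[k]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls

pulls = ucb1([0.2, 0.5, 0.8], budget=2000)
```

In this sketch, arms that look poor are still pulled occasionally because their bonus grows while they are ignored, which is the exploration side of the dilemma.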
The idea is thus to use Optimism in the Face of Uncertainty for the Active Learning problem in classification. To use this approach, the problem is cast under the multiarmed bandit setting. However, this one deals with a finite number of arms, whereas in classification the instance space may be continuous. In order to adapt it to classification, the instance space is partitioned into several clusters. The goal is thus to find the best mapping between the clusters and the label set, under a finite budget of queries to the expert.
At first, we study the case of independent clusters, where the label given to each cluster only depends on the samples taken in it. We show two algorithms capable of the online allocation of samples among clusters. In this context, we need at least one (or even two) samples in each cluster before one can be favored for selection. Thus, the number of clusters must not be too high. This implies using a coarse partition, which may limit the maximum performance. The choice of this partition is thus a key issue which has no obvious solution.
Allowing the prediction of each cluster to depend on the samples received in others enables us to use a more refined partition. This makes the choice of the partition less critical. We thus study the case of information sharing clusters. The adaptation of the first case to this one goes through the use of a set of coarse partitions combined by using a Committee of Experts approach. We introduce an algorithm that allocates samples in this context. Doing so, the number of clusters is not limited anymore, and increasing it allows us to apply our algorithms on a continuous instance space. Another algorithm is introduced as an extension of the first one using a kernel.
We start with an overview of the existing methods in active learning in Section 2. Then, in Sections 3-5, we describe the algorithms. We handle the cases of independent clusters and information sharing clusters. For each of these problems, we define a new loss function that has to be minimized. We also define the confidence interval used by our optimistic algorithms. In Section 6, we evaluate the performance of the algorithms on both built-in problems and real world datasets.

Related Work
Many algorithms already exist for active learning. A survey of those methods can be found in [6]. Among them, uncertainty sampling [7] uses a probabilistic classifier (it does not truly output a probability but a score on the label) and samples where the label to give is least certain. In binary classification with labels 0 or 1, this is where the score is closest to 0.5. Query by committee [8,9] methods consider the version space, or hypothesis space, as the set of all consistent classifiers (nonnoisy classification) and try to reduce it as fast as possible by sampling the most discriminating instance. They finish when only one classifier is left in the set. Extensions exist for the noisy case, either by requiring more samples before eliminating a hypothesis [10] or by associating a metric to the version space and trying to reduce it [11,12]. Other algorithms exist that use a measure of confidence for the labels currently given, such as entropy [13] or variance [14]. Finally, the expected error reduction [15-18] algorithms come from the fact that the measure of performance is mostly the risk and that it makes more sense to minimize it directly rather than some other indirect criterion. Our work belongs to this last category. Using an optimistic approach enables us to minimize directly the true risk instead of the expected belief about it.
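As a small illustration of the uncertainty sampling rule mentioned above (query where the score is closest to 0.5), here is a sketch in Python. The function name `most_uncertain` and the scores are hypothetical; any classifier producing scores in [0, 1] could supply them.

```python
def most_uncertain(scores):
    """Return the index of the pool instance whose score is closest to 0.5."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))

# Hypothetical classifier scores for four unlabeled instances
pool_scores = [0.05, 0.62, 0.48, 0.91]
query_index = most_uncertain(pool_scores)  # the instance scored 0.48
```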
Other methods also use Optimism in the Face of Uncertainty for active learning. In [19], the method is more related to query by committee since it tries to select the best hypothesis from a set. It thus considers each hypothesis as an arm of a multiarmed bandit and plays them in an optimistic way. In [20], the authors study the problem of estimating uniformly well the mean values of several distributions under a finite budget. This is equivalent to the problem of active learning for regression with an independent discrete instance space. Although this algorithm may still be used on a classification problem, it is not designed for that purpose. Indeed, a good estimate of the mean values leads to a good prediction of the label. However, from the active learning point of view, it will spend effort to be precise on the estimation of the mean value even if this precision is of no use for the decision of the label. Efforts could have been spent to be more certain about the label to give. The importance of having an algorithm specifically adapted to classification is evaluated in Section 6.

Materials and Methods
The classical multiarmed bandit setting deals with a finite number of arms. This is not appropriate for the general classification problem in which the instance space may be continuous. In order to adapt this theory to active learning, we must first study the case of a discrete instance space, which may come from a discretized continuous space or originally discrete data. At first, we study the case of independent clusters, where no knowledge is shared between neighbors. After that, we improve the selection strategy by letting neighboring clusters share information. Finally, by defining clusters that contain only one instance from the pool each, with a good generalization behavior, we are able to apply this theory to continuous data. We may even define the relations between instances externally and use a kernel.
Let us define the following notations. We consider the instance space $\mathcal{X}$ and the label set $\mathcal{Y}$. In binary classification, the label set is composed of two elements; in this work, $\mathcal{Y} = \{0, 1\}$. The oracle is represented by an unknown but fixed distribution $P(y \in \mathcal{Y} \mid x \in \mathcal{X})$. The scenario considered in this work is pool-based sampling [7]. It assumes that there is a large pool of unlabeled instances available from which the selection strategy is able to pick. At each time step $t$, an active learning algorithm selects an instance $x_t \in \mathcal{X}$ from the pool, receives a label $y_t \in \mathcal{Y}$ drawn from the underlying distribution, and adds the pair $(x_t, y_t)$ to the training set. This is repeated up to time $n$. The aim is to define a selection strategy that generates the best performance of the classifier at time $n$.
The performance is measured with the risk, which is the mean error that the classifier would achieve by predicting labels.
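The pool-based scenario described above can be sketched as a simple loop. Everything named here (`pool_based_loop`, the oracle with P(y = 1 | x) = x, the uniformly random `select` placeholder, and the removal of queried instances from the pool) is an assumption made for illustration; the selection strategies of this paper would take the place of `select`.

```python
import random

def pool_based_loop(pool, oracle, select, budget):
    """Query `budget` labels from `pool`, one instance per time step."""
    training_set = []
    unlabeled = list(pool)
    for t in range(1, budget + 1):
        if not unlabeled:
            break
        x = select(unlabeled, training_set)  # selection strategy
        y = oracle(x)                        # label drawn from P(y | x)
        unlabeled.remove(x)                  # assumed: no repeated queries
        training_set.append((x, y))
    return training_set

rng = random.Random(0)
# Hypothetical oracle: P(y = 1 | x) = x on the unit interval
oracle = lambda x: 1 if rng.random() < x else 0
# Placeholder strategy: uniform random selection (passive baseline)
select = lambda unlabeled, training_set: rng.choice(unlabeled)

pool = [i / 100 for i in range(100)]
training_set = pool_based_loop(pool, oracle, select, budget=20)
```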

Partition of the Instance Space.
In this section, we focus on the problem of defining a selection strategy with a discrete instance space. Either the space is already discrete or a continuous space is partitioned into several clusters. The following formulation assumes the latter case; otherwise, the same formulation applies for the discrete case if clusters are replaced with instances. The instance space is thus divided into clusters. The problem is now to choose in which cluster to sample.
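Let us define the partition $\mathcal{N} = \{\mathcal{X}_k\}_{k=1}^{K}$ with the following properties: (i) $\forall k \in \llbracket 1, K \rrbracket : \mathcal{X}_k \neq \emptyset$, no cluster is empty; (ii) $\bigcup_{k=1}^{K} \mathcal{X}_k = \mathcal{X}$, the clusters cover the whole instance space. It is important to note that the partition does not change during the progress of the algorithm.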
Having discretized the instance space, we can now formalize the problem under a $K$-armed bandit setting. Each cluster $\mathcal{X}_k \in \mathcal{N}$ is an arm characterized by a Bernoulli distribution $\nu_k$ with mean value $\mu_k$. Indeed, samples taken in a given cluster can only have a value of 0 or 1. At each round, or time step, $t \geq 1$, an allocation strategy selects an arm $k_t \in \llbracket 1, K \rrbracket$, which corresponds to picking an instance randomly in the cluster $\mathcal{X}_{k_t}$, and receives a sample $y_{k_t,t} \sim \nu_{k_t}$, independently of the past samples. Let $(w_k)_{k \in \llbracket 1, K \rrbracket}$ denote the weight of each cluster, with $\sum_{k=1}^{K} w_k = 1$. For example, in a semisupervised context using pool-based sampling, each weight is proportional to the number of unlabeled data points in each cluster, while, in membership query synthesis, the weights are the sizes or areas of clusters.
Let us define the following notations: $T_{k,t}$ is the number of times arm $k$ has been pulled up to time $t$, and $\hat{\mu}_{k,t}$ is the empirical estimate of the mean $\mu_k$ at time $t$. Under this partition, the mapping of the instance space to the label set is limited to the mapping of clusters to the label set. We thus define the classifier that creates this mapping according to the samples received up to time $t$. In this section, the clusters are assumed to be independent. This means that the label given to a cluster can only depend on samples in this cluster. We use the naive Bayes classifier that gives the label $\hat{y}_{k,t} = [\hat{\mu}_{k,t}]$ to cluster $k$, where $[\cdot]$ is the round operator.
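The per-cluster classifier described above (round the empirical mean of the labels received in the cluster) can be sketched as follows. The function name `cluster_labels` is hypothetical, and two conventions are assumptions of this sketch: a cluster without samples gets the estimate 0.5, and an estimate of exactly 0.5 is rounded to label 1.

```python
def cluster_labels(samples_per_cluster):
    """Label each cluster by rounding its empirical mean.

    samples_per_cluster[k] is the list of 0/1 labels received in cluster k.
    A cluster without samples gets the estimate 0.5 (assumed convention),
    and 0.5 is rounded up to label 1 (also an assumption).
    """
    labels = []
    for samples in samples_per_cluster:
        mean = sum(samples) / len(samples) if samples else 0.5
        labels.append(1 if mean >= 0.5 else 0)
    return labels

# Three clusters: mostly ones, mostly zeros, and not sampled yet
labels = cluster_labels([[1, 1, 0], [0, 0, 0, 1], []])
```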

Full Knowledge Criteria.
The goal is to build an optimistic algorithm for the active learning problem. A common methodology in the Optimism in the Face of Uncertainty paradigm is to characterize first the optimal solution. We thus place ourselves in the Full Knowledge setting. In this setting, we let the allocation strategy depend on the true value of $\mu_k$ for each cluster, and this defines the optimal allocation of the budget $n$. An optimistic algorithm will then estimate those values and allocate samples as close as possible to the optimal allocation. Note that the true values of $\mu_k$ cannot be used by the classifier directly but only by the allocation strategy.
In the following sections, we show two full knowledge criteria: data-dependent and data-independent. In the data-independent case, the optimal allocation does not depend on the samples received so far. It can be related to one-shot active learning, as defined in [18], in which the allocation of the budget is decided before sampling any instances. In the data-dependent case, the label given by the classifier at time $t$ is also considered. This is related to fully sequential active learning, as defined in [18], where the allocation of the budget is updated after each sample. Note that in both cases, the optimistic algorithms built upon those criteria are fully sequential.

Data-Independent Criterion.
In this section, we characterize the optimal allocation of the budget $n$ depending only on the values of $\mu_k$ for each cluster. We want an allocation of the budget that minimizes the true risk of the classifier at time $n$. Here, the risk is based on the binary loss: $\ell(y, \hat{y}) = \mathbb{1}_{y \neq \hat{y}}$. Note that this loss is usually hard to use because of its nonconvex nature. Using the partition $\mathcal{N}$, the true risk of the classifier is the sum of the true risks in each cluster, $R_n = \sum_{k=1}^{K} R_{k,n}$, with $R_{k,n} = w_k \left( (1 - \hat{y}_{k,n}) \mu_k + \hat{y}_{k,n} (1 - \mu_k) \right)$. The risk is the mean number of misclassified instances resulting from a particular prediction of labels.
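As a worked example of the risk under the binary loss, the sketch below computes the weighted probability of misclassification for a given labeling of the clusters. The function name `true_risk` and the numerical values are assumptions for illustration; the only ingredient used is that an instance of cluster k has label 1 with probability equal to the cluster mean.

```python
def true_risk(weights, means, labels):
    """Risk under the binary loss when cluster k is labeled labels[k].

    An instance of cluster k has label 1 with probability means[k], so
    predicting 0 is wrong with probability means[k] and predicting 1 is
    wrong with probability 1 - means[k].
    """
    return sum(
        w * (mu if y == 0 else 1.0 - mu)
        for w, mu, y in zip(weights, means, labels)
    )

weights = [0.5, 0.3, 0.2]   # illustrative cluster weights, summing to 1
means = [0.9, 0.4, 0.5]     # illustrative Bernoulli means
best = true_risk(weights, means, [1, 0, 1])   # labels equal to round(mean)
worst = true_risk(weights, means, [0, 1, 0])  # every label flipped
```

With these values the rounded labels give a risk of 0.27 and the flipped labels give 0.73; the cluster with mean 0.5 contributes the same amount either way.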
The optimal label the algorithm should assign to arm $k$ is $y_k^* = [\mu_k]$, with corresponding risk $R_k^*$. The loss in cluster $k$ is the expected excess of risk over this optimal label, where the expectation is taken over the samples: $L_{k,n} = \mathbb{E}[R_{k,n}] - R_k^* = w_k |2\mu_k - 1| P([\hat{\mu}_{k,n}] \neq [\mu_k])$. The value to be minimized by our allocation of the budget is then the global loss. It is the sum of losses in each cluster: $L_n = \sum_{k=1}^{K} L_{k,n}$. The objective is now to define an allocation of the budget that minimizes this loss. However, in order to invert the loss to retrieve the allocation, as well as to derive the online allocation strategy, the losses in each cluster have to be strictly decreasing with $T_{k,n}$ and convex. This is not the case with these losses. In order to get a more convenient shape, we bound those losses by pseudolosses. The algorithms we build aim to minimize this pseudoloss instead of the loss defined previously. The idea is thus to bound the probability $P([\hat{\mu}_{k,n}] \neq [\mu_k])$. We use the fact that the estimated mean in one subset follows a binomial distribution (labels are either 0 or 1). The bounds obtained this way are very tight and equal to the probability at a countably infinite number of points.
Let $I_{1-\mu_k}(T_{k,n} - \lfloor x \rfloor, \lfloor x \rfloor + 1)$ be the cumulative distribution function, evaluated at $x$, of a binomial distribution of parameters $T_{k,n}$, $\mu_k$. The probability $P([\hat{\mu}_{k,n}] \neq [\mu_k])$ can then be expressed with this function at $x = T_{k,n}/2$. Note that this probability is a step function of $T_{k,n}/2$ and thus is not a strictly decreasing function of $T_{k,n}$. That is not convenient, as we require this condition later. That is why we bound this probability by bounding the truncated value $\lfloor T_{k,n}/2 \rfloor$. We can see that the bound is extremely tight, and its only role is to make the pseudoloss strictly decreasing with $T_{k,n}$ and convex. It still retains as much as possible the shape of the probability.
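To illustrate why the probability of giving the wrong label is not strictly decreasing in the number of samples, the sketch below computes it exactly from the binomial distribution. It sums binomial terms directly rather than using the regularized incomplete beta function, and it assumes that an empirical mean of exactly 0.5 is rounded to 1; the names `binom_cdf` and `prob_wrong_label` are hypothetical.

```python
from math import comb

def binom_cdf(x, n, p):
    """P(S <= x) for S ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(0, x + 1))

def prob_wrong_label(n, mu):
    """P(round(mean of n Bernoulli(mu) samples) != round(mu)), for n >= 1.

    Assumed convention: an empirical mean of exactly 0.5 is rounded to 1.
    """
    # Label 0 is predicted when the number of ones S satisfies 2 * S < n
    p_label0 = binom_cdf((n - 1) // 2, n, mu)
    return p_label0 if mu >= 0.5 else 1.0 - p_label0

# Probability of a wrong label for mu = 0.7 after 1, 2, ..., 6 samples
probs = [prob_wrong_label(n, 0.7) for n in range(1, 7)]
```

Under this tie convention the sequence starts 0.3, 0.09, 0.216, so it goes back up between two and three samples, which is the kind of non-monotone behavior that motivates a smooth bound.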
This defines the theoretical optimal allocation of the budget. Since we do not know the closed form of $(\partial \tilde{L}_{k,n} / \partial T_{k,n})^{-1}$ and since an optimistic algorithm needs an online allocation criterion, we now show an online allocation criterion such that an algorithm sampling at each time $t$ the cluster that maximizes it would result in the optimal allocation of the budget $n$. We have seen here an optimal allocation of the budget that the optimistic algorithm defined in Section 4.3 could try to reach without the knowledge of the $\mu_k$ values. The criterion we derived only depends on the values of the parameters in each cluster and not on the current labels given by the classifier. Considering them would lead to a better allocation, since the allocation in a cluster could stop when the correct label is given.

Data-Dependent Criterion.
In this section, we show a criterion that leads to the optimal allocation of the budget depending not only on the values of $\mu_k$ in each cluster, but also on the current labels given by the classifier.
We define a new global loss that is the current regret of the true risk: $L_t = R_t - R^*$, where $R^*$ is the risk obtained with the optimal labels $[\mu_k]$. The measure of performance is still the expected true risk, but the value to be minimized is preferred to be run-dependent. In order to minimize it, the selection strategy samples the cluster for which the expected decrease of the loss would be maximum. This criterion is thus the finite difference of the loss, $\mathbb{E}_y[L_t - L_{t+1}]$, where $y$ is the label resulting from the sample and the expectation is taken over $y$. However, this is a good strategy only if this criterion is strictly increasing with $T_{k,t}$. We thus study the monotonicity of this criterion. We consider sampling $T^+$ more instances in cluster $k$, with resulting average label $\hat{\mu}^+$. The new label given by the classifier will be $\hat{y}^+ = [(T_{k,t} \hat{\mu}_{k,t} + T^+ \hat{\mu}^+) / (T_{k,t} + T^+)]$, and the expected decrease of the loss after $T^+$ samples follows from injecting this value into the loss. To shorten notations we use $d_{k,t} = 2 T_{k,t} (0.5 - \hat{\mu}_{k,t})$.
We know that $\hat{\mu}^+$ is drawn from a binomial distribution of parameters $\mu_k$ and $T^+$, which gives the probability of each new label. The resulting criterion is not strictly increasing. In order to satisfy this constraint, we define another criterion which is a tight bound of the previous one. We first bound the probabilities appearing in it. The criterion resulting from these bounds is strictly increasing but is not defined for all $T^+$. Indeed, in order to change the value of the label, the estimated mean has to move to the other side of 0.5. This often requires more than one sample (e.g., if we already sampled 10 instances and 8 were labeled 1, we need at least 6 new samples to have a chance to change the label given by the classifier). In order to get a bound defined for $T^+ = 1$ and strictly increasing with $T^+$, we make a linear interpolation between the value at $T^+ = |d_{k,t}|$ and the value at $T^+ = 0$, which is 0.
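The numerical example above (10 samples, 8 of them labeled 1, at least 6 new samples needed) can be checked with a few lines. The helper name `samples_to_reach_half` is hypothetical; it uses the identity |2 T (0.5 - mean)| = |T - 2 S|, where S is the number of ones, so that the computation stays in integer arithmetic.

```python
def samples_to_reach_half(n_samples, n_ones):
    """Minimum number of extra samples, all of the opposite label,
    needed to bring the empirical mean to 0.5.

    Equals |2 * T * (0.5 - mean)| = |T - 2 * S| in integer arithmetic.
    """
    return abs(n_samples - 2 * n_ones)

needed = samples_to_reach_half(10, 8)   # 6, as in the example above
mean_after = 8 / (10 + needed)          # 8 ones out of 16 samples
```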

Computational Intelligence and Neuroscience
We thus define the actual criterion from this interpolation. The online allocation criterion is such that an algorithm sampling at each time $t$ the cluster that maximizes it would result in the optimal allocation of the budget $n$. The criterion defined in this section leads to an optimal allocation of the budget that the optimistic algorithm defined in the next section could try to reach without the knowledge of the $\mu_k$ values. It depends on the value of the parameters in each cluster as well as the current estimate of this parameter by the classifier.

Included Optimism.
In this section we introduce two optimistic algorithms: OALC-DI (Optimistic Active Learning for Classification: Data Independent), which uses the data-independent criterion, and OALC-DD (Optimistic Active Learning for Classification: Data Dependent), which uses the data-dependent criterion for optimal budget allocation defined in the previous sections. Both can be described by the same core algorithm. Neither criterion can be used as currently defined for the active learning problem. Indeed, the value of $\mu_k$ in each cluster is not known in advance; otherwise, the correct label would be known as well. Also, those values cannot simply be replaced by their estimates, since this could lead to clusters being wrongly discarded. This is a case of the exploration/exploitation tradeoff where the uncertainty about the true value of $\mu_k$ in each cluster has to be considered. Therefore, we design an optimistic algorithm that estimates those values and samples as close as possible to the optimal allocation.
Following the Optimism in the Face of Uncertainty approach, it builds a confidence interval on the criterion to be maximized and draws the arm for which the upper bound of this interval is highest. This is equivalent to saying it draws the arm for which the criterion is possibly the highest. As we know the shape of the distribution of the $\hat{\mu}_{k,t}$ values, the confidence interval is a Bayesian Credible Interval [21], which leads to tight bounds. The Bayesian Credible Interval is relative to a probability $\delta$, which allows for controlling the amount of exploration of the algorithm. The core algorithm is presented in Algorithm 1. It takes one parameter $\delta$ and can be derived into two algorithms depending on the criterion used.
Let us show how to build the Bayesian Credible Interval. As each sample is drawn from a Bernoulli distribution, the estimated means follow a binomial distribution. Beta distributions provide a family of conjugate prior probability distributions for binomial distributions. The uniform distribution Beta(1, 1) is taken as the prior probability distribution, because we have no information about the true distribution. Using Bayesian inference, $\mu_k \mid \hat{\mu}_{k,t} \sim \mathrm{Beta}(T_{k,t} \hat{\mu}_{k,t} + 1, T_{k,t} (1 - \hat{\mu}_{k,t}) + 1)$. In the following, the criterion means either the criterion from (16) or the criterion from (27). The belief on $\mu_k$ induces a belief on the criterion, and the upper bound of the Bayesian Credible Interval on the criterion is then computed under this belief. In this section, we have shown two optimistic algorithms that share the same core. The difference lies in the full knowledge criterion used. One depends only on the value of the parameters of the distributions. The other one depends on both the value of the parameters and the current estimates of this parameter by the classifier. Both the resulting algorithms depend only on the estimates of the parameters.
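As an illustration of the Bayesian inference step, the sketch below computes an upper credible bound on the mean of one cluster from the Beta posterior obtained with a uniform Beta(1, 1) prior. It only bounds the mean, not the allocation criterion itself, and it assumes a one-sided bound at level 1 - delta; the function names and the bisection tolerance are hypothetical. The Beta cumulative distribution function is computed with the standard identity for integer parameters, so that only the standard library is needed.

```python
from math import comb

def beta_cdf(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

def credible_upper_bound(n_samples, n_ones, delta, tol=1e-10):
    """Upper bound u such that P(mu <= u | data) = 1 - delta under a
    Beta(1, 1) prior, found by bisection on the posterior CDF."""
    a, b = n_ones + 1, n_samples - n_ones + 1
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < 1 - delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Cluster with 10 samples, 8 of them labeled 1 (illustrative values)
upper = credible_upper_bound(10, 8, delta=0.05)
```

In this sketch a smaller delta gives a larger upper bound and therefore more exploration, and the bound tightens as the number of samples in the cluster grows.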
The problem solved by those algorithms is that of finding the best label to give to several separated clusters. This separation comes from the partition of a continuous instance space. A reasonable hypothesis is that the values of $\mu_k$ do not vary fast and that neighbor clusters have close values of $\mu_k$. In order to speed up learning and to increase generalization, we could estimate $\mu_k$ considering neighbor clusters. This is the subject of the next section.

A Set of Partitions.
The previous section introduces an active learning algorithm which is based on a partition of the instance space. Supposing this partition is given, it defines the best allocation of samples among its clusters, the one that leads to the lowest true risk of the classifier also based on this partition. The best performance of the classifier still highly depends on the choice of the partition, which has no obvious solution. One way to improve the classifier's performance is to increase the number of clusters in the partition. But this slows learning, as each cluster parameter has to be estimated independently.
To counter that, we allow the classifier to generalize by letting neighbor clusters share information. In order to use the same approach as before, we consider the case of a committee of partitions. Each partition estimates the parameters of its clusters independently. Then, the local prediction of the label is determined by averaging the estimations of each partition.
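The committee idea above can be sketched as follows: each partition estimates the mean label of the cluster containing the query point, and the estimates are averaged. This is a simplified illustration, not the weighting used in the paper: the function name `committee_estimate`, the two one-dimensional partitions, the labeled points, and the default value 0.5 for a cluster without labels are all assumptions.

```python
def committee_estimate(x, partitions, labeled):
    """Average, over partitions, of the mean label observed in the
    cluster of each partition that contains x.

    A cluster with no labeled instance contributes 0.5 (assumed default).
    """
    estimates = []
    for assign in partitions:
        same = [y for xi, y in labeled if assign(xi) == assign(x)]
        estimates.append(sum(same) / len(same) if same else 0.5)
    return sum(estimates) / len(estimates)

# Two hypothetical partitions of [0, 1) with cut points at 0.4 and 0.6
partitions = [lambda x: int(x >= 0.4), lambda x: int(x >= 0.6)]
labeled = [(0.1, 0), (0.5, 1), (0.9, 1)]
estimate = committee_estimate(0.45, partitions, labeled)
label = 1 if estimate >= 0.5 else 0
```

Here the first partition sees two ones in the cluster of 0.45 and the second sees one zero and one one, so the averaged estimate is 0.75.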
Let N be a set of partitions of the instance space: where ∀ ∈ {1, . . . , }: with the following properties: Each partition may have a different number of subsets .
These partitions may come from random forests [22] or tile coding which is a function approximation method commonly used in the field of reinforcement learning [23]. The partitions must not change during the progress of the algorithm.
Let us now define the thinnest partition N, which is the partition resulting from overlapping all the partitions from N: This means that two elements coming from the same subset of the thinnest partition necessarily come from the same subset in any partition of the set.
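One way to picture the thinnest partition is to give each point the tuple of cluster indices it receives in every partition; two points share a cluster of the overlap exactly when their tuples are equal. The sketch below does this for a finite set of points. The function name `thinnest_partition` and the two example partitions are assumptions for illustration, and only clusters that contain at least one of the given points are produced.

```python
def thinnest_partition(points, partitions):
    """Group points by the tuple of cluster indices they receive in
    every partition; each distinct tuple is one cluster of the overlap."""
    clusters = {}
    for x in points:
        key = tuple(assign(x) for assign in partitions)
        clusters.setdefault(key, []).append(x)
    return clusters

# Two hypothetical partitions of [0, 1) into two intervals each,
# with cut points at 0.4 and 0.6
partitions = [lambda x: int(x >= 0.4), lambda x: int(x >= 0.6)]
points = [0.1, 0.3, 0.45, 0.5, 0.7, 0.9]
clusters = thinnest_partition(points, partitions)
```

With these two partitions the overlap has three clusters: below 0.4, between 0.4 and 0.6, and above 0.6.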
Each cluster of this thinnest partition is associated with a Bernoulli distribution of parameter .
We writê= (1/ ) ∑ =1 1 ∈X , the average label in cluster , and = ∑ =1 1 ∈X , the number of samples in cluster of the thinnest partition. We write , , , the relative importance of cluster of the thinnest partition in cluster of partition . Note that of the partition of a two-dimensional instance space. The set is composed of four partitions of four clusters each. The resulting thinnest partition is, in this case, composed of nine clusters. The dots represent the unlabeled instances with which the weights of each clusters of each partition are computed. The influence of a cluster of the thinnest partition in the estimation of the mean value in one cluster of a specific partition of the set is also computed. What is not shown is how the influence of this last estimation on the final prediction is computed. It is 1/4 in this case because there are 4 partitions in the set.
The prediction of the label in cluster $k$ results from averaging the estimations of each partition. This can also be written in matrix form, with a matrix whose elements are defined for all $(k_1, k_2) \in \llbracket 1, K \rrbracket^2$. Note that the size of the matrix depends on the number of subsets in the thinnest partition. It may be very large if the initial partitions are not constrained. This is not a problem, since a subset of the thinnest partition containing no instance from the training set causes its corresponding column of the matrix to be null. It can therefore be removed. The size of the matrix is thus limited by the number of instances in the training set.
In the active learning setting with a pool-based sampling scheme, the labels of the instances in the training set are not known in advance but are acquired sequentially. However, since the pool of unlabeled instances is known initially, we can compute the matrix at the beginning of our algorithm and keep it until the end. This is different from the case where only data already sampled are considered, as done in Random Forests, and the matrix has to be recomputed at each step.
Each cluster $\mathcal{X}_k$ of the thinnest partition is an arm of a multiarmed bandit characterized by a Bernoulli distribution $\nu_k$ with mean value $\mu_k$. At each round, or time step, $t \geq 1$, an allocation strategy selects an arm $k_t \in \llbracket 1, K \rrbracket$, which corresponds to picking an instance randomly in the cluster $\mathcal{X}_{k_t}$, and receives a sample $y_t \sim \nu_{k_t}$, independently of the past samples. Let $(w_k)_{k \in \llbracket 1, K \rrbracket}$ denote the weight of each cluster.
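Then, $T_{k,t} = \sum_{s=1}^{t} \mathbb{1}_{x_s \in \mathcal{X}_k}$ is the number of samples taken in subset $k$ up to time $t$, and $\hat{\mu}_{k,t} = (1/T_{k,t}) \sum_{s=1}^{t} y_s \mathbb{1}_{x_s \in \mathcal{X}_k}$ if $T_{k,t} \neq 0$, and 0.5 otherwise, is the mean of labels taken in subset $k$ up to time $t$.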
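Let us define the classifier that gives to subset $k$ the label obtained by rounding a reweighted combination of the estimates $\hat{\mu}_{j,t}$ of all subsets. Note that it gives the same label as (37), where $\hat{y}_{k,t} = [\sum_{j=1}^{K} a_{k,j} \hat{\mu}_{j,t}]$ with $a_{k,j}$ the elements of the matrix, and the only difference is that the value inside the $[\cdot]$ is reweighted such that it is 1 when all $\hat{\mu}_{j,t}$ are equal to 1.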

Full Knowledge Criterion.
In this section, we define the allocation of the budget in the full knowledge setting for the case of information sharing clusters. We also introduce a criterion that leads to this allocation. Again, those parameters are only used to define the allocation of the budget and not for the prediction of labels.
In this problem, the clusters are not independent. This means sampling an instance in a cluster affects the prediction of other clusters. The criterion shown here results from the myopic minimization of the true risk. With the weights of each cluster being estimated from the number of instances in the pool (labeled and unlabeled), the true risk is computed as the risk on the pool. To decide the next cluster to sample, we simulate sampling in each cluster and evaluate its expected impact on the risk. The selected cluster is thus the one which lowers the risk most.
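Here, the risk is based on the binary loss: $\ell(y, \hat{y}) = \mathbb{1}_{y \neq \hat{y}}$. Note that this loss is usually hard to use because of its nonconvex nature. First, suppose that the label $\hat{y}_{k,t}$ is given to each subset $k$ by the classifier at time $t$. The true risk encountered in each subset is $R_{k,t} = w_k \left( (1 - \hat{y}_{k,t}) \mu_k + \hat{y}_{k,t} (1 - \mu_k) \right)$. Note that the best risk is attained when $\hat{y}_{k,t} = [\mu_k]$.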
Then, the true risk in each subset can also be written as follows: $R_{k,t} = w_k \left( \mu_k + \hat{y}_{k,t} (1 - 2\mu_k) \right)$. Knowing that the probability for the next sample taken in subset $k$ to be 1 is $\mu_k$, we obtain the expected decrease in the risk of each subset incurred by sampling a new instance in subset $k$. With the global risk being the sum of the risks in each subset, $R_t = \sum_{j=1}^{K} R_{j,t}$, its expected decrease relative to a new sample in subset $k$ is the sum of these expected decreases. We then make a myopic minimization of the risk. Thus, at time $t$, our full knowledge algorithm samples the subset with the highest expected decrease, which is the criterion for the online allocation of the budget $n$. Through this section, we have seen an online criterion, based on the maximum expected decrease in terms of risk, which a full knowledge algorithm would use to sample. The hypothesis made about the knowledge of the $\mu_k$ values is unrealistic because if they were known, the classification would be obvious; however, it allowed us to determine a good allocation strategy that our partial knowledge algorithm tries to attain. In the next section, we remove this hypothesis and use an optimistic approach to estimate the $\mu_k$ and at the same time allocate samples as close as possible to the full knowledge allocation.

Included Optimism.
In this section, we introduce two optimistic algorithms based on the full knowledge criterion of the previous section. The values of $\mu_k$ are not given, so the selection strategy cannot use them directly. However, the labels acquired during the sampling process allow us to estimate them. Still, the simple replacement of the true values with their estimates is unjustified and could lead to a bad allocation.
Instead, we should consider the exploration/exploitation tradeoff. Beyond the estimates alone, we are able to compute a distribution on the belief of $\mu_k$. A Bayesian Credible Interval, relative to a probability $\delta$, can thus be computed on the values of $\mu_k$, as well as on the criterion.
Referring to the Optimism in the Face of Uncertainty approach, we define a selection strategy that samples at every time step the cluster for which the upper bound of the Bayesian Credible Interval is highest. This leads to an allocation of the samples as close as possible to the Full Knowledge allocation even though the true value of $\mu_k$ is not known.
Our first algorithm is OEMAL (Optimistic Error Minimization for Active Learning). It uses the set of partitions as defined in the previous section. The classifier is only defined by the matrix representing the influence of clusters on each other. The number of clusters in the thinnest partition is not limited. One particular case of OEMAL is when its clusters contain at most one instance from the pool. This makes the matrix act as the covariance matrix used in kernel methods.
Our second algorithm is OEMAL-k (Optimistic Error Minimization for Active Learning: kernel version). It takes as input any covariance matrix and uses it as the matrix in OEMAL. This broadens the scope of the classifiers that can be used. Note that, in order to use Optimism in the Face of Uncertainty in a proper way, the matrix still must not change. The rest of the section works for both algorithms.
Our algorithm is displayed in Algorithm 2. It takes one parameter $\delta$, which allows us to control the level of exploration used by our algorithm.
At any time, we are able to compute a Bayesian Credible Interval on the value of the parameters of each cluster. Each subset of the thinnest partition is associated with a Bernoulli distribution of parameter $\mu_k$. Thus, the estimated means are drawn from Binomial distributions. Beta distributions provide a family of conjugate prior probability distributions for Binomial distributions. With the prior probability distribution being Beta(1, 1) (uniform distribution), by Bayesian inference, $\mu_k \mid \hat{\mu}_k \sim \mathrm{Beta}(T_k \hat{\mu}_k + 1, T_k (1 - \hat{\mu}_k) + 1)$. However, this inference does not consider observations from the neighbor clusters. Let us define $m = (m_1, \ldots, m_K)$, where, for all $k \in \llbracket 1, K \rrbracket$, $m_k$ is the decision criterion of the classifier defined in (39). In order to get an early guess about the value of $\mu$, we chose to infer $m$ instead, which aims to approach $\mu$, and use it in place of $\mu$. Even though $m \neq \mu$ and we may lose some accuracy, this is necessary for our algorithm to work well.
First, let us state thatm results from a sum of independent Binomial distributions; therefore assuming that the number of nonnull elements in each row of is large enough, we can approximate thatm follows a normal distribution.
The normal distribution provides a family of conjugate prior probability distributions for the normal distributions; thus, our belief about m follows a normal distribution with a mean of = ( 1 , . . . , K ) with ∀ ∈ {1, . . . , K}, and a standard deviation s = ( 1 , . . . , K ) with ∀ ∈ {1, . . . , K}, with V = ( 2̂( 1 −̂) + + 1)/( + 2) 2 ( + 3), the variance of the Beta distribution. Then, This algorithm, simulates the sampling of a new instance in order to estimate the resulting gain in risk. Within each simulation the distribution of the belief is updated by taking into account the new sample. Then, this new distribution of the belief is used to compute a distribution on the gain in risk.
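As a numerical sanity check of the Beta posterior under the uniform Beta(1, 1) prior and of its variance, the following minimal Python sketch may help. The function names and the `successes` and `n` arguments are illustrative assumptions, not notation from the paper; the closed form is written with the estimated mean taken as `successes / n`.

```python
def beta_posterior(successes, n):
    """Posterior Beta(a, b) of a Bernoulli parameter under a uniform Beta(1, 1) prior."""
    a = successes + 1          # ones observed, plus the prior pseudo-count
    b = (n - successes) + 1    # zeros observed, plus the prior pseudo-count
    return a, b


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


def closed_form_variance(successes, n):
    """Same variance written in terms of the estimated mean and the sample count."""
    mu_hat = successes / n if n else 0.0
    return (n ** 2 * mu_hat * (1 - mu_hat) + n + 1) / ((n + 2) ** 2 * (n + 3))


a, b = beta_posterior(successes=3, n=10)
print(a, b, beta_variance(a, b), closed_form_variance(3, 10))
```

Both expressions agree because the product of the two posterior parameters expands to the numerator of the closed form, while their sum is the sample count plus 2.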
Let us note the label of the instance sampled in the simulation. In order to compute the distribution of the resulting gain in risk, we define two methods.

Method 1 (only for OEMAL). (i) Draw a large number of instantiations of the belief from $P(m \mid \hat{\mu})$.
(ii) Compute for each case the resulting value of the criterion.
(iii) Estimate the distribution of the criterion.
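The three steps of Method 1 can be sketched as a plain Monte Carlo procedure. This is only an illustration under stated assumptions: the belief is taken to be an independent normal distribution per cluster, `criterion` is a hypothetical placeholder for the gain-in-risk computation, and `delta` is an assumed name for the confidence level used to read an upper bound off the empirical distribution.

```python
import math
import random


def mc_upper_bound(means, stds, criterion, delta, n_draws=10000, seed=0):
    """Empirical upper credible bound of criterion(m), with m drawn from the belief."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_draws):
        # (i) draw one instantiation of the belief (independent normals, an assumption)
        m = [rng.gauss(mu, s) for mu, s in zip(means, stds)]
        # (ii) compute the resulting value of the criterion
        values.append(criterion(m))
    # (iii) estimate the distribution: smallest drawn value b such that
    # the empirical frequency of values above b is at most delta
    values.sort()
    idx = max(0, math.ceil((1 - delta) * n_draws) - 1)
    return values[idx]


print(mc_upper_bound([0.0], [1.0], sum, delta=0.05))
```

The number of draws trades precision against time, which is what the complexity discussion of Method 1 refers to.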
In our experiments, the set of partitions N we used is such that every cluster of the thinnest partition contained at most one instance from the pool. Thus, the value of the parameter of each cluster is contained in {0, 1}, and the criterion is evaluated with the quantities defined in (39).
The simulation of a sample can lead to two states of the classifier. Depending on this state, the true risk is the sum of the true risk in each cluster, with at most two cases each (a difference in {−1, 1} or a difference of 0), depending on the value of the parameter in each cluster. We can then compute the probability of a value of the criterion by combining the probability of each case. As the difference of the true risk for one cluster is 0 whenever the prediction of the classifier does not change, we only consider clusters for which the prediction of the classifier changes. With this, the computation of the distribution of the criterion can be done in a reasonable time.
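The combinatorial computation described above can be illustrated by enumerating the outcomes of the clusters whose prediction changes. In this hedged sketch, `p_correct` is a hypothetical list giving, for each changed cluster, the belief that the new prediction is the right one; clusters are treated as independent and unweighted, which is a simplifying assumption.

```python
from itertools import product


def exact_gain_distribution(p_correct):
    """Exact distribution of the total gain over the clusters whose prediction changes."""
    dist = {}
    # each changed cluster contributes +1 (new prediction right) or -1 (wrong)
    for outcome in product((1, -1), repeat=len(p_correct)):
        prob = 1.0
        for gain, p in zip(outcome, p_correct):
            prob *= p if gain == 1 else 1.0 - p
        total = sum(outcome)
        dist[total] = dist.get(total, 0.0) + prob
    return dist


print(exact_gain_distribution([0.5, 0.5]))
```

The loop visits two outcomes per changed cluster, so its cost grows exponentially with the number of changed clusters and stays cheap only when few predictions change.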
Having computed the distribution of the criterion, we can compute the upper bound of the Bayesian Credible Interval, which is the smallest value $b$ such that $P(\text{criterion} > b \mid \hat{\mu}, T) \le \delta$, where $\delta$ is the parameter of the algorithm. (61)
In this section, we derived two optimistic algorithms that address the active learning problem for classification. OEMAL considers a set of partitions of the instance space as well as the thinnest partition, resulting from overlapping this set. OEMAL-k considers a kernel. They both find the cluster of the thinnest partition for which sampling in it results in the greatest decrease of the true risk.
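For a discrete distribution of the criterion, the upper bound of the credible interval can be read off the upper tail. The sketch below assumes the distribution is given as a dictionary mapping values to probabilities and uses `delta` as an assumed name for the confidence parameter.

```python
def upper_credible_bound(dist, delta):
    """Smallest value b in the support such that P(X > b) <= delta."""
    tail = 0.0   # probability mass strictly above the value being examined
    best = None
    for value in sorted(dist, reverse=True):
        if tail > delta:
            break            # every smaller value would leave too much mass above it
        best = value
        tail += dist[value]
    return best


dist = {2: 0.25, 0: 0.5, -2: 0.25}
print(upper_credible_bound(dist, 0.3), upper_credible_bound(dist, 0.1))
```

A smaller `delta` pushes the bound upward, which corresponds to a more optimistic, and therefore more exploratory, selection.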

Computational Complexity.
Let $K$ be the number of clusters of the thinnest partition, and let $K'$ be the number of clusters with at least one unlabeled instance. The computation of the criterion requires $O(K^2)$ time complexity.
Using Method 1, let $n_{\mathrm{its}}$ be the number of instantiations of the belief. The selection of the next cluster thus requires $O(K^2 K' n_{\mathrm{its}})$ time complexity. Indeed, at each time step, it computes the criterion for $n_{\mathrm{its}}$ values of the parameters drawn from the posterior and for the simulation of a sample in each one of the $K'$ clusters.
In the case where each cluster contains only one instance from the pool, the decrease in risk in one cluster can be 0 if the predicted label does not change, or 1 or −1 depending on the true label, if it changes. Let $n_{\mathrm{changes}}$ be the number of clusters whose label changes for the considered simulation of a sample.
Using Method 2, the computational complexity of the combinatorial procedure is $O(2^{n_{\mathrm{changes}}})$. Thus, the selection of the next cluster requires a computational complexity of $O(K^2 K' 2^{n_{\mathrm{changes}}})$. In addition to being more precise than Method 1, because the computed cumulative distribution is exact, Method 2 is also faster, because $n_{\mathrm{changes}}$ rapidly decreases as more samples are acquired.
Note that the number of partitions in the set is not involved in the computational complexity. Indeed, the partitions are only involved in the computation of the relations between subsets of the thinnest partition, which is made beforehand. This is another advantage compared with the use of random forests, where each tree has to be recomputed at every time step.

Evaluation
In this section, we evaluate the algorithms introduced in the previous sections. Each evaluation compares our algorithms with state-of-the-art algorithms. Algorithms are evaluated on two different benchmarks depending on the classifier used.
In the case of independent clusters, either the instance space is already discrete, or it is continuous and must be partitioned. In the latter, the partition has to be chosen before taking any samples and cannot be changed after that. The predicted label in each cluster depends only on samples drawn in it. Thus, the number of clusters is limited by the budget of samples and by a desired good learning rate. This makes the partition rough and the classifier noncompetitive on real world datasets. However, the small number of parameters needed to represent any real world problem allows us to build a representative benchmark.
In the case of information sharing clusters, a partition is also given beforehand but the predicted label in each cluster may depend on all the samples. The number of clusters is thus not limited as before, and an extremely refined partition, where each cluster contains only one instance from the pool, may be used. Thus, this algorithm may be evaluated on real world problems with a continuous instance space.
6.1. Independent Clusters
6.1.1. Practical Implementation and Experimental Setup.
Two optimistic algorithms were introduced in the context of independent clusters, each based on a different full knowledge criterion: the first one defines a one-shot allocation of the samples, which means that the optimal number of samples to take in each cluster is a function only of its parameter and of the budget, and the second one defines a fully sequential allocation of the samples, which means that the optimal number of samples to take in each cluster also depends on the samples drawn.
The performance of the algorithms closely depends on the choice of the partition. Clusters may regroup instances with a great dispersion of labels, either because the instances are subject to a lot of noise, or because the mean label varies rapidly. In this second case, a better discretization of the instance space causes a better true risk when the correct label is given. But this implies increasing the number of clusters, and for the prediction of the label to be equivalently accurate, we need a higher budget. As the problem of the choice of the partition is not studied here, the use of our algorithm on real world datasets with a continuous instance space would not be competitive with methods designed for this problem. Instead, we show that, given any partition, the algorithm performs better than other algorithms with the same constraints.
In this problem, where the classifier predicts one label per cluster, the location in the cluster of the sampled instance is of no interest. Therefore, every cluster can be seen as a pool of instances returning labels with a certain proportion. With the labels being either 0 or 1, this leads to a representation of clusters as Bernoulli distributions, each characterized by a single parameter. The relative position of the clusters is also of no interest for the classifier. Consequently, every problem our algorithm could encounter is only characterized by the number of clusters, the parameter of each cluster, and the weight of each cluster. If two problems share the same characteristics, our algorithms will act the same on them.
Our first benchmark thus consists in randomly generating a set of problems by drawing random parameters from the above definition. Then, the tested algorithms are launched on each problem of the set, and their true risk is recorded at every time step. The current predicted label for each cluster is compared to the correct one (the rounded value of the true parameter), and the true risk is the weighted sum of this logical comparison. Note that because the problems are built in, the true parameter is known and we do not need a test set. Likewise, the samples are directly drawn from the distributions and not drawn from a pool. The global performance of the algorithms results from averaging the true risk at each time step. In our experiments, the benchmark contains 1000 problems generated with (i) $K$ drawn uniformly in $\llbracket 1, 50 \rrbracket$. The weights are normalized, but this is incidental as it does not affect the behavior of the algorithms. We use a budget of 1000 samples. The results are displayed with the time step range starting at 100 in order to be able to differentiate algorithms.
Our second benchmark comes from the fact that not all state-of-the-art algorithms consider the weight given to clusters. In order to make sure that the good performance of our algorithms is not only due to this consideration, in this benchmark the weights are equal for all clusters. The problem of allocation consists in defining which cluster has priority over another. Thus, any problem is a subproblem of the one containing an infinite number of clusters covering all the possible values of the parameter. We therefore generate one problem containing many clusters with the widest variety of parameters. In our experiments, this problem is generated with (i) $K = 100$, (ii) $\forall k \in \llbracket 1, K \rrbracket$, $\mu_k = k/K$, (iii) $\forall k \in \llbracket 1, K \rrbracket$, $w_k = 1/K$.
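The second benchmark is small enough to be written out directly. The sketch below assumes the reading of the setup as 100 clusters with parameters k/K and equal weights 1/K; the names `mu`, `w`, and `best_true_risk` are illustrative. Under this reading, the risk of the best possible labeling comes out at 0.25, the value the evaluation quotes for this problem.

```python
K = 100
mu = [k / K for k in range(1, K + 1)]   # Bernoulli parameter of each cluster
w = [1 / K] * K                          # equal weights for all clusters


def best_true_risk(mu, w):
    """Risk of the labeling that predicts the majority label in every cluster."""
    # predicting the rounded parameter errs with probability min(mu_k, 1 - mu_k)
    return sum(wk * min(mk, 1 - mk) for mk, wk in zip(mu, w))


print(best_true_risk(mu, w))
```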
We run the algorithms 1000 times on this problem to face the randomness of the samples and average their true risk at every time step. The results are displayed with the time step range starting at 100. To demonstrate the effectiveness of our algorithms, we compare them with existing state-of-the-art methods.
Random Sampling. This is the simplest baseline; at every time step the sampled cluster is drawn uniformly in $\llbracket 1, K \rrbracket$. Any active learning algorithm should do better than that.
Uniform Sampling. This is another simple baseline, in which the clusters are sampled uniformly. At every time step the sampled cluster is drawn uniformly among the least sampled ones.
Monte Carlo Upper Confidence Bound (MC-UCB) (see [20]). This algorithm also uses an optimistic approach, although it is not originally designed for classification. It aims to estimate uniformly well the parameter of the distribution in each cluster. We use this estimation to predict the label. This algorithm also considers a weight for each cluster.
EffECXtive (see [12]). This algorithm tries to identify the best hypothesis from a set by successively sampling the instance for which the expected reduction in weight is the greatest. The weight is defined as the sum of the squared probabilities of hypotheses. In our case, a hypothesis is characterized by the label given to each cluster, and the probability of a hypothesis is computed from the per-cluster label probabilities $P([\mu_k] = 1 \mid \hat{\mu}_k) = I_{0.5}(n_k\hat{\mu}_k + 1,\ n_k(1 - \hat{\mu}_k) + 1)$, with $I_x(a, b)$ being the regularized incomplete beta function.
The three optimistic algorithms evaluated in this section, namely, MC-UCB, OALC-DI, and OALC-DD, use a parameter that has been tuned by using a grid search.
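The two simple baselines of this comparison, random sampling and uniform sampling, can be sketched in a few lines. The `counts` list (number of samples already drawn in each cluster) and the function names are illustrative assumptions.

```python
import random


def random_sampling(counts, rng):
    """Pick any cluster uniformly at random, ignoring past samples."""
    return rng.randrange(len(counts))


def uniform_sampling(counts, rng):
    """Pick uniformly among the clusters that have been sampled the least."""
    fewest = min(counts)
    candidates = [k for k, c in enumerate(counts) if c == fewest]
    return rng.choice(candidates)


rng = random.Random(0)
counts = [2, 1, 1, 3]
print(random_sampling(counts, rng), uniform_sampling(counts, rng))
```

Uniform sampling keeps the counts balanced, so with as many time steps as clusters every cluster has exactly one sample, which random sampling does not guarantee.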

Evaluation.
First, we evaluate the performance of the algorithms on Benchmark 1. The results of the evaluation are displayed in Figure 3(a).
We first compare MC-UCB with OALC-DI. MC-UCB is an optimistic algorithm which tries to estimate uniformly well the mean value of the distribution in each cluster. This can be related to active learning in the case of regression. A good estimate of the mean value leads to a good prediction for the label. Therefore, this algorithm may be used on a classification problem even though it may not be the best. Indeed, it will spend effort to be precise on the estimation of the mean value even if this precision is of no use for the decision of the label. Those efforts could have been spent in a different cluster where the label uncertainty is larger. OALC-DI is the closest of our algorithms to MC-UCB as they both consider a Full Knowledge criterion that gives the optimal allocation of the budget without knowing the results of the samples. The two differ in that OALC-DI is specifically designed for classification. We can see that OALC-DI shows a significant improvement over MC-UCB, which indicates that working with an algorithm adapted for classification is not to be neglected.
We then compare OALC-DI and OALC-DD. Both algorithms aim to minimize the same objective function, which is the expected true risk of the classifier. Note that using directly the true risk based on the binary loss is usually avoided by minimizing a convex proxy of it. However, the second one is based on a full knowledge criterion that takes into account the current state of the classifier, which depends on the results of the samples so far, whereas the first one does not. We can see that, as expected, the version of our algorithm based on a data-dependent full knowledge criterion behaves better than the one based on a data-independent one. Note that even though the difference in performance is only 0.0016 at time step 1000, which means that 0.16% of the instances will be classified better, OALC-DD needs only 625 time steps to attain the performance of OALC-DI at time step 1000, which is a saving of 37.5% of the labeled instances.
Finally, let us take a look at the performances of OALC-DD and EffECXtive. Both algorithms are very similar as they greedily minimize an expected objective function. The difference is that EffECXtive uses a proxy of the true risk, the Rényi entropy, whereas OALC-DD, by allowing its objective to depend on the true parameters of the distribution before being optimistic about them, is closer to the actual true risk.
Note that EffECXtive, in our adaptation to a partition of the instance space, does not take into account relative importance given to clusters (weights), which are not trivial to include. The results show that OALC-DD performs slightly better than EffECXtive.
Let us now evaluate the performance on Benchmark 2. The results are displayed in Figure 3(b). This problem contains more clusters than any other problem in Benchmark 1. A higher budget is thus needed to attain comparable performances. Still, the range of time steps does not change. This allows us to focus on the first phase of the algorithms. In this problem, the weights are equal for all clusters, so that the quality of an algorithm is not only based on the fact it has this feature. In this problem, the best true risk is equal to 0.25.
Note that this range starts at 100 labeled examples. Since most algorithms need every cluster to be sampled at least once, at time step 100 each of the 100 clusters has been sampled exactly once by every algorithm apart from random sampling. This is why the performance of every algorithm starts at the same level, and we thus do not lose much by starting at this time step. The result of the first sample in a cluster is 0 or 1, leading to the same precision on the prediction. Some algorithms also need every cluster to be sampled twice, which is why the performance at time step 200 is the same for most algorithms. The performance of EffECXtive at time step 100 differs from that of the others because it samples indifferently clusters with no samples and clusters with one sample, but it matches the others at time step 200. OALC-DD prefers to sample clusters that have received two samples, 0 and 1, rather than clusters with one sample, which appears to be a good behavior given the results.
We can see that OALC-DD has better performance than all other algorithms at every time step.

Practical Implementation and Experimental Setup.
Two other optimistic algorithms were introduced. Instead of one partition, OEMAL uses a set of partitions which all compute the estimates independently and merge to predict the final label. The thinnest partition was defined, which is the partition resulting from the intersection of all the partitions in the set. We have seen that we could use a representation of the classifier which involves only the clusters of the thinnest partition and a matrix which tells how much the estimate in one cluster plays a role in predicting the label in another cluster. Using a set of partitions is thus equivalent to one partition with clusters that share information.
The classifier used by our algorithm shares some resemblance with Random Forests [22], as it uses a set of partitions and averages the prediction criterion of each one of them. In our algorithm, however, the set of partitions has to remain the same throughout the progress, whereas in Random Forests the partitions are recomputed at each step. Hence, our set of partitions cannot adapt to the received labels and must be defined at the beginning. We use purely random partitions of the instance space that are not based on the instances in the pool. It is thus more closely related to tile coding, which is a function approximation method commonly used in the field of Reinforcement Learning [23]. The nature of the partitions in the set and their number were not defined in the algorithm. In fact, the algorithm works given any set of partitions. The performances of the classifier clearly depend on the partitions in the set; for example, if all the partitions in the set are the same, then the problem is reduced to the one partition problem. But, as long as the partitions are diverse, their shape is not as decisive as before. In the context of one partition, the number of clusters in it could not be too large because the belief on the parameter which guided the selection strategy was only based on observations in its cluster. Now, the number of clusters in the thinnest partition has no limitation. This allows us to work with a continuous instance space without the loss in performance incurred by the choice of the partition.
The partitions we use consist of 7 successive random splits of the instance space along random dimensions. The number of partitions in the set is 10,000. At the end, the clusters of the thinnest partition contain either 1 or no instance from the pool.
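One possible reading of this construction is sketched below; it is an assumption, not the authors' implementation. Each partition is taken to be a list of axis-aligned splits at random thresholds over a unit hypercube, and the cell of the thinnest partition containing an instance is identified by its cell in every partition. All names are illustrative.

```python
import random


def random_partition(dim, n_splits, rng):
    """A partition described by n_splits random (dimension, threshold) splits."""
    return [(rng.randrange(dim), rng.random()) for _ in range(n_splits)]


def cell_id(x, partition):
    """Cell of x in one partition: on which side of each split it falls."""
    return tuple(x[d] > t for d, t in partition)


def thinnest_cell(x, partitions):
    """Cell of x in the intersection of all partitions."""
    return tuple(cell_id(x, p) for p in partitions)


rng = random.Random(0)
partitions = [random_partition(dim=2, n_splits=7, rng=rng) for _ in range(10)]
print(thinnest_cell((0.2, 0.7), partitions)[0])
```

Two instances share a cell of the thinnest partition only if they share a cell in every partition, which is why a large set of partitions quickly isolates each pool instance.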
OEMAL-k replaces the matrix in OEMAL by a covariance matrix relative to a kernel. In the experiments, we use a Gaussian kernel covariance matrix, whose scale parameter has been tuned to give the best performance with a full training set.
We evaluate our algorithm on several real world datasets from the UCI Machine Learning Repository [24]. The four datasets used in this paper are Australian, Diabetes, Heart, and Wdbc. In all those datasets, the instances belong to a continuous instance space. In each run of experiments, the dataset is randomly divided into two. The first half is used as the pool of unlabeled instances in which the algorithm is allowed to pick, while the second half is used as the test set. At each time step, the true risk of the current prediction is estimated via the test set and recorded. The global performance of an algorithm at each time step is computed as the average of the performance among the runs. In our experiments, the number of runs is set to 1,000. The parameter of the algorithm is tuned for every dataset using a grid search.
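For illustration, a Gaussian kernel covariance matrix over a small pool can be computed as follows. The exact parameterization used in the experiments is not recoverable from the text, so the common form exp(-d² / (2·scale²)) is assumed here, and the function name is hypothetical.

```python
import math


def gaussian_kernel_matrix(points, scale):
    """Pairwise Gaussian kernel values; the 2 * scale**2 normalization is an assumption."""
    def k(x, y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-sq_dist / (2.0 * scale ** 2))
    return [[k(x, y) for y in points] for x in points]


pool = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
for row in gaussian_kernel_matrix(pool, scale=1.0):
    print(row)
```

The matrix is symmetric with ones on the diagonal, and a larger scale makes distant instances influence each other more.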
We compare our algorithm with an existing state-of-the-art method and some baselines.
Random Sampling. This is the simplest baseline. At each time step a random instance is drawn from the pool of unlabelled instances.
Full Knowledge. This is the best our algorithm can achieve. At each time step, an instance is selected according to the full knowledge criterion. The true values of the parameters are used; thus, it is unrealistic, but it serves to show how well the exploration/exploitation tradeoff is achieved.
Uncertainty Sampling (see [7]). This is the most common active learning algorithm. At each time step, the instance whose label is the most uncertain for the classifier is selected.
Figure 4 displays the results of the evaluation for the four following datasets: Figure 4(a) Australian, Figure 4(b) Diabetes, Figure 4(c) Heart, and Figure 4(d) Wdbc.
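For a binary classifier that outputs a probability for label 1, uncertainty sampling is commonly implemented by picking the unlabeled instance whose predicted probability is closest to 0.5. The sketch below follows this standard formulation; the argument names are illustrative, and how the probabilities are obtained depends on the classifier.

```python
def uncertainty_sampling(probabilities, unlabeled):
    """Index of the unlabeled instance whose predicted probability is closest to 0.5."""
    return min(unlabeled, key=lambda i: abs(probabilities[i] - 0.5))


probabilities = [0.9, 0.55, 0.1, 0.48]   # predicted P(label = 1) for each pool instance
print(uncertainty_sampling(probabilities, unlabeled=[0, 1, 2]))
```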

Evaluation.
We built our optimistic algorithm by first defining a full knowledge criterion which guides the allocation of samples in the case where the true values of the parameters are known from the beginning. This is then used by the algorithm as a target allocation to attain in the case where those values are unknown. The performances of OEMAL are thus limited by those of the full knowledge allocation. We display the performances of the Full Knowledge allocation as a baseline for the Optimistic algorithm. If they both have the same performances, then the only way to improve the algorithm is to design a better Full Knowledge criterion. Otherwise, it is the management of uncertainty that must be improved. In other words, it tells whether the exploration/exploitation tradeoff is well achieved. The full knowledge allocation, for its part, sees its performances limited by the classifier, which may not achieve better performances whatever the set of instances of the size given by the time step. Also, whereas in the independent clusters case the full knowledge allocation was known to be optimal, meaning that we could not achieve better performances with a different allocation, this is no longer the case. Indeed, the myopic minimization of the risk has no guarantee of leading to the optimal allocation. For example, the best performance of the classifier at time step 2 could be achieved by a pair of instances that does not contain the instance that leads to the best performance at time step 1. Although the empirical results show that using a myopic minimization in full knowledge performs quite well, we can see that on the Wdbc dataset (Figure 4(d)), for a short period around 55 labeled instances, uncertainty sampling achieves slightly better performance than the full knowledge allocation. The optimistic algorithm thus cannot do better over this period.
Still, we can see that the optimistic algorithm outperforms uncertainty sampling on the first 20 samples and keeps a high performance at the end, while Uncertainty Sampling seems to lose accuracy. This last phenomenon can be explained. Let us look at Figure 5, where we display the results for the Wdbc dataset with the range of time steps extended to show the behavior of the algorithms until the last sample of the pool is retrieved. We can see that the performance of all the algorithms decreases while approaching the end. One may think that the performance can only increase with the number of samples taken into account by the classifier. We know that this is not necessarily true, particularly if the classifier overfits. Also, in active learning, one can select a subset of instances that achieves better performance than when taking all the instances from the pool. The question of a stopping criterion has already been studied in order to avoid spending useless resources. Here, we see that it is even more crucial because it could lead to better performances. In OEMAL, the criterion used represents the maximum decrease of the true risk one can expect by taking a particular sample. Thus, if the value of the criterion is less than 0, this means that there is a high probability ($1 - \delta$, with $\delta$ the parameter of the algorithm) that the true risk will not decrease but increase. In this case, it is preferable not to sample this instance. If even the maximum value of the criterion is less than 0, it is preferable not to sample at all. We use this as a stopping criterion. We compare the final performances of our algorithm with and without the stopping criterion for different datasets in Table 1. Two of the four datasets, Diabetes and Wdbc, see their true risk increase at the end. We can see that the use of a stopping criterion is efficient in those cases and does not greatly alter the performances in other cases.
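The stopping rule described above can be sketched as a thin wrapper around the optimistic selection. Here `upper_bounds` is a hypothetical mapping from each candidate cluster to the upper bound of its criterion, and returning `None` stands for the decision to stop sampling.

```python
def select_or_stop(upper_bounds):
    """Return the cluster with the highest upper bound, or None when sampling should stop."""
    if not upper_bounds:
        return None                       # nothing left to sample
    best = max(upper_bounds, key=upper_bounds.get)
    if upper_bounds[best] < 0:
        # even the most optimistic candidate is expected to increase the true risk
        return None
    return best


print(select_or_stop({0: -0.1, 1: 0.2}))
print(select_or_stop({0: -0.1, 1: -0.3}))
```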
The matrix appearing in the classifier we used so far is derived from the use of a set of partitions. The value in each cell of the matrix corresponds to the weight of each estimation in each prediction. We saw that it was possible to use clusters of the thinnest partition that contain only one instance from the pool. Instances are then linked to others through the matrix, which in this case inherently fully specifies the data manifold structure. The theory that leads to OEMAL is built upon this set of partitions, but we can imagine using a matrix of any kind, specifying other weights between instances. OEMAL may still work in this context. In particular, we can adapt this algorithm to the active learning of kernel classifiers. We thus use a Gaussian kernel covariance matrix as the matrix of the classifier. We now denote the kernel version of OEMAL by OEMAL-k. The results are displayed in Figure 6.
We can see that OEMAL-k always does better than uncertainty sampling. Even on the Wdbc dataset, where uncertainty sampling performs worse than random sampling, OEMAL-k manages to get better performance. One important thing in active learning is the choice of the classifier. With a well-fitted classifier, even the random sampling strategy could perform better than the best active learning strategy on a poor classifier. It is thus convenient that our algorithm is not limited to one kind of classifier and can easily generalize to kernel classifiers. For example, on the Wdbc dataset (Figures 4(d) and 6(d)), the random sampling strategy performs significantly better when using a kernel than a set of partitions, and OEMAL-k keeps improving the results. On the other hand, on the Diabetes dataset (Figure 6(b)), the random sampling strategy is also better with the kernel classifier, but OEMAL performs better than OEMAL-k. As we have seen in Figure 5, the best performance of the classifier may be achieved with only a subset of the instances. The former classifier may thus have a better potential than the latter. This is confirmed by the performance of the full knowledge criterion.
In this section, we evaluated the performance of OEMAL, which is built on the Optimism in the Face of Uncertainty approach, on several real world datasets. We thus demonstrated that this approach can be used for active learning in classification. We saw that it compares well to a well-known state-of-the-art algorithm. We also evaluated OEMAL-k and showed that our algorithm could be generalized to kernel methods or any graph-based method where the instances are linked to each other by a weight matrix.

Conclusion
In this paper, we show that the problem of active learning in classification can be studied through the lens of Optimism in the Face of Uncertainty. This has the advantage of allowing the selection criterion to be defined as close as possible to the evaluation function. We introduce three error minimization algorithms which use this approach. The experiments, conducted on built-in problems as well as real world datasets, show that they compare well to state-of-the-art methods. An extension of the last algorithm shows that it can be generalized to other kernels. We however constrained the matrix to remain the same throughout the progress of the algorithm. Our perspective is to work with a changing matrix, such as in kernel regression or Gaussian processes, where the variance of the estimates is given.