Online Manifold Regularization by Dual Ascending Procedure

We propose a novel online manifold regularization framework based on the notion of duality in constrained optimization. The Fenchel conjugate of hinge functions is the key to transferring manifold regularization from offline to online in this paper. Our algorithms are derived by gradient ascent on the dual function. For practical purposes, we propose two buffering strategies and two sparse approximations to reduce the computational complexity. Detailed experiments verify the utility of our approaches. An important conclusion is that our online MR algorithms can handle settings where the target hypothesis is not fixed but drifts with the sequence of examples. We also recap and draw connections to earlier works. This paper paves the way for the design and analysis of online manifold regularization algorithms.


Introduction
Semisupervised learning (S²L) of different classifiers is an important problem in machine learning with interesting theoretical properties and practical applications [1][2][3][4][5]. Unlike standard supervised learning (SL), the S²L paradigm learns from both labeled and unlabeled examples. In this paper, we investigate online semisupervised learning (OS²L) problems, which have three features: (i) data is abundant, but the resources to label it are limited; (ii) data arrives in a stream, and we cannot store all of it; (iii) no statistical assumptions are made, which means that the distribution of $(\mathbf{x}, y)$ can change over time.
OS²L algorithms take place in a sequence of consecutive rounds. On each round, the learner is given a training example and is required to predict the label if the example is unlabeled. To label the examples, the learner uses a prediction mechanism which builds a mapping from the set of examples to the set of labels. The quality of an OS²L algorithm is measured by the cumulative loss it incurs along its run. The challenge of OS²L is that we do not observe the true labels of unlabeled examples, so we cannot directly evaluate the performance of the prediction mechanism. Thus, if we want to update the prediction mechanism, we have to rely on indirect forms of feedback.
Many OS²L algorithms have been proposed in recent years. A popular idea [5,6] is to use a heuristic method to greedily label the unlabeled examples, which essentially still employs an online supervised learning framework. References [7][8][9] instead treat the OS²L problem as online semisupervised clustering, in which there are must-link pairs (in the same cluster) and cannot-link pairs (not in the same cluster), but the effectiveness of these methods is often degraded by "bridge points" (see the survey in [10]).
To solve the OS²L problem, we introduce a novel online manifold regularization (MR) framework for the design and analysis of new online MR algorithms in this paper. Manifold regularization is a geometric framework for learning from examples. This idea of regularization exploits the geometry of the probability distribution that generates the data and incorporates it as an additional regularization term. Hence, the objective function has two regularization terms: one controls the complexity of the classifier in the ambient space, and the other controls the complexity as measured by the geometry of the distribution.
Since decreasing the primal MR objective function is impossible before all the training examples have been obtained, we propose a Fenchel conjugate transform to optimize the dual problem in an online manner. Unfortunately, the basic online MR algorithms derived from our framework have to store all the incoming examples, and the time complexity on each learning round is $O(t^2)$. Therefore, we propose two buffering strategies and two sparse approximations to make our online MR algorithms practical. We also discuss the applicability of our framework to settings where the target hypothesis is not fixed but drifts with the sequence of examples.
To the best of our knowledge, the closest prior work is an empirical online version of manifold regularization of SVMs [11]. Their method defines an instantaneous regularized risk to avoid optimizing the primal MR problem directly. The learning process is based on convex programming with stochastic gradient descent in kernel space. The update scheme of this work can be derived from our online MR framework.
This paper is structured as follows. In Section 2, we begin with a primal view of the semisupervised learning problem based on manifold regularization. In Section 3, our new framework for designing and analyzing online MR algorithms is introduced. Next, in Section 4, we derive new algorithms from our online MR framework by gradient ascent. In Section 5, we propose two sparse approximations for the kernel representation to reduce computational complexity. Connections to earlier analysis techniques are in Section 6. Experiments and analyses are in Section 7. In Section 8, possible extensions of our work are given.

Problem Setting
Our notation and problem setting are formally introduced in this section. Italic lowercase letters refer to scalars (e.g., $a$ and $b$), and bold letters refer to vectors (e.g., $\mathbf{w}$ and $\mathbf{x}$). $(\mathbf{x}_t, y_t, \delta_t)$ denotes the $t$th training example, where $\mathbf{x}_t \in \mathbb{R}^d$ is the point, $y_t$ is its label, and $\delta_t$ is a flag that determines whether the label can be seen. If $\delta_t = 1$, the example is labeled; if $\delta_t = 0$, the example is unlabeled. The hinge function is denoted by $[a]_+ = \max\{a, 0\}$. $\langle \mathbf{w}, \mathbf{x}\rangle$ denotes the inner product between vectors $\mathbf{w}$ and $\mathbf{x}$. For any $t \ge 1$, the set of integers $\{1, 2, \ldots, t\}$ is denoted by $[t]$.
Consider an input sequence $(\mathbf{x}_1, y_1, \delta_1), (\mathbf{x}_2, y_2, \delta_2), \ldots, (\mathbf{x}_T, y_T, \delta_T)$, where $\mathbf{x}_t \in \mathbb{R}^d$ and $\delta_t \in \{0, 1\}$ ($t \in \{1, 2, \ldots, T\}$). Let $K$ be a kernel over the training points $\mathbf{x}$ and $\mathcal{H}_K$ the corresponding reproducing kernel Hilbert space (RKHS). The S²L problem based on manifold regularization [12] can be written as minimizing the objective $J(f)$ in (1), where $f \in \mathcal{H}_K$, $\|f\|_K^2$ is the squared RKHS norm of $f$, $h$ is a loss function for the predictions of the training points, $\lambda_1$ and $\lambda_2$ are trade-off parameters, $d(f(\mathbf{x}_i), f(\mathbf{x}_j))$ is the distance function which measures the difference between the predictions of $\mathbf{x}_i$ and $\mathbf{x}_j$, and $w_{ij}$ are the edge weights which define a graph over the $T$ examples, for example, a fully connected graph with Gaussian weights $w_{ij} = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma^2}$ or $k$-NN binary weights.
In (1), the objective function $J(f)$ is composed of three terms. The first measures the complexity of $f$, the second measures the loss on labeled examples, and the last is the manifold regularizer, which encourages prediction smoothness over the graph, meaning that similar examples tend to have the same predictions.
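The two graph constructions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the removal of self-edges, and the symmetrization of the $k$-NN graph are assumptions made here.

```python
import numpy as np

def gaussian_weights(X, sigma):
    """Fully connected graph: w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2<a, b>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-edges
    return W

def knn_binary_weights(X, k):
    """w_ij = 1 if x_j is among the k nearest neighbours of x_i, or vice versa."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(sq_dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(sq_dists, axis=1)[:, :k]
    W = np.zeros(sq_dists.shape)
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, neighbours.ravel()] = 1.0
    return np.maximum(W, W.T)  # assumption: symmetrize the graph

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
W_gauss = gaussian_weights(X, sigma=1.0)
W_knn = knn_binary_weights(X, k=1)
```

The Gaussian graph is dense, so its cost grows quadratically with the number of points, whereas the $k$-NN graph keeps at most $2k$ edges per point after symmetrization.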
Denote $f^* = \arg\min_{f \in \mathcal{H}_K} J(f)$. It is easy to find $f^*$ using existing optimization tools once all the training examples have arrived; this is called offline MR. In contrast, an online MR process is performed in a sequence of consecutive rounds. On each round, when an example $(\mathbf{x}_t, y_t, \delta_t)$ arrives, the online MR algorithm is required to present its predicted label and update its prediction mechanism so as to be more accurate later.
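For simplicity and concreteness, we focus on semisupervised binary linear classifiers in this paper, which means that $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x}\rangle$ and the data labels belong to $\{-1, +1\}$. $h$ is chosen as a popular convex loss function in supervised classification, the hinge loss, defined as

$$h(f(\mathbf{x}_t), y_t) = [1 - y_t f(\mathbf{x}_t)]_+. \quad (2)$$

The function $d(f(\mathbf{x}_i), f(\mathbf{x}_j))$ is defined as an absolute function in this paper:

$$d(f(\mathbf{x}_i), f(\mathbf{x}_j)) = |f(\mathbf{x}_i) - f(\mathbf{x}_j)|. \quad (3)$$

Furthermore, (3) is composed of two hinge functions (see Figure 1 for an illustration) as follows:

$$|f(\mathbf{x}_i) - f(\mathbf{x}_j)| = [f(\mathbf{x}_i) - f(\mathbf{x}_j)]_+ + [f(\mathbf{x}_j) - f(\mathbf{x}_i)]_+. \quad (4)$$

To learn a max-margin decision boundary, we can rewrite (1) for the linear classifier as (5). With a suitable definition of the edge weights, we obtain a simpler version of (5), denoted (6). Minimizing (6) in an online manner is what we consider in the rest of this paper.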
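The decomposition of the absolute function into two hinge functions, and the resulting linear objective, can be checked numerically. The sketch below assumes a particular normalization of the objective (a factor of 1/2 on the squared norm and a plain double sum over pairs), which is an assumption made for illustration; the function names are hypothetical.

```python
import numpy as np

def hinge(a):
    """Hinge function [a]_+ = max{a, 0}, applied elementwise."""
    return np.maximum(a, 0.0)

def primal_objective(w, X, y, delta, W, lam1, lam2):
    """Assumed form (normalization constants are an assumption):
    J(w) = 0.5 * ||w||^2
           + lam1 * sum_t delta_t * [1 - y_t <w, x_t>]_+
           + lam2 * sum_{i,j} w_ij * |<w, x_i> - <w, x_j>|.
    """
    f = X @ w
    complexity = 0.5 * float(w @ w)
    labeled_loss = float(np.sum(delta * hinge(1.0 - y * f)))
    diff = f[:, None] - f[None, :]
    # |a| is written as the sum of the two hinge functions [a]_+ and [-a]_+.
    manifold = float(np.sum(W * (hinge(diff) + hinge(-diff))))
    return complexity + lam1 * labeled_loss + lam2 * manifold

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, -1.0, 0.0])      # the label of the unlabeled point is a placeholder
delta = np.array([1.0, 1.0, 0.0])   # third example is unlabeled
W = np.ones((3, 3)) - np.eye(3)     # toy fully connected graph with unit weights
w = np.array([1.0, -1.0])
J = primal_objective(w, X, y, delta, W, lam1=1.0, lam2=0.1)
```

In this toy example both labeled points sit exactly on the margin, so only the norm term and the manifold term contribute to `J`.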

Online Manifold Regularization in the Dual Problem
In this section, we propose a unified online manifold regularization framework for semisupervised binary classification problems. Our presentation reveals how the S²L problem based on MR in Section 2 can be optimized in an online manner. Before describing our framework, let us recall the definition of the Fenchel conjugate, which we use as the main analysis tool. The Fenchel conjugate of a function $f : \operatorname{dom} f \to \mathbb{R}$ is defined as

$$f^*(\boldsymbol{\theta}) = \sup_{\mathbf{w} \in \operatorname{dom} f} \{\langle \mathbf{w}, \boldsymbol{\theta}\rangle - f(\mathbf{w})\}.$$

In particular, the Fenchel conjugate of hinge functions is the key to transferring manifold regularization from offline to online in this paper.
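Proposition 1. Let $f(\mathbf{w}) = \sum_{i=1}^{n} [b_i - \langle \mathbf{w}, \mathbf{x}_i\rangle]_+$, where for all $i \in \{1, 2, \ldots, n\}$, $b_i \in \mathbb{R}$, and

$$f^*(\boldsymbol{\theta}) = \begin{cases} -\sum_{i=1}^{n} \alpha_i b_i, & \text{if } \boldsymbol{\theta} \in \left\{ -\sum_{i=1}^{n} \alpha_i \mathbf{x}_i : \alpha_i \in [0, 1] \ \forall i \right\}, \\ \infty, & \text{otherwise.} \end{cases}$$

Proof. We first rewrite $f(\mathbf{w})$ as

$$f(\mathbf{w}) = \max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i \left(b_i - \langle \mathbf{w}, \mathbf{x}_i\rangle\right),$$

where $\alpha_i \in [0, 1]$ for all $i \in \{1, 2, \ldots, n\}$. Based on the definition of the Fenchel conjugate, we obtain

$$f^*(\boldsymbol{\theta}) = \max_{\mathbf{w}} \left( \langle \mathbf{w}, \boldsymbol{\theta}\rangle - \max_{\boldsymbol{\alpha}} \sum_{i=1}^{n} \alpha_i (b_i - \langle \mathbf{w}, \mathbf{x}_i\rangle) \right) = \max_{\mathbf{w}} \min_{\boldsymbol{\alpha}} \left( \langle \mathbf{w}, \boldsymbol{\theta}\rangle - \sum_{i=1}^{n} \alpha_i (b_i - \langle \mathbf{w}, \mathbf{x}_i\rangle) \right) = \min_{\boldsymbol{\alpha}} \max_{\mathbf{w}} \left( \Big\langle \mathbf{w}, \boldsymbol{\theta} + \sum_{i=1}^{n} \alpha_i \mathbf{x}_i \Big\rangle - \sum_{i=1}^{n} \alpha_i b_i \right).$$

The third equality follows from the strong max-min property, which allows the max-min problem to be transferred into a min-max problem. The inner maximum over $\mathbf{w}$ is finite only if $\boldsymbol{\theta} + \sum_{i=1}^{n} \alpha_i \mathbf{x}_i = \mathbf{0}$, which gives the stated result.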
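As a sanity check on the conjugate of a hinge function, consider the one-dimensional case with a single hinge term. This worked example is an illustration added here under its own notation, not part of the original derivation:

```latex
f(w) = [1 - w]_+ , \qquad
f^*(\theta) = \sup_{w \in \mathbb{R}} \bigl\{ \theta w - [1 - w]_+ \bigr\}
= \begin{cases}
\theta, & -1 \le \theta \le 0, \\
+\infty, & \text{otherwise.}
\end{cases}
```

For $w \ge 1$ the expression equals $\theta w$, which is unbounded above when $\theta > 0$; for $w \le 1$ it equals $(\theta + 1) w - 1$, which is unbounded above when $\theta < -1$. When $-1 \le \theta \le 0$ the supremum is attained at $w = 1$ and equals $\theta$. Writing $\theta = -\alpha$ with a coefficient $\alpha \in [0, 1]$ gives $f^*(\theta) = -\alpha$, so the conjugate is finite exactly when the coefficient lies in $[0, 1]$.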
Back to the primal problem, we want to obtain a sequence of boundaries $\mathbf{w}_0, \mathbf{w}_1, \ldots, \mathbf{w}_T$ such that $J(\mathbf{w}_0) \ge J(\mathbf{w}_1) \ge \cdots \ge J(\mathbf{w}_T)$. In (6), decreasing the objective function $J(\mathbf{w})$ directly is impossible without all the training examples. In practice, on each round we only have the examples that have arrived so far. We therefore work with the Fenchel conjugates of the per-example terms of the objective and describe the primal problem by a Fenchel conjugate transform, as in (14). In (14), we can see that our goal has been transferred from minimizing the primal problem $J(\mathbf{w})$ to maximizing the dual function $D$. In the following, we show how to ascend the dual function without the unobserved examples.
Based on Proposition 1, the Fenchel conjugate of each term is finite only if its argument can be written as a combination of the corresponding example vectors with coefficients in $[0, 1]$. Our online MR task can therefore be redescribed as ascending the dual function $D(\boldsymbol{\alpha})$ by updating the coefficient vector $\boldsymbol{\alpha}$. Unobserved examples have no influence on the value of the dual function in (16), provided that their associated coefficients are set to zero.
Denote by $\boldsymbol{\alpha}_t$ the coefficient vector $\boldsymbol{\alpha}$ on round $t$. The update of $\boldsymbol{\alpha}_t$ must satisfy two conditions: the coefficients associated with examples that have not yet arrived are fixed at zero, and $D(\boldsymbol{\alpha}_t) \ge D(\boldsymbol{\alpha}_{t-1})$. The first condition means that the unobserved examples have no influence on the value of the dual function $D(\boldsymbol{\alpha}_t)$, and the second means that the value of the dual function never decreases along the online MR process. Therefore, the dual function on round $t$ can be written using the observed examples only. Based on Lemmas 2 and 3 in the appendix, each coefficient vector $\boldsymbol{\alpha}$ has an associated boundary vector $\mathbf{w}$; on round $t$, the associated boundary vector of $\boldsymbol{\alpha}_t$ is given by (18). Using a more general form, the associated vector $\mathbf{w}_t$ in (18) can also be written as a weighted sum of the observed points, as in (19). To summarize, we propose a template online MR algorithm based on the dual ascending procedure in Algorithm 1.
Algorithm 1: A template online manifold regularization algorithm for semisupervised binary classification. Based on the dual ascending procedure, this template algorithm aims for an increment of the dual function on each round.
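The dual ascending idea can be illustrated on a generic objective made of a squared norm plus weighted hinge terms. The sketch below is not Algorithm 1 itself: the objective $J(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + \sum_k c_k [b_k - \langle \mathbf{w}, \mathbf{z}_k\rangle]_+$, its dual $D(\boldsymbol{\alpha}) = \sum_k c_k \alpha_k b_k - \tfrac{1}{2}\|\sum_k c_k \alpha_k \mathbf{z}_k\|^2$, and the names `c`, `b`, `Z` are assumptions chosen for illustration. Each round reveals one hinge term and maximizes the dual over that term's coefficient only, while coefficients of unrevealed terms stay at zero.

```python
import numpy as np

def boundary(alpha, c, Z):
    """Boundary vector associated with the coefficients: w = sum_k c_k alpha_k z_k."""
    return (c * alpha) @ Z

def dual(alpha, c, b, Z):
    """D(alpha) = sum_k c_k alpha_k b_k - 0.5 * ||w(alpha)||^2, alpha in [0, 1]^n."""
    w = boundary(alpha, c, Z)
    return float(np.sum(c * alpha * b) - 0.5 * (w @ w))

def primal(w, c, b, Z):
    """J(w) = 0.5 * ||w||^2 + sum_k c_k [b_k - <w, z_k>]_+."""
    return float(0.5 * (w @ w) + np.sum(c * np.maximum(b - Z @ w, 0.0)))

rng = np.random.default_rng(0)
n, dim = 8, 3
Z = rng.normal(size=(n, dim))
b = rng.uniform(0.0, 1.0, size=n)
c = np.full(n, 0.5)

alpha = np.zeros(n)                 # coefficients of unobserved terms stay at zero
dual_trace = [dual(alpha, c, b, Z)]
for t in range(n):                  # round t reveals hinge term t
    w = boundary(alpha, c, Z)
    grad = c[t] * (b[t] - Z[t] @ w)             # dD / d alpha_t
    curvature = (c[t] ** 2) * (Z[t] @ Z[t])     # -d^2 D / d alpha_t^2
    # Exact one-dimensional maximizer, clipped back into [0, 1].
    alpha[t] = np.clip(alpha[t] + grad / curvature, 0.0, 1.0)
    dual_trace.append(dual(alpha, c, b, Z))

w_final = boundary(alpha, c, Z)
```

Because the dual is a concave quadratic in each coefficient, the clipped one-dimensional maximizer can never decrease it, so `dual_trace` is nondecreasing and stays below the primal value at any boundary vector.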

Deriving New Algorithms by Gradient Ascent
In the previous section, a template algorithm framework for online MR was proposed based on the idea of ascending the dual function. In this section, we derive different online MR algorithms using different update schemes for the coefficient vector $\boldsymbol{\alpha}$ in the dual function.
Consider a subset of the dual coefficients selected on round $t$, and let $\alpha$ be an element of the coefficient vector $\boldsymbol{\alpha}$ that belongs to this subset. Our online MR algorithms simply perform the gradient ascent step (20) over the selected subset on round $t$ ($t \in \{1, 2, \ldots, T\}$), aiming to increase the value of the dual function, where $\eta_t \ge 0$ is a step size. We now propose three update schemes which modify different coefficients on each learning round.

Example-Associate (EA) Update.
In traditional online supervised learning, the prediction mechanism is usually updated using only the newly arrived example, as in the Perceptron. Based on this notion, we propose an example-associate update scheme that ascends the dual function by updating the coefficients associated with the new training example $(\mathbf{x}_t, y_t, \delta_t)$ on round $t$, that is, the coefficients that no longer need to be fixed at zero on round $t$. Based on Proposition 1, every element of the coefficient vector $\boldsymbol{\alpha}$ belongs to $[0, 1]$. Using the gradient ascent step in (20), the example-associate update process can be written coefficient by coefficient. Equations (22) and (23) also imply that a gradient ascent step is taken only when the partial derivative of $D$ with respect to $\alpha$ is nonnegative; otherwise, we do not perform a gradient ascent on $\alpha$.
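One round of an example-associate style update can be sketched as follows. This is an illustration under stated assumptions rather than the authors' exact update: it assumes the linear objective with a 1/2 factor on the squared norm, Gaussian edge weights with parameter `sigma`, new coefficients starting from zero, and a stationary step size `eta`; the function name `ea_update` is hypothetical. Each hinge term involving the new example is encoded as a triple (weight, offset, direction).

```python
import numpy as np

def ea_update(w, x_t, y_t, delta_t, buf_X, lam1, lam2, sigma, eta):
    """Update w using only the coefficients tied to the new example.

    Returns the new boundary vector and the resulting change of the dual value.
    """
    # Hinge terms c * [b - <w, z>]_+ that involve the new example.
    terms = []
    if delta_t == 1:
        terms.append((lam1, 1.0, y_t * x_t))          # label term [1 - y_t <w, x_t>]_+
    for x_i in buf_X:
        w_ti = np.exp(-np.sum((x_t - x_i) ** 2) / (2.0 * sigma ** 2))
        terms.append((lam2 * w_ti, 0.0, x_i - x_t))   # [<w, x_t - x_i>]_+
        terms.append((lam2 * w_ti, 0.0, x_t - x_i))   # [<w, x_i - x_t>]_+

    step = np.zeros_like(w)
    gain = 0.0
    for c, b, z in terms:
        grad = c * (b - w @ z)      # partial derivative of the dual at alpha = 0
        if grad <= 0.0:
            continue                # no ascent on this coefficient
        alpha = min(eta * grad, 1.0)  # gradient step, kept inside [0, 1]
        step += c * alpha * z
        gain += alpha * grad
    dual_change = float(gain - 0.5 * (step @ step))
    return w + step, dual_change

w0 = np.zeros(2)
x1, x2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# Round 1: a labeled example with an empty buffer.
w1, change1 = ea_update(w0, x1, 1.0, 1, [], lam1=1.0, lam2=0.1, sigma=1.0, eta=0.1)
# Round 2: an unlabeled example; only the manifold terms contribute.
w2, change2 = ea_update(w1, x2, 0.0, 0, [x1], lam1=1.0, lam2=0.1, sigma=1.0, eta=0.1)
```

With a sufficiently small stationary step size the dual change is nonnegative on every round, and the unlabeled update pulls the predictions of the two neighbouring points toward each other.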
Unfortunately, this update scheme does not work in practice because it needs to store every input point to update the boundary vector; its time complexity $O(t)$ also increases with the round $t$. Here, we propose two buffering strategies that use a small buffer of examples on each learning round. Denote the buffer index set by $B_t \subseteq [t-1]$; the example $(\mathbf{x}_i, y_i, \delta_i)$ belongs to the buffer on round $t$ if $i \in B_t$.
(i) Buffer-. Let the buffer size be $\tau$. This strategy replaces the oldest point $\mathbf{x}_{t-\tau}$ in the buffer with the new incoming point $\mathbf{x}_t$ after each learning round, which means that $B_t = \{t-\tau, t-\tau+1, \ldots, t-1\}$.
(ii) Buffer-. This buffering strategy replaces the oldest unlabeled point in the buffer with the incoming point while keeping labeled points. The oldest labeled point is evicted from the buffer only when the buffer is filled with labeled points.
Based on the previous analysis, the subset of dual coefficients can be chosen using the process in Algorithm 2.
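The two buffering strategies can be sketched with simple containers. The class names `FifoBuffer` and `KeepLabeledBuffer` are hypothetical and chosen here for illustration; examples are stored as `(x, y, delta)` tuples with `delta == 1` marking a labeled example.

```python
from collections import deque

class FifoBuffer:
    """Strategy (i): keep the most recent `size` examples, evicting the oldest."""

    def __init__(self, size):
        self.items = deque(maxlen=size)

    def add(self, x, y, delta):
        self.items.append((x, y, delta))  # deque drops the oldest when full

class KeepLabeledBuffer:
    """Strategy (ii): evict the oldest unlabeled example first.

    The oldest labeled example is evicted only when every stored example is labeled.
    """

    def __init__(self, size):
        self.size = size
        self.items = []  # ordered from oldest to newest

    def add(self, x, y, delta):
        if len(self.items) >= self.size:
            # Index of the oldest unlabeled example, or 0 if all are labeled.
            idx = next((i for i, (_, _, d) in enumerate(self.items) if d == 0), 0)
            del self.items[idx]
        self.items.append((x, y, delta))

stream = [("a", 1, 1), ("b", 0, 0), ("c", 0, 0), ("d", 0, 0), ("e", -1, 1)]
fifo, keep = FifoBuffer(3), KeepLabeledBuffer(3)
for x, y, delta in stream:
    fifo.add(x, y, delta)
    keep.add(x, y, delta)
fifo_names = [x for x, _, _ in fifo.items]
keep_names = [x for x, _, _ in keep.items]
```

On this toy stream the first strategy ends with the three most recent examples, while the second still holds the early labeled example `"a"`.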
Denote  max
We can also rewrite the update process in the form of (19), which gives the new associated boundary vector. Algorithm 3 shows an online MR algorithm based on EA update.
In particular, when a small stationary step size is chosen on each learning round, we must have $D(\boldsymbol{\alpha}_t) \ge D(\boldsymbol{\alpha}_{t-1})$. In this case, the update process of the boundary vector takes a simple fixed-step form.

Overall Update.
In fact, the dual coefficients in $\boldsymbol{\alpha}_{t-1}$ can also be updated in (28). Since $\boldsymbol{\alpha}_{t-1}$ has $(t-1)^2$ dual coefficients, it is impractical to update them individually. We therefore introduce a new variable $\rho_t$ into (29), which leads to (30). From (30), a gradient ascent update on $\rho_t$ amounts to multiplying all the dual coefficients in $\boldsymbol{\alpha}_{t-1}$ by $1 - \rho_t$. Since every dual coefficient in $\boldsymbol{\alpha}_{t-1}$ belongs to $[0, 1]$, we constrain $\rho_t \in [0, 1]$. The initial value of $\rho_t$ is zero. The coefficients updated on round $t$ therefore consist of $\rho_t$ together with the coefficients associated with the new example. The optimal step size $\eta_t^*$ can also be obtained using (24). If $\eta_t \in [0, \min\{\eta_t^{\max}, \eta_t^*\}]$, then $D(\boldsymbol{\alpha}_t) \ge D(\boldsymbol{\alpha}_{t-1})$. The overall update process can likewise be rewritten in the form of (19).
Algorithm 4: The process of getting the subset of dual coefficients for overall update.
The new associated boundary vector follows from this form. Algorithm 5 shows the online MR algorithm based on overall update.
As with EA update, we can also derive a stationary-step-size overall update and an aggressive-overall update from the previous analysis.

Two-Step Update.
In the two update schemes above, we implicitly assume that all the elements of an example $(\mathbf{x}_t, y_t, \delta_t)$ arrive at the same time. In some practical applications, however, the label $y_t$ is received some time after the training point $\mathbf{x}_t$. There is no need to wait until all the elements of an example have been received before updating the boundary vector. Here, we propose a two-step update scheme.
The two-step update scheme performs two updates on each learning round. The first update takes place after the training point $\mathbf{x}_t$ arrives and updates the boundary vector using the geometry of the training points. The second update takes place after $y_t$ and $\delta_t$ arrive and updates the boundary vector using the label. Both EA update and overall update can be used in each step of the two-step update scheme. As an example, we describe the two-step update scheme using EA update.
Denote by $\boldsymbol{\alpha}_{t-1/2}$ the coefficient vector after the first update on round $t$ and by $\mathbf{w}_{t-1/2}$ its associated boundary vector. The example-associate coefficients in the first update on round $t$ are $\alpha^1_{t1}, \alpha^2_{t1}, \ldots, \alpha^1_{t(t-1)}, \alpha^2_{t(t-1)}$, and they determine the new associated boundary vector $\mathbf{w}_{t-1/2}$. In the second update, the example-associate coefficient is $\alpha^0_t$, and the new associated boundary vector is given by (35). If $\delta_t = 0$, the second update in (35) does not happen, and the two-step update degenerates into EA update. The ranges of the step sizes $\eta_{t-1/2}$ and $\eta_t$ can be obtained by the same process as in Section 4.1. As in the previous analysis, the overall update can also be used in each step of the two-step update scheme.
The online MR algorithm based on the two-step update can be described in Algorithm 6.
This update scheme is better viewed as a new perspective on the online MR problem, and its effect depends on the update scheme used in each step. Therefore, we pay more attention to the first two update schemes in this paper.

Sparse Approximations for Kernel Representation
In practice, kernel functions are commonly used to find a linear classifier in a feature space, as in SVMs. Our online MR framework involves the points only through inner products of pairs of points, so we can easily introduce a kernel function into our framework. Let $K$ denote the kernel matrix with entries $K_{ij} = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j)\rangle$, so that $\mathbf{x}_i$ can be replaced by $\Phi(\mathbf{x}_i)$ in our framework. We can therefore rewrite (19) in kernel form. Unfortunately, the online MR algorithms with kernel functions in Section 4 have to store the example sequence up to the current round (in the worst case). When a buffering strategy with a buffer size of $\tau$ is used, the stored matrix size is $\tau \times t$ and the time complexity is $O(\tau \times t)$ on round $t$. For practical purposes, we present two approaches to construct a sparse kernel representation of the boundary vector on each round.
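A kernel representation of the boundary vector stores points and coefficients, and predicts through kernel evaluations only. The sketch below is a minimal illustration with an RBF kernel; the class name `KernelExpansion`, the attribute names, and the parameter `sigma_k` are assumptions made here.

```python
import numpy as np

def rbf_kernel(x, z, sigma_k):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / (2 sigma_k^2))."""
    return float(np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma_k ** 2)))

class KernelExpansion:
    """Represents f(x) = sum_i coeff_i * K(x_i, x) over the stored points."""

    def __init__(self, sigma_k):
        self.sigma_k = sigma_k
        self.points = []
        self.coeffs = []

    def add(self, x, coeff):
        """Store a point together with its coefficient in the expansion."""
        self.points.append(np.asarray(x, dtype=float))
        self.coeffs.append(float(coeff))

    def predict(self, x):
        """Evaluate f(x); the cost is linear in the number of stored points."""
        x = np.asarray(x, dtype=float)
        return sum(c * rbf_kernel(p, x, self.sigma_k)
                   for p, c in zip(self.points, self.coeffs))

f = KernelExpansion(sigma_k=1.0)
f.add([0.0, 0.0], 2.0)
f.add([3.0, 4.0], -1.0)
value_at_origin = f.predict([0.0, 0.0])
```

Since every prediction touches every stored point, the number of points kept in the expansion directly drives the per-round cost, which is what the sparse approximations in this section aim to control.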

Absolute Threshold.
To construct a sparse representation of the boundary vector, the absolute threshold approach discards the examples whose associated coefficients are close to zero (more details in Section 7). Let $\varepsilon > 0$ denote the absolute threshold. Once the absolute value of the coefficient associated with an input example $\mathbf{x}_i$ can no longer increase in further updates, $\mathbf{x}_i$ is discarded if this absolute value is smaller than $\varepsilon$. The examples in the buffer cannot be discarded, since the absolute values of their associated coefficients may increase in subsequent rounds. The process of sparse approximation based on the absolute threshold is described in Algorithm 7.
The process of sparse approximation based on the absolute threshold may differ slightly between update schemes in practical applications. For online MR algorithms with EA update, the coefficients of input examples which are not in the buffer do not change in further updates, so this sparse approximation process only deals with the example that is removed from the buffer on round $t$ (for example, $\mathbf{x}_{t-\tau}$ when the oldest point is evicted). For online MR algorithms with overall update, this sparse approximation process deals with all the examples which are not in the buffer on the current round, since the coefficients of these examples can also change. This approach is not guaranteed to work: if all the coefficients have absolute values larger than $\varepsilon$ on each round, the kernel representation of the boundary vector will not become sparse at all.
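The absolute threshold rule can be sketched in a few lines. This is an illustration, not Algorithm 7: the function name, the dictionary representation of the coefficients, and the strict "smaller than" comparison are assumptions made here.

```python
def prune_by_threshold(coeffs, buffer_ids, eps):
    """Drop non-buffer examples whose coefficient magnitude is below eps.

    coeffs: dict mapping example index -> associated coefficient.
    buffer_ids: set of example indices currently in the buffer (never dropped).
    """
    return {i: c for i, c in coeffs.items()
            if i in buffer_ids or abs(c) >= eps}

coeffs = {0: 0.5, 1: 0.0005, 2: -0.0002, 3: -0.3}
pruned = prune_by_threshold(coeffs, buffer_ids={2}, eps=0.001)
```

Example 2 survives despite its small coefficient because it is still in the buffer and its coefficient may grow later; example 1 is outside the buffer and is dropped.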

𝑘 Maximal Coefficients (𝑘-MC).
Another way to construct a sparse kernel representation is to keep only the examples whose associated coefficients have the $k$ largest absolute values. This approach is called $k$ maximal coefficients ($k$-MC) in this paper. As with the absolute threshold, $k$-MC does not discard the examples in the buffer, since the absolute values of their associated coefficients may increase in subsequent rounds. The process of sparse approximation based on $k$-MC is described in Algorithm 8.
When $k$-MC is used for online MR algorithms with a buffer size of $\tau$, the stored kernel matrix size is at most $\tau \times (\tau + k)$ and the time complexity is $O(1)$ on each round.
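A sketch of the $k$-MC rule follows. It is an illustration rather than Algorithm 8: the function name and the dictionary representation are assumptions, and ties between equal magnitudes are broken by insertion order.

```python
def keep_k_maximal(coeffs, buffer_ids, k):
    """Keep buffer examples plus the k non-buffer examples with largest |coefficient|.

    coeffs: dict mapping example index -> associated coefficient.
    buffer_ids: set of example indices currently in the buffer (never dropped).
    """
    outside = [i for i in coeffs if i not in buffer_ids]
    top = sorted(outside, key=lambda i: abs(coeffs[i]), reverse=True)[:k]
    keep = set(top) | (set(coeffs) & set(buffer_ids))
    return {i: c for i, c in coeffs.items() if i in keep}

coeffs = {0: 0.5, 1: 0.01, 2: -0.7, 3: 0.2, 4: 0.0}
sparse = keep_k_maximal(coeffs, buffer_ids={4}, k=2)
```

The result never holds more than the buffer size plus $k$ examples, which is what bounds the size of the stored kernel matrix.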

On the Connection to Previous Work
6.1. About Dual Ascending Procedure. In the area of online learning, Shalev-Shwartz and Singer [13] propose a primal-dual perspective of online supervised learning algorithms. This work has the same dual ascending perspective as ours to achieve a better boundary vector. Unlike that work, we deal with an online MR problem of semisupervised learning, and our emphasis is on how to construct a dual ascending model in the semisupervised setting. An important conclusion of this paper is that the Fenchel conjugate of hinge functions is the key to transferring manifold regularization from offline to online, and this is also the reason why we use an absolute function to describe the difference between the predictions of two points. The primal basic MR problem degenerates into a basic supervised learning problem [14] when the trade-off parameter is chosen as $\lambda_2 = 0$. In that case, the primal objective reduces to the regularized hinge loss over the labeled examples, and the dual function degenerates into (39), which is the dual function of the basic supervised learning problem discussed in detail in [13].
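For concreteness, the degenerate supervised pair can be written out under the assumption that the squared norm carries a factor of 1/2 (the exact normalization is an assumption made here for illustration):

```latex
J(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2
  + \lambda_1 \sum_{t=1}^{T} \delta_t \,[\,1 - y_t \langle \mathbf{w}, \mathbf{x}_t \rangle\,]_+ ,
\qquad
D(\boldsymbol{\alpha}) = \lambda_1 \sum_{t=1}^{T} \delta_t \alpha_t
  - \tfrac{1}{2} \Bigl\| \lambda_1 \sum_{t=1}^{T} \delta_t \alpha_t y_t \mathbf{x}_t \Bigr\|^2 ,
\quad \alpha_t \in [0, 1].
```

This follows by writing each hinge term as $\max_{\alpha_t \in [0,1]} \alpha_t (1 - y_t \langle \mathbf{w}, \mathbf{x}_t\rangle)$ and minimizing over $\mathbf{w}$, which gives the associated boundary vector $\mathbf{w} = \lambda_1 \sum_t \delta_t \alpha_t y_t \mathbf{x}_t$. Setting $\alpha_t = 0$ for an example removes it from both sums, which is why unobserved examples do not affect the dual value.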
6.2. About Online Manifold Regularization. Goldberg et al. [11] present an empirical study of online MR which deals with the MR problem in (40), where $\lambda_1$ and $\lambda_2$ are trade-off parameters and $l$ is the number of labeled examples. Unlike our framework, they use a square function to measure the difference between the predictions of two points (see Figure 2). To avoid minimizing (40) directly, they further propose an instantaneous regularized risk $J_t(f)$ empirically on round $t$. In this risk, $T/l$ is the reverse label probability $1/p_l$, which is assumed to be given and easily determined based on the rate at which humans can label the data at hand. In our work, we ignore this rate since it can be absorbed into the trade-off parameters $\lambda_1$ and $\lambda_2$.
Based on the notion that $f_t$ can be written as a weighted sum over the observed points $\mathbf{x}_i$, Goldberg et al. perform a gradient descent step over the weights that aims to reduce the instantaneous risk $J_t(f)$ on each round; the resulting update scheme is (42). Furthermore, this work uses an annealing heuristic which chooses a decaying step size $\eta_t = \gamma/\sqrt{t}$, $\gamma = 0.1$. This online MR algorithm is empirical: its practicability is demonstrated by experiments, but it lacks sufficient theoretical analysis.
Mathematical Problems in Engineering
Compared with previous work, our online MR framework reinterprets the online MR process based on the notion of ascending the dual function, and it can also be used to derive different online MR algorithms. Here, we demonstrate that the update scheme in (42) can be derived from our online MR framework.
In Section 4.2, the gradient direction $\mathbf{d}$ of the overall update for ascending the dual function on round $t$ is given explicitly. Choosing a suitable direction $\mathbf{d}'$ as in (45), we must have $\langle \mathbf{d}, \mathbf{d}'\rangle > 0$, so $\mathbf{d}'$ is a feasible ascending direction that makes $D(\boldsymbol{\alpha}_t) \ge D(\boldsymbol{\alpha}_{t-1})$. Using $\mathbf{d}'$ to ascend the dual function yields the update scheme (46). Equations (42) and (46) are essentially the same update scheme with different trade-off parameters and edge weights.

Experiments and Analyses
This section presents a series of experimental results that demonstrate the effectiveness of our derived online MR algorithms. It is known that the performance of semisupervised learning depends on the correctness of the model assumptions. Thus, our focus is on comparing different online MR algorithms, rather than different semisupervised regularization methods.
7.1. Datasets and Protocols. We report experimental results on two artificial and two real-world datasets with different properties, listed in Table 1.
The artificial datasets consist of two-class problems. The method for generating the two moons dataset is available at http://manifold.cs.uchicago.edu/manifoldregularization/manifold.html; we set the radius of the two moons to 4 and the width to 2, and only one example of each class is labeled in this dataset. To demonstrate that our online MR can handle concept drift, we also perform experiments on the two rotating spirals dataset, of which 2% of the examples are labeled. Figure 3 shows that the spirals smoothly rotate 360° during the sequence, and the target boundary drifts with the sequence of examples.
The real-world datasets consist of two-class and multiclass problems. The Isolet dataset derives from the Isolet database of letters of the English alphabet spoken in isolation (available from the UCI machine learning repository). The database contains utterances of 150 subjects who spoke the name of each letter of the English alphabet twice. The speakers are grouped into 5 sets of 30 speakers each, referred to as isolet1 through isolet5. We considered the task of classifying the first 13 letters of the English alphabet from the last 13 using only isolet1 and isolet5 (1 utterance is missing in isolet5 due to poor recording). During the online MR process, all 52 utterances of one speaker are labeled and all the rest are left unlabeled. Our USPS dataset contains the USPS training set on handwritten digit recognition (preprocessed using PCA to 100 dimensions), and we apply online MR algorithms to the 45 binary classification problems that arise in pairwise classification; 5 examples are randomly labeled for each class.
Our experimental protocols are as follows.
(1) The training sequences are generated randomly from each dataset (except for two rotating spirals).
(2) The offline MR algorithm used for comparison is LapSVM [12], a state-of-the-art semisupervised learning algorithm based on manifold regularization.
(3) Each example in each dataset is trained once during online MR process.
To avoid the influence of different training sequences, all results on each dataset are the average of five such trials except for two rotating spirals (this idea is inspired by [11]).The error bars are ±1 standard deviation.
All methods use the standard RBF kernel $K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2(\sigma_K)^2}$. The edge weights are Gaussian weights which define a fully connected graph, and the edge weight parameter is $\sigma$. For comparisons between online MR algorithms, we use a buffer of size $\tau = 200$ to avoid high computational complexity. We implemented all the experiments using MATLAB.

Computational Complexity.
For offline MR, a $t \times t$ kernel matrix needs to be stored and inverted on round $t$, and the time complexity amounts to approximately $O(t^3)$ if a gradient descent algorithm is used. In contrast, the computational complexity of our online MR algorithms is determined by the buffer size $\tau$ and the number of examples in the kernel representation of the boundary vector on each round.
For our online MR without buffering strategies and sparse approximation approaches, the number of examples in the kernel representation is $t$, and the time complexity is $O(t^2)$. When a buffering strategy with a buffer size of $\tau$ is used, the time complexity reduces to $O(\tau \times t)$, but the number of examples in the kernel representation is still $t$. In practice, only part of the examples have to be stored (and computed) based on the sparse approximation. We also compare the cumulative runtime curves of five different MR algorithms on the two moons and Isolet datasets. The first is basic online MR, which uses only the stationary-step-size EA update with no buffering strategy or sparse approximation. The second is online MR with the stationary-step-size EA update and a buffer of size $\tau = 200$. The third is online MR with the stationary-step-size EA update, a buffer of size $\tau = 200$, and an absolute threshold $\varepsilon = 0.001$. The fourth is online MR with the stationary-step-size EA update, a buffer of size $\tau = 200$, and $k$-MC ($k = 400$). The last one runs offline MR (LapSVM) on each round. Figure 5 shows that online MR with buffering strategies and sparse representation has a lower runtime growth rate than basic online MR and offline MR. Online MR algorithms without buffering strategies and sparse approximation approaches are time consuming and memory consuming, and it is intractable to apply them to long-running real-world tasks.
The cumulative runtime curves of online MR with buffering strategies and sparse approximation approaches scale only linearly, while the others scale quadratically.

Accuracies.
We used the same model selection strategy both for our online MR framework and traditional offline MR algorithms.
Based on the idea of "interested in the best performance and simply select the parameter values minimizing the error" [15], we select combinations of the parameter values on a finite grid in Table 2, and it is sufficient to perform algorithm comparisons.
Having chosen an update scheme based on our online MR framework, we still have to select a step size $\eta_t$ on each learning round. We report the online MR error rate for three scenarios in this paper. (i) Stationary step size, where $\eta_t$ is held constant over all rounds. (ii) Aggressive step size $\eta_t = \min\{\eta_t^{\max}, \eta_t^*\}$. (iii) Decreasing step size $\eta_t = 0.1/\sqrt{t}$, which is also used in [11].
The best performances of all the online MR algorithms are presented in Table 3 and Figure 6. The following sections provide additional details.
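The three step size scenarios can be collected in one small helper. This is a sketch for illustration: the function name and argument names are hypothetical, the default stationary value is an arbitrary placeholder, and the two bounds used by the aggressive scheme must be supplied by the caller since their formulas are not reproduced here.

```python
import math

def step_size(scheme, t, eta_const=0.1, eta_max=None, eta_opt=None):
    """Return the step size for round t (t >= 1) under the given scheme."""
    if scheme == "stationary":
        return eta_const                  # same value on every round
    if scheme == "aggressive":
        if eta_max is None or eta_opt is None:
            raise ValueError("aggressive scheme needs eta_max and eta_opt")
        return min(eta_max, eta_opt)      # largest step that keeps the dual ascending
    if scheme == "decreasing":
        return 0.1 / math.sqrt(t)         # decaying schedule 0.1 / sqrt(t)
    raise ValueError(f"unknown scheme: {scheme}")

schedule = [step_size("decreasing", t) for t in (1, 4, 100)]
```

The decreasing schedule shrinks the update as more examples arrive, which suits a fixed target but limits how quickly the boundary can follow a drifting one.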

Additional Results
We now provide some additional results along the run of the online MR algorithms and discuss more precisely the effect of our derived online MR algorithms.

7.4.1. Effect of the Parameters $\sigma_K$, $\lambda_1$, $\lambda_2$ and the Step Size $\eta_t$. The parameters $\sigma_K$, $\lambda_1$, and $\lambda_2$ have similar effects on generalization as in the purely offline MR approach (see [12] for an empirical study).

Figure 6: Mean test error rates for 45 binary classification problems on USPS dataset. The results show that the online MR with an aggressive step size does not perform well on this dataset, and the others achieve test accuracies that are comparable to LapSVM.

The step size $\eta_t$ controls the increment of the dual function $D(\boldsymbol{\alpha})$ on each learning round. We used three different step size selection methods for algorithm comparisons in the last section. Here, we discuss the effect of the different step size selection methods.

Stationary Step Size. Under mild conditions, this seemingly naive step size selection method has acceptable error rates on any input sequence. Figure 7 shows that a large stationary step size does not perform well in online MR algorithms. When one wishes to avoid optimizing the step size on each learning round, we suggest a stationary step size with a small value.

Aggressive Step Size. Since online MR algorithms adjust the boundary according to the local geometry of the incoming point and its label, the aggressive step size selection method searches for the optimal step size to increase the dual function more aggressively on each learning round. The experiments in Table 3 and Figure 6 imply that the aggressive selection method does not perform well on all the sequences.

Decreasing Step Size. This step size selection method is based on the idea that the boundary vector approaches the optimal boundary as the online MR algorithm runs. This selection method performs well on the datasets whose target boundaries are fixed, but the experiments on the two spirals dataset show that it does not perform well for drifting target boundaries.

Increasing Dual Function $D(\boldsymbol{\alpha})$ Achieves Comparable Risks and Error Rates. We compare the primal objective function $J(\mathbf{w}_t)$ with the dual function $D(\boldsymbol{\alpha}_t)$ on the training sequence of the two moons dataset as $t$ increases. Figure 8 shows that the two curves approach each other along the online MR process using EA update ($\eta_t = 0.1$). The value of the dual function $D(\boldsymbol{\alpha}_t)$ never decreases as $t$ increases; correspondingly, the curve of the primal function $J(\mathbf{w}_t)$ has a downward trend with some small fluctuations. Our experiments support the theory in Section 3 that increasing the dual function achieves comparable risks for the primal MR problem.
We also report the performance of $\mathbf{w}_t$ on the whole dataset in Figure 9. This result shows that the decision boundary improves along the online MR process. Since online MR adjusts the decision boundary according to the label of the incoming example and the local geometry of the buffer on each learning round, the error rate of $\mathbf{w}_t$ on the whole dataset does not always decrease along the online MR process. This is also the reason why online MR can track the changes in the data sequence.
When the target concept drifts over the course of learning, the algorithms are expected to track the changes in the data sequence. In the two rotating spirals dataset, the points change their true labels during the sequence, and every stationary boundary vector has an error rate of 50%.
We show the error rates of basic online MR versus buffered online MR with different buffer sizes in Figure 10. This experiment illustrates that a suitable buffer size is able to adapt to the changing sequence and maintain a small error rate.

Conclusion and Future Directions
In this paper, we presented an online manifold regularization framework based on a dual ascending procedure. To ascend the dual function, we proposed three schemes to update the boundary on each learning round. Unfortunately, the basic online MR algorithms are time consuming and memory consuming. Therefore, we also applied buffering strategies and sparse approximation approaches to make online MR algorithms practical. Experiments show that our online MR algorithms can adjust the boundary vector with the input sequence and have risk and error rates comparable to offline MR. In particular, our online MR algorithms can handle settings where the target boundary is not fixed but drifts with the sequence of examples.
There are many interesting questions remaining in the online semisupervised learning setting. For instance, we plan to study new online learning algorithms for other semisupervised regularizers, in particular those with nonconvex risks for unlabeled examples, such as S3VMs. Another direction is how to choose effective parameters intelligently during model selection.

Figure 1: The absolute distance function and its components. The absolute function |x| (a) can be decomposed into the sum of two hinge functions [x]+ (b) and [−x]+ (c).
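The decomposition in Figure 1 can be checked numerically. The short sketch below uses illustrative function names (`hinge`, `abs_via_hinges`) that do not come from the paper.

```python
def hinge(x: float) -> float:
    """Hinge function [x]_+ = max(0, x)."""
    return max(0.0, x)


def abs_via_hinges(x: float) -> float:
    """Absolute value written as the sum of two hinges: [x]_+ + [-x]_+."""
    return hinge(x) + hinge(-x)


# For any x, exactly one of the two hinges is active (or both are zero at x = 0).
for x in [-2.5, -1.0, 0.0, 0.3, 4.0]:
    assert abs_via_hinges(x) == abs(x)
```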

Figure 2: Different functions for measuring the difference between the predictions on two examples. Standard MR uses a square function, while our online MR framework uses an absolute function, which can be decomposed into two hinge functions.

Figure 3: Two rotating spirals data sequence. We spin the two spirals dataset (top left) during the sequence so that the spirals smoothly rotate 360° every 8000 examples.
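A drifting sequence of this kind could be generated as in the sketch below. Only the rotation schedule (one full turn every 8000 examples) is taken from the caption; the spiral parameterization, the radius range, and all function names are assumptions for illustration.

```python
import math
import random


def two_spirals_point(rng, noise=0.0):
    """Sample one point from an assumed two-spirals layout; label is +1 or -1."""
    label = rng.choice([-1, 1])
    r = rng.uniform(0.2, 1.0)      # position along the spiral arm
    angle = 3.0 * math.pi * r      # the angle grows with the radius
    if label == -1:
        angle += math.pi           # second arm = first arm rotated by 180 degrees
    x = r * math.cos(angle) + rng.gauss(0.0, noise)
    y = r * math.sin(angle) + rng.gauss(0.0, noise)
    return (x, y), label


def rotating_spirals(n, period=8000, seed=0):
    """Yield n labeled points; the whole dataset turns 360 degrees every `period` examples."""
    rng = random.Random(seed)
    for t in range(n):
        (x, y), label = two_spirals_point(rng)
        theta = 2.0 * math.pi * t / period
        c, s = math.cos(theta), math.sin(theta)
        yield (c * x - s * y, s * x + c * y), label


sequence = list(rotating_spirals(16000))  # two full rotations
```

In this sketch, after half a period each arm occupies the position the other arm started from, which is why a boundary that never moves cannot stay correct over the whole sequence.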
Figure 4 shows the number of examples in the kernel representation of the boundary vector for different sparse approximation approaches.

Figure 4: The number of examples in the kernel representation of the boundary vector for different sparse approximation approaches. This experiment is on the two moons dataset, which has 4000 examples. If no sparse approximation approach is used in online MR, the kernel representation contains all the input examples. With an absolute threshold, the number of examples in the kernel representation of the boundary vector increases slowly; with -MC, the number is at most the -MC parameter plus the buffer size.

Figure 5: Cumulative runtime growth curves. (a) Experiments on a generated two moons dataset containing 4000 examples. (b) Experiments on the Isolet dataset, which is high-dimensional. The curves show similar trends on the different datasets. Online MR algorithms with buffering strategies and sparse representation have a lower runtime growth rate than the others.
However, one has to try many choices of parameters during model selection. The manifold regularizer incorporates unlabeled examples and causes the decision vector to adjust to the geometry of the training examples as the weight of the manifold regularizer (the second input scalar) is increased. If this weight is 0, the unlabeled examples are disregarded and online MR degenerates into online supervised learning.
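The effect of the manifold regularizer weight can be illustrated with a small linear sketch. The objective below is an assumed form (squared norm, hinge loss on labeled examples, weighted absolute prediction differences along edges); the exact objective and scaling used in the paper are not restated in this section, and the names `mr_objective`, `lam1`, and `lam2` are hypothetical.

```python
def hinge(z):
    """Hinge function max(0, z)."""
    return max(0.0, z)


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))


def mr_objective(w, examples, edges, lam1, lam2):
    """Assumed primal objective for a linear boundary vector w.

    examples: list of (x, y, labeled); y is ignored when labeled is False.
    edges:    list of (i, j, weight) connecting example indices i and j.
    """
    reg = 0.5 * dot(w, w)
    fit = sum(hinge(1.0 - y * dot(w, x)) for x, y, labeled in examples if labeled)
    smooth = sum(
        wij * abs(dot(w, examples[i][0]) - dot(w, examples[j][0]))
        for i, j, wij in edges
    )
    return reg + lam1 * fit + lam2 * smooth


w = [1.0, 0.0]
examples = [([2.0, 0.0], 1, True), ([-0.5, 0.0], 0, False), ([0.5, 0.0], 0, False)]
edges = [(1, 2, 1.0)]  # one edge between the two unlabeled examples

supervised_only = mr_objective(w, examples, edges, lam1=1.0, lam2=0.0)
with_manifold = mr_objective(w, examples, edges, lam1=1.0, lam2=0.2)
```

With `lam2=0.0` the unlabeled examples contribute nothing, so the value depends only on the labeled example and the norm of `w`; raising `lam2` penalizes boundary vectors whose predictions differ across connected examples.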
Figure legend: overall update (aggressive step size); overall update (stationary step size); overall update (decreasing step size 0.1/√t); panel (b).

Figure 10: Error rates with different buffer sizes on the two spirals data sequence. The buffer size affects the ability of online MR to track the changes in the data sequence.

INPUT: two positive scalars:  1 and  2 ; edge weights   .
INITIALIZE: a coefficient vector  0 and its associated decision boundary vector  0 .
PROCESS: For t = 1, 2, ...,
    Receive an example (x  ,   ,   ),
    Get   using the process in Algorithm 2,
    Choose a step size   ∈ [0, min{ max
    Update the boundary vector using (27),
    If   = 0, predict ŷ = sign(⟨  , x  ⟩),
    Renew the buffer.
Algorithm 3: Online manifold regularization algorithm based on EA update.

INPUT: the absolute threshold ; the kernel representation of boundary on round t:   = ∑ ∈[] (  )  Φ (x  ),
PROCESS: For each x  in   
    If x  is not in the buffer and     (  )      < , discard the example x  and its associated coefficient (  )  .
Return a new boundary   .
Algorithm 7: The process of sparse approximation based on absolute threshold. This process only deals with the examples which will not be updated in the further update process.

INPUT: the parameter ; the kernel representation of boundary on round t:   = ∑ ∈[] (  )  Φ (x  ),
PROCESS: For each x  in    and not in the buffer
    If (  )  does not belong to the first  maximum of the coefficients, discard the example x  and its associated coefficient (  )  .
Return a new boundary   .
Algorithm 8: The process of sparse approximation based on -MC. The kernel representation for    contains  +  examples at most in this condition, where  is the buffer size.
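The two sparse approximation processes (Algorithms 7 and 8) can be sketched as follows. The data layout is an assumption: the kernel representation is held as a dictionary from example index to coefficient, and the buffer as a set of indices. Ranking by coefficient magnitude in the second function is also an assumption, as are the names `prune_by_threshold`, `prune_keep_k_max`, `eps`, and `k`.

```python
def prune_by_threshold(coeffs, buffer_ids, eps):
    """Drop non-buffer examples whose coefficient magnitude is below eps."""
    return {
        i: a for i, a in coeffs.items()
        if i in buffer_ids or abs(a) >= eps
    }


def prune_keep_k_max(coeffs, buffer_ids, k):
    """Keep buffer examples plus the k non-buffer examples with largest |coefficient|."""
    outside = [i for i in coeffs if i not in buffer_ids]
    ranked = sorted(outside, key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k])
    return {i: a for i, a in coeffs.items() if i in buffer_ids or i in keep}


# Hypothetical coefficients; example 4 is still in the buffer and is never pruned.
coeffs = {0: 0.9, 1: -0.01, 2: 0.05, 3: -0.7, 4: 0.002}
buffer_ids = {4}

thresholded = prune_by_threshold(coeffs, buffer_ids, eps=0.03)
top_k = prune_keep_k_max(coeffs, buffer_ids, k=2)
```

Both functions leave buffered examples untouched because their coefficients may still change in later updates. The second function bounds the representation size by `k` plus the buffer size, matching the bound stated for Algorithm 8.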

Table 1: Different datasets in our experiments. The datasets differ in number of classes, dimensionality, and size.

Table 2: A finite grid of parameter values. We find the best performance of each online MR algorithm on this finite grid.

Table 3: Mean test error rates on different datasets. The error rates are reported for three step size selection methods in the form stationary step size/aggressive step size/decreasing step size. The results show that our derived online MR algorithms achieve test accuracy comparable to offline MR. In particular, the experiments on two rotating spirals show that our online MR is able to track the changes in the sequence and maintain a much better error rate than offline MR. The performance of the online MR algorithms is competitive with that of state-of-the-art offline MR.