Online Coregularization for Multiview Semisupervised Learning

We propose a novel online coregularization framework for multiview semisupervised learning based on the notion of duality in constrained optimization. Using the weak duality theorem, we reduce online coregularization to the task of increasing the dual function. We demonstrate that the existing online coregularization algorithms in previous work can be viewed as approximations of our dual ascending process using gradient ascent. New algorithms are derived based on the idea of ascending the dual function more aggressively. For practical purposes, we also propose two sparse approximation approaches for the kernel representation to reduce computational complexity. Experiments show that our derived online coregularization algorithms achieve risk and accuracy comparable to offline algorithms while consuming less time and memory. In particular, our online coregularization algorithms are able to deal with concept drift and maintain a much smaller error rate. This paper paves the way for the design and analysis of online coregularization algorithms.


Introduction
Semi-supervised learning (S2L) is a relatively new subfield of machine learning that has become a popular research topic over the last two decades [1][2][3][4][5][6]. Unlike standard supervised learning (SL), the S2L paradigm learns from both labeled and unlabeled examples. In this paper, we investigate online semi-supervised learning (OS2L) problems with multiple views, which have four features: (1) data is abundant, but the resources to label it are limited; (2) data arrives in a stream and cannot all be stored; (3) the target functions in each view agree on the labels of most examples (compatibility assumption); (4) the views are independent given the labels (independence assumption).
OS2L algorithms proceed in a sequence of consecutive rounds. On each round, the learner is given a training example and is required to predict its label if the example is unlabeled. To label examples, the learner uses a prediction mechanism that builds a mapping from the set of examples to the set of labels. The quality of an OS2L algorithm is measured by the cumulative loss it incurs along its run. The challenge of OS2L is that we do not observe the true label of unlabeled examples and so cannot evaluate the performance of the prediction mechanism on them. Thus, if we want to update the prediction mechanism, we have to rely on indirect forms of feedback.
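The round-based protocol above can be sketched in code. This is a generic illustration, not the paper's algorithm: all names are hypothetical, the labeled update uses the hinge loss, and the unlabeled update stands in for whatever indirect feedback the learner uses.

```python
# Minimal sketch of the OS2L protocol (illustrative names, not from the paper).
# On each round the learner predicts; the true label is revealed only when the
# flag l == 1, so on unlabeled rounds the learner must rely on indirect
# feedback (here delegated to a caller-supplied `update_unlabeled`).

def run_os2l(stream, predict, update_labeled, update_unlabeled):
    """stream yields (x, y, l); y is ignored when l == 0."""
    cumulative_loss = 0.0
    for x, y, l in stream:
        y_hat = predict(x)
        if l == 1:  # label observed: evaluate and learn directly
            cumulative_loss += max(0.0, 1.0 - y * y_hat)  # hinge loss
            update_labeled(x, y)
        else:       # label hidden: indirect (e.g., agreement-based) feedback
            update_unlabeled(x)
    return cumulative_loss
```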
Many OS2L algorithms have been proposed in recent years (see the surveys in [7, 8]). A popular idea is to define an instantaneous risk function and decrease its value in an online manner, avoiding direct optimization of the primal semisupervised problem [9][10][11]. References [12][13][14] instead treat the OS2L problem as online semi-supervised clustering, where some pairs of examples carry must-link constraints (same cluster) and others carry cannot-link constraints (different clusters), but the effectiveness of these methods is often degraded by "bridge points" (see the survey in [15]).
Coregularization [2, 16] is a method for improving the generalization accuracy of SVMs [17] by using unlabeled data in different views. Multiple hypotheses are trained in the coregularization framework and are required to make similar predictions on any given unlabeled example. Moreover, theoretical investigations demonstrate that the coregularization approach reduces the Rademacher complexity by an amount that depends on the "distance" between the views [18, 19]. Unfortunately, basic offline coregularization algorithms are still unable to deal with long-running, large-scale OS2L problems directly because of time and memory constraints.
In this paper, we introduce a novel online coregularization framework for the design and analysis of new OS2L algorithms. Since decreasing the primal coregularization objective function is impossible before obtaining all the training examples, we apply a Fenchel conjugate transform and increase the dual function incrementally. The existing online coregularization algorithms in previous work can be viewed as approximations of this dual ascending process based on gradient ascent. New online coregularization algorithms are derived based on the idea of ascending the dual function more aggressively. We also discuss the applicability of our framework to settings where the target hypothesis is not fixed but drifts with the sequence of examples.
To the best of our knowledge, the closest prior work is that of de Ruijter and Tsivtsivadze [10]. Their method defines an instantaneous regularized risk function over a subset of the examples to avoid optimizing the primal coregularization problem directly. The learning process is based on convex programming with stochastic gradient descent in kernel space. The update scheme of that work can also be derived from our online coregularization framework.
The rest of the paper is organized as follows. In Section 2 we begin with a primal view of the multiview semisupervised learning problem based on coregularization. In Section 3 our new framework for designing and analyzing online coregularization algorithms is introduced. Next, in Section 4, we demonstrate that the existing online coregularization algorithms can be derived from our framework using gradient ascent. New online coregularization algorithms are derived based on aggressive dual ascending procedures in Section 5. Experiments and analyses are presented in Section 6. In Section 7, conclusions and possible extensions of our work are given.

Basic Problem Setting
Our notation and problem setting are formally introduced in this section. Italic lower case letters refer to scalars (e.g., y and l), and bold letters refer to vectors (e.g., w and x). (x_t, y_t, l_t) denotes the t-th training example, where x_t = (x_t^(1), x_t^(2), ..., x_t^(m)) is seen in m views with x_t^(v) ∈ X^(v) (v ∈ {1, 2, ..., m}), y_t is its label, and l_t is a flag that determines whether the label can be seen. If l_t = 1, the example is labeled, and if l_t = 0, the example is unlabeled. The hinge function is denoted by [z]_+ = max{z, 0}. ⟨w, x⟩ denotes the inner product between vectors w and x.
In previous approaches based on coregularization [16, 19], the distance function d(·, ·) between the per-view predictions is often defined as a squared difference, d(f^(1)(x), f^(2)(x)) = (f^(1)(x) − f^(2)(x))^2. In this paper, the distance function is instead defined as an absolute difference (using the 1-norm), d(f^(1)(x), f^(2)(x)) = |f^(1)(x) − f^(2)(x)| (this idea is also adopted by Szedmak and Shawe-Taylor [18] and Sun et al. [20]). Furthermore, the absolute distance in (3) is composed of two hinge functions, |a − b| = [a − b]_+ + [b − a]_+ (see Figure 1 for an illustration). In the next section, we will show that the online coregularization problem can be discussed more easily and directly in the dual form of (1) when the absolute distance function is used. Denote the instantaneous loss on round t, t ∈ {1, 2, ..., T}, as in (5). We thus obtain a simplified version of (1) using (5), given in (6).
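The hinge decomposition of the absolute distance can be checked numerically; a minimal sketch:

```python
# Numeric check that the absolute distance decomposes into two hinge
# functions, as in (4): |a - b| = [a - b]_+ + [b - a]_+.

def hinge(z):
    return max(z, 0.0)

def abs_distance(a, b):
    return hinge(a - b) + hinge(b - a)
```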
The minimization problem of (6) in an online manner is what we consider in the rest of this paper.

Online Coregularization by Ascending the Dual Function
In this section, we propose a unified online coregularization framework for multiview semi-supervised binary classification problems. Our presentation reveals how the multiview S2L problem based on coregularization in Section 2 can be optimized in an online manner. Before describing our framework, let us recall the definition of the Fenchel conjugate, which we use as a main analysis tool in this paper (see the appendix for more details). The Fenchel conjugate of a function f : dom f → R is defined as f*(θ) = sup{⟨θ, w⟩ − f(w) : w ∈ dom f}. (7)
As shown in (7), the Fenchel conjugate is classically defined only for functions of a single variable. In this paper, we extend the definition of the Fenchel conjugate to multivariable functions in order to solve the online multiview S2L problem.
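The supremum in (7) can be approximated on a grid for a standard single-variable example. The function f(w) = w^2/2, whose conjugate is f*(θ) = θ^2/2, is a textbook self-conjugate case and is not specific to this paper:

```python
# Numerical illustration of the Fenchel conjugate in (7): for f(w) = w^2 / 2
# the conjugate is f*(theta) = theta^2 / 2. The supremum over w is
# approximated by a brute-force search on a grid.

def fenchel_conjugate(f, theta, grid):
    return max(theta * w - f(w) for w in grid)

grid = [i / 100.0 for i in range(-500, 501)]  # w in [-5, 5], step 0.01
f = lambda w: 0.5 * w * w
approx = fenchel_conjugate(f, 2.0, grid)      # exact value: 2.0^2 / 2 = 2.0
```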
An equivalent, constrained form of (6) can be stated, and using the Lagrange dual function we can rewrite it (see (12)) by introducing a group of dual coefficient vectors. This yields a dual function in which the Fenchel conjugate of the instantaneous loss appears. By weak duality, the primal problem can then be approached by maximizing this dual function over the coefficient vectors.
Based on our definition of the Fenchel conjugate for multivariable functions, the Fenchel conjugate of the instantaneous loss can be rewritten as in (16) (based on Proposition 1 and the lemmas in the appendix). Since our goal is to maximize the dual function, we can restrict attention to the first case in (16). The conjugate has three associated coefficient variables per example, indexed by 0, 1, and 2.
Algorithm 1: A template online co-regularization algorithm for multiview semi-supervised binary classification problems. This template algorithm aims to increase the dual function on each learning round.
Based on the previous analysis, the dual function can be rewritten using new coefficient vectors, as in (17). As shown in (17), our task has been transformed into a constrained quadratic programming (QP) problem. Every input training example brings one coefficient vector into the dual function. These vectors are mutually independent, so we can update the vector group on each learning round to ascend the dual function incrementally. Obviously, unobserved examples have no influence on the value of the dual function in (17), since their associated coefficient variables are set to zero.
Denote the coefficient vector group on round t (t ∈ {1, 2, ..., T}). The update of the coefficient vector group on round t should satisfy two conditions. The first means that unobserved examples have no influence on the value of the dual function, and the second means that the value of the dual function never decreases during the online coregularization process. The dual function on round t can therefore also be written as in (19). Based on Lemmas A.1 and A.3 in the appendix, each coefficient vector group has an associated boundary vector group, one boundary per view; on round t, the associated boundaries are given in (20). To summarize, we present a template online coregularization algorithm based on this dual ascending procedure in Algorithm 1.
Essentially, our online coregularization framework breaks the large QP in the primal objective function into a series of dual ascending procedures, one per learning round. Therefore, we can ascend the dual function in an online manner.
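Algorithm 1 itself is not reproduced here, but its accept-only-nondecreasing logic can be sketched generically. All names below are illustrative; `dual` stands in for the dual function in (17), and candidate values would come from a concrete update rule:

```python
# Schematic per-round dual ascent: try candidate values for the coefficient
# associated with the current round, and keep the best one that does not
# decrease the dual function, mirroring Algorithm 1's two conditions.

def dual_ascent_round(coeffs, t, candidates, dual):
    """Try candidate values for coefficient t; keep the best nondecreasing one."""
    best, best_val = coeffs[t], dual(coeffs)
    for c in candidates:
        trial = dict(coeffs)
        trial[t] = c
        v = dual(trial)
        if v > best_val:  # accept only increases of the dual
            best, best_val = c, v
    coeffs[t] = best
    return best_val
```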

Analysis of Previous Work Based on Gradient Ascent in the Dual
In the previous section, a template algorithmic framework for online coregularization was proposed based on the idea of ascending the dual function. From Algorithm 1, we see that algorithms derived from our framework may vary in one of two ways. First, different algorithms may update different dual variables on each learning round. Second, they may differ in how the chosen variables are updated to ascend the dual function. Several online coregularization algorithms [9, 10] have been suggested in recent years. These approaches share the idea of "defining an instantaneous coregularized risk to avoid optimizing the primal coregularization problem directly." In these works, there are two popular instantaneous coregularized risk functions over (f^(1), f^(2)). The online coregularization process in these works is based on convex programming with gradient descent on the instantaneous coregularized risk function in kernel space. The step size is often defined to decay at a certain rate [11], for example, η_t = 1/√t. In the following, we demonstrate that these algorithms can be derived from our online coregularization framework. Since the dual coefficient vectors are independent, the dual function can be ascended by updating only the coefficient vector associated with the newly arrived training example (x_t^(1,2), y_t, l_t) on round t; the task on round t can then be rewritten as ascending (23). Using a gradient ascent (GA) step, the update process on round t can be written as in (24), where η_t ≥ 0 is a step size.
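The decaying-step-size gradient update mentioned above can be illustrated on a toy concave objective. The objective g(a) = −(a − 3)^2 and all constants are illustrative and unrelated to the paper's dual function:

```python
# Gradient ascent with the decaying rate eta_t = 1 / sqrt(t) discussed above,
# applied to the toy concave objective g(a) = -(a - 3)^2 whose gradient is
# -2 * (a - 3). The iterate approaches the maximizer a = 3.
import math

def gradient_ascent(grad, a0, rounds):
    a = a0
    for t in range(1, rounds + 1):
        a += (1.0 / math.sqrt(t)) * grad(a)  # eta_t = 1 / sqrt(t)
    return a

a_final = gradient_ascent(lambda a: -2.0 * (a - 3.0), 0.0, 200)
```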
In fact, the dual coefficient vectors of the previously observed examples can also be updated in (23). Since there are t − 1 such vectors, it is impractical to update each of them individually. We therefore introduce a single new scalar variable into (23). From (25), we can see that a gradient ascent update on this variable amounts to multiplying all the previous dual coefficient vectors by one minus its value. Since every previous dual coefficient variable is constrained, we also constrain this variable to [0, 1], with initial value zero. Applying gradient ascent to it, we obtain the update in (26). Based on the previous analysis, the gradient ascent update of the boundary vector group can be written as in (27). As far as we know, all the existing online coregularization algorithms in previous work can be viewed as approximations of our dual ascending process using gradient ascent as in (27).

Deriving New Algorithms Based on Aggressive Dual Ascending (ADA) Procedures
In the previous section, we showed that the online coregularization algorithms in previous work can be derived from our framework. These algorithms lead to a conservative increase in the value of the dual function, since they only modify a single dual vector using gradient ascent on each learning round. In fact, more aggressive online coregularization algorithms can also be derived from our framework. In this section we describe broader and, in practice, more powerful online coregularization algorithms which increase the dual function more aggressively on each learning round. The motivation for the new algorithms is as follows: intuitively, update schemes that yield larger increases in the dual function are likely to reach the minimum of the primal objective function faster, and thus are likely in practice to suffer fewer mistakes.

Updating Single Dual Coefficient Vector
The update scheme described in Section 4 for increasing the dual function modifies the coefficient vector associated with the newly arrived training example via gradient ascent, with all variables in the vector sharing the same step size. This simple algorithm can be enhanced by solving the optimization problem (28) on each learning round. Depending on the type of the newly arrived example, (28) can be solved in different ways. If the example is labeled, we have l_t = 1, and the task on round t can be rewritten as (29); since the associated coefficients lie in [0, 1], we obtain the closed-form solution (30). Otherwise, if the example is unlabeled, we have l_t = 0, and (28) can be rewritten as (31); since the associated coefficient lies in [−1, 1], we obtain the closed-form solution (32). Based on this analysis, the update of the boundary vector group is given in (33).
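The closed-form flavor of this update can be sketched as a clipped one-dimensional maximization. This is a generic illustration under the assumption that each coefficient enters the dual as a concave quadratic; the coefficients p and r are stand-ins, not the paper's exact quantities:

```python
# Sketch of an aggressive single-coefficient update: for a concave quadratic
# piece of the dual, q(a) = -p*a^2 + r*a with p > 0, the unconstrained
# maximizer r / (2p) is projected onto the feasible box, e.g. [0, 1] for a
# labeled example or [-1, 1] for an unlabeled one.

def clipped_argmax(p, r, lo, hi):
    a_star = r / (2.0 * p)           # unconstrained maximizer of -p*a^2 + r*a
    return min(max(a_star, lo), hi)  # project onto [lo, hi]
```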
In contrast to the gradient approaches in Section 4, this approach ascends the dual function more aggressively. So far, our focus has been on an update which modifies a single dual coefficient vector. In fact, all the coefficient vectors associated with previously arrived examples can be updated during the online coregularization process. We now examine another update scheme based on our framework that modifies multiple dual coefficient vectors on each learning round.

Updating Multiple Dual Coefficient Vectors
In this scheme, old coefficients are scaled by a weight that acts like a forgetting factor [21], downweighting the contribution of observations whose indices do not belong to the current update set. When tracking changes in the data stream, recent observations are likely to be more indicative of the current concept than more distant ones. Incorporating a forgetting factor into online learning algorithms is a good way to balance old and new observations. Setting the factor to zero for every index indicates that no forgetting is to occur.
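A minimal sketch of such a forgetting step, under the assumption that coefficients are stored in a list and scaled by a hypothetical factor gamma each round (names are illustrative):

```python
# Forgetting-factor sketch: coefficients outside the current update set are
# scaled by (1 - gamma) each round, so the influence of distant observations
# decays geometrically. gamma = 0 reproduces "no forgetting", as noted above.

def apply_forgetting(coeffs, update_indices, gamma):
    for i in range(len(coeffs)):
        if i not in update_indices:  # indices being updated keep full weight
            coeffs[i] *= (1.0 - gamma)
    return coeffs
```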
For practical purposes, we test two choices of this factor for updating multiple dual coefficient vectors in this paper.

Sparse Approximations for Kernel Representation.
In practice, kernel functions are often used to obtain linear classifiers in feature space, as in SVMs. Our online coregularization framework involves examples only through inner products, so we can easily introduce kernel functions into it: defining the kernel matrix via a feature map Φ, each x_t can be replaced by Φ(x_t) in our framework, and (19) can be rewritten accordingly. Unfortunately, our previously derived online coregularization algorithms with kernel functions have to store the example sequence up to the current round, and the stored matrix size is T × T in the worst case. For practical purposes, we present two approaches to sparsify the kernel representation of the boundaries on each learning round.
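As a concrete illustration of the kernel substitution (a generic sketch, not the paper's exact update; `kernel_predict` and `gaussian` are illustrative names), a boundary stored as coefficient-example pairs is evaluated through the kernel alone, which is why only examples with nonzero coefficients need to be kept:

```python
# Kernel-expansion sketch: a boundary of the form f(x) = sum_i a_i k(x_i, x)
# is evaluated using only the stored examples and their coefficients.
import math

def kernel_predict(alphas, stored_x, kernel, x):
    return sum(a * kernel(xi, x) for a, xi in zip(alphas, stored_x))

def gaussian(u, v, s=1.0):
    # 1-D Gaussian (RBF) kernel, as used for the two-moons view.
    return math.exp(-((u - v) ** 2) / (2.0 * s * s))
```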
Absolute Threshold. To construct a sparse representation of the boundaries, the absolute threshold approach discards examples whose associated coefficients are close to zero. Once an arrived example would no longer be used to update the boundaries in the further learning process, it is discarded if the absolute value of its associated coefficient falls below the threshold. Examples whose indices are in the current update set cannot be discarded on round t, since they would be used to ascend the dual function.
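A sketch of this thresholding step, with coefficients kept in a dict keyed by example index; `protected` stands in for the current update set (names are illustrative):

```python
# Absolute-threshold sparsification: drop stored examples whose coefficient
# magnitude is below epsilon, except those still needed for the current
# round's dual ascent.

def absolute_threshold(coeffs, epsilon, protected):
    return {i: a for i, a in coeffs.items()
            if abs(a) >= epsilon or i in protected}
```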
Maximal Coefficients (MC). Another way to sparsify the kernel representation is to keep only the examples whose associated coefficients have the largest absolute values. As with the absolute threshold, this approach does not discard the examples in the current update set, which would be used to ascend the dual function on round t. With this sparse approximation, the stored matrix size on round t is reduced to the number of kept examples plus the size of the current update set, in each dimension.
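A corresponding sketch of the maximal-coefficients scheme; the number of kept examples, here named k, is a stand-in parameter, and `protected` again stands in for the current update set:

```python
# Maximal-coefficients sparsification: keep the k stored examples with the
# largest coefficient magnitudes, plus any indices needed for the current
# round's dual ascent.

def k_max_coefficients(coeffs, k, protected):
    ranked = sorted(coeffs, key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(ranked[:k]) | protected
    return {i: coeffs[i] for i in keep}
```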
The two sparse approximations above are both motivated by the fact that examples with larger coefficients tend to exert more influence on the learned boundaries.

Experiments
This section presents a series of experimental results demonstrating the effectiveness of our derived online coregularization algorithms. It is known that the performance of semisupervised learning depends on the correctness of the model assumptions. Thus, our focus is on comparing different online coregularization algorithms with multiple views, rather than different semi-supervised regularization methods.
We report experimental results on two synthetic binary classification problems and one real-world binary classification problem. The prediction function in the online coregularization algorithms is taken as the average of the prediction functions from the two views. Following the idea of being "interested in the best performance and simply select the parameter values minimizing the error" [3], we select combinations of parameter values on the finite grid in Table 1, which is sufficient for algorithm comparisons.
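A minimal sketch of this combined predictor, with stand-in per-view functions (names are illustrative):

```python
# Combined predictor: average the two per-view prediction functions and take
# the sign, as described above for the experiments.

def combined_predict(f1, f2, x1, x2):
    score = 0.5 * (f1(x1) + f2(x2))
    return 1 if score >= 0 else -1
```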

Two-Moons-Two-Lines Synthetic Data Set
This synthetic data set is generated similarly to the toy example used in [16, 19], in which examples of the two classes appear as two moons in one view and two oriented lines in the other (see Figure 3 for an illustration). The data set contains 2000 examples, and only 5 examples per class are labeled. A Gaussian kernel and a linear kernel are chosen for the two-moons and two-lines views, respectively. On this data set, the offline coregularization algorithm (CoLapSVM) [16] achieves an error rate of 0.
The best performance of all the online coregularization algorithms in Section 5 is presented in Table 2. We also provide some additional details during the online coregularization process.
We compare the cumulative runtime curves of online coregularization algorithms with different sparse approximation approaches in Figure 4. Online coregularization algorithms with sparse representations grow more slowly than the basic online coregularization algorithms: their cumulative runtime scales only linearly, while that of the others scales quadratically.
We also compare, for the different sparse approximation approaches, the number of examples in the kernel representation of the boundary vectors in the two views on each learning round. Figure 5 shows that only part of the examples have to be stored (and used in computation) when sparse approximation approaches are applied. Online coregularization algorithms without sparse approximation are time- and memory-consuming, and applying them to long-running real-world tasks is intractable.
In Section 3, we demonstrated the relationship between the primal objective function and the dual function. We compare the primal objective function with the dual function on the training sequence of the two-moons-two-lines data set as t increases in Figure 6. The result shows that the two curves approach each other as the online coregularization algorithms run. The value of the dual function never decreases as t increases; correspondingly, the curve of the primal function has a downward trend with some small fluctuations. We also observe that the primal objective curve of the variant which updates multiple dual coefficient vectors on each learning round shows a smoother downward trend and fluctuates less. This experiment supports the theory that increasing the dual function achieves risks comparable to minimizing the primal objective function.
We report the performance of the learned boundaries on the whole two-moons-two-lines data set in Figure 7. This result shows that the boundary vector is adjusted toward a better one as the online coregularization algorithms run. Since our algorithms adjust the decision boundary according to the local agreement of the two views on each learning round, the error rate curve is not always decreasing during the online coregularization process. This is also the reason why our online coregularization algorithms can track changes in the data sequence (more detail in Section 6.3). Similar to the experiments in Figure 6, we observe that the error rate of the variant which updates multiple dual coefficient vectors on each learning round shows a smoother downward trend and fluctuates less.
Table 2: Mean test error rates on the two-moons-two-lines synthetic data set. The error rates are reported for three different sparse approximations. For gradient ascent, we choose a decaying step size of 0.1/√t. The results show that our derived online co-regularization algorithms achieve test accuracy comparable to offline co-regularization (CoLapSVM). The online co-regularization algorithms based on aggressive dual ascending procedures perform better than those based on gradient ascent.

Web Page Data Set.
This real-world data set was also used by Sun and Shawe-Taylor [19]. The task is to predict whether a web page is a course home page or not. The data set consists of 1051 web pages in two views (page and link) collected from the computer science department web sites of four U.S. universities: Cornell, University of Washington, University of Wisconsin, and University of Texas. The first view is the textual content of a web page itself, and the second view consists of the links pointing to the web page from other pages. We preprocessed each view by removing stop words, punctuation, and numbers and then applied Porter's stemming to the text [22]. The problem has an unbalanced class distribution, with 230 course home pages and 821 non-course pages in total. In addition, words that occur in five or fewer documents were ignored. This resulted in 2332- and 87-dimensional vectors for the two views, respectively. Finally, document vectors were normalized to TF-IDF features (the product of term frequency and inverse document frequency) [23]. As in [19], we randomly label 3 course and 9 non-course examples. In this experiment, the linear kernel is used for both views.
On this data set, the offline coregularization algorithm (CoLapSVM) [16] achieves an error rate of 6.32%. In Table 3, we report the best performance of all the online coregularization algorithms on the web page data set.

Rotating Two-Moons-Two-Lines Synthetic Data Set.
When the underlying distributions P(x^(1,2)) and P(y | x^(1,2)) change during the course of learning, the algorithms are expected to track the changes in the data sequence. In this subsection, we test the applicability of our framework to settings where the target hypotheses in different views are not fixed but drift with the sequence of examples. To demonstrate that our online coregularization algorithms can handle concept drift, we run our experiments on a rotating two-moons-two-lines data sequence. This data set contains 8000 examples, and only 1% of the examples per class are labeled. Figure 8 shows that the two-moons-two-lines data set smoothly rotates by 360° over the sequence, and the target boundaries in the two views drift with the sequence of examples. In the rotating two-moons-two-lines data set, points change their true labels during the sequence, so any stationary decision boundary has an error rate of approximately 50%. A Gaussian kernel and a linear kernel are chosen for the rotating two-moons and rotating two-lines views, respectively.
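Such drift can be generated with a simple planar rotation applied to each example; this is an illustrative sketch, not the authors' exact generator:

```python
# Rotate a 2-D point by a round-dependent angle, so the true decision
# boundary moves smoothly with the stream, as in the rotating data set.
import math

def rotate(point, degrees):
    theta = math.radians(degrees)
    x, y = point
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))
```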
In Table 4, we report the best performance of all the online coregularization algorithms on the rotating two-moons-two-lines data sequence. In this experiment, we also discuss the effect of the buffer size on tracking the changes in the data sequence.
Obviously, when tracking the changes in the rotating two-moons-two-lines data sequence, recent examples are likely to be more indicative of the boundaries than more distant ones. The Buffer- variant updates the boundaries using the most recent examples, which is also why ADA (Buffer-) performs better than the other online coregularization algorithms. We report the error rates of ADA (Buffer-) with different buffer sizes on the rotating two-moons-two-lines synthetic data sequence in Figure 9. This experiment illustrates that a suitable buffer size is able to adapt to the changing sequence and maintain a small error rate. Table 4: Mean test error rates on the rotating two-moons-two-lines synthetic data sequence. The error rates are reported for three different sparse approximations. For gradient ascent, we choose a stationary step size of 0.1. The results show that our derived online co-regularization algorithms are able to track the changes in the sequence and maintain a smaller error rate compared with batch learning algorithms. In particular, ADA (Buffer-) performs better than the other online co-regularization algorithms.

Conclusion and Further Discussion
In this paper we presented an online coregularization framework based on the notion of ascending the dual function. We demonstrated that the existing online coregularization algorithms in previous work can be viewed as approximations of our dual ascending process using gradient ascent. New online coregularization algorithms were derived based on aggressive dual ascending procedures. For practical purposes, we proposed two sparse approximation approaches for the kernel representation to reduce computational complexity. Experiments showed that our online coregularization algorithms can adjust the boundary vector with the input sequence and achieve risk and error rates comparable to offline algorithms. In particular, our online coregularization algorithms can handle settings where the target boundaries are not fixed but drift with the sequence of examples.
There are many interesting questions remaining in the online semi-supervised learning setting. For instance, we plan to study new online learning algorithms for other semisupervised learning models. Another direction is how to choose effective combinations of the parameter values more intelligently during the online coregularization process.

Fenchel Conjugate
The Fenchel conjugate of a function f : S → R is defined as f*(θ) = sup{⟨θ, w⟩ − f(w) : w ∈ S}. Since f* is defined as a supremum of linear functions, it is convex. Here, we state a few lemmas about the Fenchel conjugate which we use as theoretical tools in this paper. More details can be found in [24].