Generalization Bounds for Coregularized Multiple Kernel Learning

Multiple kernel learning (MKL), as an approach to automated kernel selection, plays an important role in machine learning. Several learning theories have been developed to analyze the generalization of multiple kernel learning. However, little work has studied multiple kernel learning in the framework of semisupervised learning. In this paper, we analyze the generalization of multiple kernel learning in the framework of semisupervised multiview learning. We apply Rademacher chaos complexity to control the performance of the candidate class of coregularized multiple kernels and obtain a generalization error bound for coregularized multiple kernel learning. Furthermore, we show that existing results on multiple kernel learning and coregularized kernel learning can be regarded as special cases of our main results.


Introduction
Kernel-based learning makes it possible to carry out nonlinear machine learning tasks with linear methods. In real applications, selecting a suitable kernel for kernel-based learning is an important and difficult task. To this end, an approach named multiple kernel learning has been developed, which automatically chooses the best kernel from a predefined kernel class. The earliest work on multiple kernel learning can be traced back to [1], where the authors proposed to automatically select a linear combination of candidate kernels for support vector machines based on a semidefinite programming approach. The theoretical generalization analysis of multiple kernel learning has been widely studied [1-7]. In particular, Ying and Campbell in [2] proposed a novel generalization bound based on Rademacher chaos complexity for the study of multiple kernel learning. However, the discussions in [2] were limited to single-view supervised learning. In this paper, we employ the Rademacher chaos complexity proposed in [2] to study the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework.
Semisupervised multiview learning is an area of machine learning in which models are trained with both labeled and unlabeled samples, and the unlabeled samples help to reduce the number of labeled samples required. It supposes that the training samples can be represented by multiple views. The coregularized least squares algorithm, a semisupervised version of regularized least squares with two views, is a typical multiview learning model that uses the unlabeled samples to estimate the view incompatibility of models [8,9]. Rosenberg in [10] extended the coregularized least squares algorithm to the case of kernel cotraining. Brefeld et al. and Rosenberg in [11,12,13] discussed the generalization bound of kernel-based learning with multiple (or two) views in the semisupervised learning framework. However, the discussions in [11,12,13] supposed that the kernel used to construct the reproducing kernel Hilbert space is predefined. Therefore, their results cannot be applied to the analysis of multiple kernel learning.
In this paper, we discuss the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework, and we show that the results in [2] and [11] can be regarded as special cases of our main results. The rest of the paper is organized as follows. In Section 2, we introduce some basic notations and definitions for later discussion. In Section 3, we discuss the related research and put forward the question studied in this paper. In Section 4, we present our main results. In Section 5, we give the proofs of the main results stated in Section 4. In Section 6, we compare our results with the existing work on multiple kernel learning in [2] and coregularized kernel learning in [11]. Section 7 concludes this paper.

Notations and Definitions
In this section, we introduce notations and definitions for later discussions:
(i) Let $\mathbb{N}$ be the set of natural numbers and $\mathbb{R}$ be the set of real numbers. Let $\mathbb{N}_n = \{1, 2, \ldots, n\}$, $n \in \mathbb{N}$.
(ii) Let $(\Omega, \mathcal{A}, P)$ be a probability space; that is, $\Omega$ is the sample space, $\mathcal{A}$ is a $\sigma$-algebra on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{A})$. The sample space has the structure $\Omega = X \times Y$ with $Y \subset \mathbb{R}$, where $X$ and $Y$ are the input space and output space, respectively. Denote by $P_X$ the marginal distribution on $X$. To avoid measure-theoretic details, we simply write $(\Omega, P)$ for $(\Omega, \mathcal{A}, P)$.
(iii) Let $\mathcal{F}$ be the set of all measurable functions $f: X \to Y$. A subset $\mathcal{H} \subset \mathcal{F}$ is called the hypothesis class.
(iv) Let $S = \{z_i = (x_i, y_i), i \in \mathbb{N}_n\}$ be a finite set of labeled training samples, and assume these samples are independent and identically distributed (i.i.d.) according to $P$. Bold letters denote vectors; for example, $\mathbf{z}$ denotes the vector $(z_1, z_2, \ldots, z_n)$.
(v) For the sign $|\cdot|$: if $D$ is a set, $|D|$ is the number of elements of $D$; if $D$ is a function, $|D|$ is the absolute value of $D$.
(vi) If $A$ is a matrix, $A^T$ denotes the transpose of $A$.
The loss of $f$ on a sample point $(x, y)$ is written $L(f(x), y)$. In learning theory, one goal is to pick a function $f$ in the hypothesis space $\mathcal{H}$ that minimizes the generalization error $E_P[L(f(X), Y)]$. In general, the distribution $P$ is unknown. Rather than minimizing $E_P[L(f(X), Y)]$, we usually minimize the empirical (training) error computed on $S$, the finite set of labeled samples. In this paper, the main quantity of interest is the uniform estimate, over the hypothesis class, of the difference between the generalization error and the empirical error. For the discussion in the later sections, we introduce the following four definitions and one lemma (Definition 4 is proposed by us).
Definition 1 (Empirical Rademacher Complexity) [2]. Let $\mathcal{H}$ be a class of functions $f: \Omega \to \mathbb{R}$. The samples $x_i$, $i \in \mathbb{N}_n$, are independently drawn from the probability space $(\Omega, P)$. The empirical Rademacher complexity of $\mathcal{H}$ is defined with respect to Rademacher variables $\sigma_i$, $i \in \mathbb{N}_n$, where $\boldsymbol{\sigma}$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_n)$.
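Definition 1 can be made concrete on a toy example. The sketch below assumes one common convention, $\hat{R}_n(\mathcal{H}) = E_{\boldsymbol{\sigma}}[\sup_{f \in \mathcal{H}} \frac{1}{n} \sum_{i \in \mathbb{N}_n} \sigma_i f(x_i)]$; some authors add an absolute value or a factor of 2, and the normalization used in [2] may differ. It also assumes a finite class represented by its values on the sample. The function name `empirical_rademacher` is hypothetical. For small $n$ the expectation is computed exactly by enumerating all $2^n$ sign vectors.

```python
from itertools import product


def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite function class.

    function_values holds one row per function f: (f(x_1), ..., f(x_n)).
    Assumed convention: E_sigma[ sup_f (1/n) * sum_i sigma_i * f(x_i) ].
    """
    n = len(function_values[0])
    total = 0.0
    # Enumerate every sign vector sigma in {-1, +1}^n (feasible for small n).
    for sigma in product((-1, 1), repeat=n):
        total += max(
            sum(s * v for s, v in zip(sigma, row)) for row in function_values
        ) / n
    return total / 2 ** n


# A single function cannot correlate with random signs on average.
single = empirical_rademacher([[1.0, 1.0]])
# Adding the negated function lets the class follow the sign pattern.
symmetric = empirical_rademacher([[1.0, 1.0], [-1.0, -1.0]])
```

Under this convention a richer class can only increase the value, which is why the quantity serves as a capacity measure in the bounds discussed later.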
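Definition 2 (Empirical Rademacher Chaos Complexity) [2]. Let $\mathcal{H}$ be a class of functions $f: \Omega \times \Omega \to \mathbb{R}$. The samples $x_i$, $i \in \mathbb{N}_n$, are independently drawn from the probability space $(\Omega, P)$. The empirical Rademacher chaos complexity of $\mathcal{H}$ is defined with respect to Rademacher variables $\sigma_i$, $i \in \mathbb{N}_n$, where $\boldsymbol{\sigma}$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_n)$.

Definition 3 (Reproducing Kernel Hilbert Space, RKHS) [14]. The function $K: X \times X \to \mathbb{R}$ is a reproducing kernel of the Hilbert space $\mathcal{H}_K$ if and only if (a) for every $x \in X$, $K(x) = K(x, \cdot) \in \mathcal{H}_K$, and (b) for every $x \in X$ and every $f \in \mathcal{H}_K$, $\langle f, K(x) \rangle = f(x)$.
A Hilbert space of functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space.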
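Definition 2 can be illustrated in the same spirit. The sketch below assumes the form commonly associated with [2], $\hat{U}_n(\mathcal{K}) = E_{\boldsymbol{\sigma}}[\sup_{K \in \mathcal{K}} |\frac{1}{n} \sum_{i<j} \sigma_i \sigma_j K(x_i, x_j)|]$, applied to a finite class of Gaussian kernels on real-valued points. Both the formula's normalization and the helper names `gaussian_kernel` and `empirical_chaos_complexity` are assumptions made for illustration only.

```python
import math
from itertools import product


def gaussian_kernel(width):
    """Gaussian kernel on the real line with the given bandwidth."""
    def k(x, y):
        return math.exp(-((x - y) ** 2) / (2.0 * width ** 2))
    return k


def empirical_chaos_complexity(kernels, xs):
    """Exact value of E_sigma[ sup_K |(1/n) sum_{i<j} s_i s_j K(x_i, x_j)| ]."""
    n = len(xs)
    grams = [[[k(xs[i], xs[j]) for j in range(n)] for i in range(n)]
             for k in kernels]
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        best = 0.0
        for g in grams:
            chaos = sum(sigma[i] * sigma[j] * g[i][j]
                        for i in range(n) for j in range(i + 1, n)) / n
            best = max(best, abs(chaos))
        total += best
    return total / 2 ** n


points = [0.0, 1.0]
one_kernel = empirical_chaos_complexity([gaussian_kernel(1.0)], points)
two_kernels = empirical_chaos_complexity(
    [gaussian_kernel(1.0), gaussian_kernel(2.0)], points)
```

With two points only the pair $(x_1, x_2)$ contributes, so the value for a single kernel is $K(x_1, x_2)/2$; enlarging the kernel class cannot decrease the complexity.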

Remark 1.
The second condition in Definition 3 is called "the reproducing property": the value of the function $f$ at the point $x$ is reproduced by the inner product of $f$ with $K(x)$. From the two conditions, for any $(x, y) \in X \times X$, it is clear that $K(x, y) = \langle K(x), K(y) \rangle$.
In real applications, the solution to many reproducing kernel Hilbert space optimization problems is contained in a special subspace of the reproducing kernel Hilbert space, often known as the "span of the data". The span of the data for a reproducing kernel Hilbert space $\mathcal{H}_K$ is the linear subspace spanned by the functions $K(x_i)$ at the sample points $x_i$ (Appendix A.3 on page 75 of [13]). For simplicity, we denote this linear subspace by $L_K$.
Computational Intelligence and Neuroscience
Let $\mathcal{K}$ be a class of kernels. In this paper, we assume that $\kappa \triangleq \sup_{K \in \mathcal{K}, x \in X} \sqrt{K(x, x)} < \infty$.
Definition 4 (Subnormalized Functional with Degenerate Dimension $n$). If a loss function $\ell$ : then we call $\ell$ a subnormalized functional with degenerate dimension $n$ on $(\prod_{i \in \mathbb{N}_n} \mathcal{H}_{K_i}) \times Y$.
Lemma 1 [13]. Let $\mathcal{H}$ be a reproducing kernel Hilbert space with kernel $K: X \times X \to \mathbb{R}$, and consider any point $x \in X$. If $L_K$ is a closed subspace containing $K(x)$, then the projection of $f$ onto $L_K$ has the same value at $x$ as $f$ does.
For multiple kernel learning, the main task is to automatically choose a kernel $K$ from a predefined class $\mathcal{K}$ of kernels and to find the function in the reproducing kernel Hilbert space $\mathcal{H}_K$ that is most suitable to model the given samples.
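Lemma 1 can be checked numerically in the simplest reproducing kernel Hilbert space. The sketch below assumes the linear kernel $K(x, y) = \langle x, y \rangle$ on $\mathbb{R}^3$, for which every function has the form $f_w(x) = \langle w, x \rangle$, the inner product of $f_w$ and $f_v$ is $\langle w, v \rangle$, and $K(x)$ corresponds to the vector $x$ itself. The helper names are hypothetical.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_span(w, x):
    """Orthogonal projection of w onto the line spanned by x (x nonzero)."""
    scale = dot(w, x) / dot(x, x)
    return [scale * xi for xi in x]


w = [2.0, -1.0, 0.5]   # represents f(.) = <w, .>
x = [1.0, 3.0, -2.0]   # represents K(x) = <x, .>

w_proj = project_onto_span(w, x)  # projection of f onto span{K(x)}
value_f = dot(w, x)               # f(x), via the reproducing property
value_proj = dot(w_proj, x)       # value of the projected function at x
```

The projected function differs from $f$ as an element of the space, yet both take the same value at $x$. This is the reason a minimizer can be sought in the span of the data without changing its values at the sample points.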
In this paper, our purpose is to minimize the coregularized objective of Equation (11) over pairs of functions $(f_1, f_2)$ drawn from the candidate reproducing kernel Hilbert spaces. Here, $\mathcal{K}$ and $\mathcal{K}'$ are classes of kernels, and $\mathcal{H}_{K_1}$ and $\mathcal{H}_{K_2}$ denote the reproducing kernel Hilbert spaces. $m$ is the number of labeled points $(x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, and $u$ is the number of unlabeled points $x_{m+i} \in X$, $i \in \mathbb{N}_u$. The symbols $c_1$, $c_2$, and $\lambda$ are the regularization parameters. The functions $f_1$ and $f_2$ represent the two views, and the function $\ell(f_1(\cdot), f_2(\cdot), \cdot)$ is the labeled loss function, which measures the performance of $f_1$ and $f_2$ on the labeled points $z_i$, $i \in \mathbb{N}_m$.
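To make the roles of $c_1$, $c_2$, and $\lambda$ tangible, the following sketch evaluates one plausible coregularized objective for a fixed pair of kernels. Several choices are assumptions and are not taken from the paper: the labeled loss is the squared loss of the averaged predictor $(f_1 + f_2)/2$, the disagreement term is an unnormalized sum over the unlabeled points, and each function is written in the span of the data as $f_1 = \sum_i \alpha_i K_1(x_i)$ and $f_2 = \sum_i \beta_i K_2(x_i)$, so that squared norms become quadratic forms in the Gram matrices. All function names are hypothetical.

```python
def mat_vec(K, v):
    """Product of a matrix (list of rows) with a vector."""
    return [sum(kij * vj for kij, vj in zip(row, v)) for row in K]


def quad_form(K, v):
    """v^T K v, the squared RKHS norm of sum_i v_i K(x_i)."""
    return sum(vi * kv for vi, kv in zip(v, mat_vec(K, v)))


def coregularized_objective(K1, K2, alpha, beta, y, c1, c2, lam):
    """Labeled loss + norm penalties + disagreement on unlabeled points.

    The first len(y) sample points are labeled; the rest are unlabeled.
    """
    m = len(y)
    f1 = mat_vec(K1, alpha)  # values of f_1 at all m + u points
    f2 = mat_vec(K2, beta)   # values of f_2 at all m + u points
    labeled_loss = sum(
        ((f1[i] + f2[i]) / 2.0 - y[i]) ** 2 for i in range(m)) / m
    complexity = c1 * quad_form(K1, alpha) + c2 * quad_form(K2, beta)
    disagreement = lam * sum(
        (f1[j] - f2[j]) ** 2 for j in range(m, len(f1)))
    return labeled_loss + complexity + disagreement


eye3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
eye2 = [[1.0, 0.0], [0.0, 1.0]]

# Two labeled points and one unlabeled point on which the views disagree.
q_semi = coregularized_objective(
    eye3, eye3, [1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0], 0.1, 0.1, 0.5)
# With no unlabeled points (u = 0) the disagreement term vanishes.
q_sup = coregularized_objective(
    eye2, eye2, [1.0, 0.0], [1.0, 0.0], [1.0, 0.0], 0.1, 0.1, 0.5)
```

In a multiple kernel setting this evaluation would sit inside an outer search over $K_1 \in \mathcal{K}$ and $K_2 \in \mathcal{K}'$; the sketch covers only the inner objective for one fixed pair.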

Related Work
In [11], Rosenberg and Bartlett used Rademacher complexity to bound the coregularized kernel class in the semisupervised two-view learning framework, where the two views are two predefined reproducing kernel Hilbert spaces ($\mathcal{H}_1$ and $\mathcal{H}_2$, respectively). Take labeled points $z_i = (x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, and unlabeled points $x_{m+i} \in X$, $i \in \mathbb{N}_u$. The coregularized least squares algorithm discussed in [11] is described by Equation (12), where $m$ is the number of labeled points and $u$ is the number of unlabeled points. The symbols $c_1$, $c_2$, and $\lambda$ are the regularization parameters. The functions $f_1$ and $f_2$ represent the two views, and the function $\ell(f_1(\cdot), f_2(\cdot), \cdot)$ is the labeled loss function, which measures the performance of $f_1$ and $f_2$ on the labeled points $z_i$, $i \in \mathbb{N}_m$. In [11], the final output is denoted as $(f_{1,z} + f_{2,z})/2$.
In [2], Ying and Campbell applied the Rademacher chaos complexity to study the generalization of multiple kernel learning in the supervised learning framework. The multiple kernel learning model they considered is given in Equation (13), where $m$ is the number of labeled points, $\lambda$ is the regularization parameter, and $\mathcal{H}_K$ denotes the reproducing kernel Hilbert space. In Equation (13), $\mathcal{H}_K$ is not predefined and depends on the kernel chosen from the class $\mathcal{K}$ of kernels.
In this paper, we are interested in coregularized multiple kernel learning; that is, the two reproducing kernel Hilbert spaces are not defined in advance. Our discussion is set in the framework of semisupervised multiview learning. We formulate this learning problem as the two-layer minimization in Equation (14), where $\mathcal{K}$ and $\mathcal{K}'$ are classes of kernels, and $\mathcal{H}_{K_1}$ and $\mathcal{H}_{K_2}$ denote the reproducing kernel Hilbert spaces. $m$ is the number of labeled points $(x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, and $u$ is the number of unlabeled points $x_{m+i} \in X$, $i \in \mathbb{N}_u$. The symbols $c_1$, $c_2$, and $\lambda$ are the regularization parameters. The functions $f_1$ and $f_2$ represent the two views, and the function $\ell(f_1(\cdot), f_2(\cdot), \cdot)$ is the labeled loss function, which measures the performance of $f_1$ and $f_2$ on the labeled points $z_i$, $i \in \mathbb{N}_m$.
Remark 2. We explain the following:
(1) Equation (14) given in this paper is different from Equation (12): (a) The solution of Equation (14) minimizes over the class $(\bigcup_{K_1 \in \mathcal{K}} \mathcal{H}_{K_1}) \times (\bigcup_{K_2 \in \mathcal{K}'} \mathcal{H}_{K_2})$, while the solution of Equation (12) minimizes over the class $\mathcal{H}_1 \times \mathcal{H}_2$. (b) Equation (14) proceeds through two minimization steps: first, it finds the most suitable reproducing kernels for the given samples; second, it obtains the best function/model from the reproducing kernel Hilbert spaces found in the first step. Equation (12), in contrast, only needs to find the best function/model in reproducing kernel Hilbert spaces that are predefined.
(2) Equation (14) given in this paper is different from Equation (13): (a) The solution of Equation (14) minimizes over the product space $(\bigcup_{K_1 \in \mathcal{K}} \mathcal{H}_{K_1}) \times (\bigcup_{K_2 \in \mathcal{K}'} \mathcal{H}_{K_2})$, while the solution of Equation (13) minimizes over the space $\bigcup_{K_1 \in \mathcal{K}} \mathcal{H}_{K_1}$. (b) The minimization term in Equation (13) is much simpler because the analysis of Equation (13) is limited to single-view supervised learning and does not deal with unlabeled samples.
From the above discussion, we can see that the generalization analysis of Equation (14) is both more meaningful and more challenging.
In the next section, we will present the main results of this paper.

Generalization Bounds
In this section, we assume that the loss function $\ell$ in Equation (11) is a subnormalized functional in the sense of Definition 4. Note that $Q(f_1, f_2) \ge 0$; then, for any samples $(x_i, y_i)$, $i \in \mathbb{N}_m$, and $x_{m+i}$, $i \in \mathbb{N}_u$, the class of candidate reproducing kernel Hilbert spaces is defined as in Equation (16). That is, the solution $(f_{1,z}, f_{2,z})$ minimizing $Q(f_1, f_2)$ belongs to the class $\mathcal{H}_{\mathcal{K}} \times \mathcal{H}_{\mathcal{K}'}$. For simplicity, we use $\mathcal{H}_{\mathcal{K},\mathcal{K}'}$ to denote $\mathcal{H}_{\mathcal{K}} \times \mathcal{H}_{\mathcal{K}'}$ in the next sections.
Following the assumption in [11], the final predictor for coregularized multiple kernel learning is selected from the class defined in Equation (17).
Remark 3. From Equations (16) and (17), we can see that the two classes defined there mainly depend on the prescribed set of candidate kernels and the unlabeled data. For any $f \in \mathcal{H}_{\mathcal{K},\mathcal{K}'}$, we define the expected loss as in Equation (2), and we can use the loss function $L$ to compute the labeled empirical loss in Equation (11). For the given samples $(x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, the loss can also be computed as in Equation (3).
If $L$ satisfies the Lipschitz continuous condition on $\mathcal{H}_{\mathcal{K},\mathcal{K}'}$, we introduce the constant defined in Equation (18) and the local Lipschitz constant $L^{\mathrm{loc}}_{\sup}$ defined in Equation (19). At the end of this section, we give the main theorems of this paper.
Theorem 1. Let the function $L$ be a subnormalized functional with degenerate dimension 1 on $\mathcal{H}_{\mathcal{K},\mathcal{K}'} \times Y$ and satisfy the Lipschitz continuous condition on $\mathcal{H}_{\mathcal{K},\mathcal{K}'}$. Let the labeled samples $z_i = (x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, be independent random variables drawn from the probability space $(X \times Y, P)$, and the unlabeled samples $x_{m+i} \in X$, $i \in \mathbb{N}_u$, be independent random variables drawn from the probability space $(X, P_X)$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, for any $f \in \mathcal{H}_{\mathcal{K},\mathcal{K}'}$, the following inequality holds, where $\boldsymbol{\sigma}$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_m)$, $\sigma_i$, $i \in \mathbb{N}_m$, are Rademacher variables, and $O$ is an orthogonal matrix.
Corollary 1. Under the assumption of Theorem 1, we have the following inequality:
Corollary 2. Under the assumption of Theorem 1, and assuming $\mathcal{K} = \mathcal{K}'$ and $c_1 = c_2$, the following inequality holds
Remark 4. We will prove Theorem 1 and Corollaries 1 and 2 in Section 5. In Section 6, we will show that Theorem 1 and Corollary 1 extend the results in [11] and [2], respectively.

Proofs
In this section, we prove Theorem 1 and Corollaries 1 and 2 stated in Section 4. In preparation, we first give the following two lemmas (Lemmas 2 and 3).

Lemma 2.
Let the function $L$ be a subnormalized functional with degenerate dimension 1 on $\mathcal{H}_{\mathcal{K},\mathcal{K}'} \times Y$ and satisfy the Lipschitz continuous condition on $\mathcal{H}_{\mathcal{K},\mathcal{K}'}$. Let the labeled samples $z_i$, $i \in \mathbb{N}_m$, be independently drawn from the probability space $(X \times Y, P)$ and the unlabeled samples $x_{m+i} \in X$, $i \in \mathbb{N}_u$, be independently drawn from the probability space $(X, P_X)$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have the following inequality: where $\boldsymbol{\sigma}$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_m)$ and $\sigma_i$, $i \in \mathbb{N}_m$, are Rademacher variables.
Proof. For simplicity, let Replacing the ith sample (x i , y i ) in the labeled samples with ( 2m .
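Replacing a single labeled sample, as in the proof above, is the usual setup for a bounded-differences argument. As a reminder, and assuming that McDiarmid's inequality is the tool intended here (the displayed steps are not recoverable from this copy), the inequality can be stated as follows. The symbols $g$, $c_i$, and $\epsilon$ are notation introduced only for this sketch, and the choice $c_i = 1/m$ assumes a loss bounded by 1.

```latex
% McDiarmid's bounded-differences inequality for independent Z_1, ..., Z_m.
\text{If } \sup_{z_1, \ldots, z_m, z_i'}
  \bigl| g(z_1, \ldots, z_i, \ldots, z_m) - g(z_1, \ldots, z_i', \ldots, z_m) \bigr|
  \le c_i \quad \text{for all } i \in \mathbb{N}_m, \text{ then}
\[
  P\bigl( g(Z_1, \ldots, Z_m) - E[g(Z_1, \ldots, Z_m)] \ge \epsilon \bigr)
  \le \exp\!\left( - \frac{2 \epsilon^2}{\sum_{i \in \mathbb{N}_m} c_i^2} \right).
\]
% With c_i = 1/m, setting the right-hand side equal to \delta gives,
% with probability at least 1 - \delta,
\[
  g(Z_1, \ldots, Z_m) \le E[g(Z_1, \ldots, Z_m)] + \sqrt{\frac{\ln(1/\delta)}{2m}}.
\]
```

The deviation term decays at rate $1/\sqrt{m}$ in the number of labeled samples, independently of the number of unlabeled samples.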
Proof. As defined in Equation (19), $L^{\mathrm{loc}}_{\sup}$ is the local Lipschitz constant of $L$. By the contraction property of Rademacher complexity (Lemma 26.9 on page 331 of [16] and Theorem 7 of [17]), the result follows.
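The contraction property invoked in the proof above can be checked on a small example. The sketch below assumes the version without absolute values: for a $\rho$-Lipschitz function $\phi$ applied coordinatewise to a set $A$ of output vectors, $R(\phi \circ A) \le \rho R(A)$. The hinge function, which is 1-Lipschitz, serves as $\phi$; the output vectors and helper names are arbitrary illustrative choices.

```python
from itertools import product


def rademacher_of_vectors(vectors):
    """Exact E_sigma[ sup_a (1/n) * sum_i sigma_i * a_i ] over a finite set."""
    n = len(vectors[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum(s * a for s, a in zip(sigma, vec)) for vec in vectors)
    return total / (n * 2 ** n)


def hinge(t):
    """Hinge function max(0, 1 - t); Lipschitz with constant 1."""
    return max(0.0, 1.0 - t)


# Three hypothetical predictors evaluated on three sample points.
outputs = [[0.5, -1.0, 2.0], [-0.3, 0.8, 0.1], [1.5, 1.5, -0.7]]
losses = [[hinge(t) for t in vec] for vec in outputs]

complexity_outputs = rademacher_of_vectors(outputs)
complexity_losses = rademacher_of_vectors(losses)
lipschitz_constant = 1.0
```

Composing with the loss never increases the complexity by more than the Lipschitz factor, which is how a bound on the function class transfers to the loss class.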
Lemma 4. If Equation (11) has a solution, then, for a fixed $K_1 \in \mathcal{K}$ and a fixed $K_2 \in \mathcal{K}'$, it has a solution $(f_{1,z}, f_{2,z})$ of the form $f_{1,z} = \sum_{i \in \mathbb{N}_{m+u}} \alpha_i K_1(x_i)$ and $f_{2,z} = \sum_{i \in \mathbb{N}_{m+u}} \beta_i K_2(x_i)$ for some $\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_{m+u})^T \in \mathbb{R}^{m+u}$ and $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_{m+u})^T \in \mathbb{R}^{m+u}$. That is, each component of the solution belongs to the span of the data of the corresponding reproducing kernel Hilbert space.
Proof. The result follows in a similar way to Proposition 2.3.1 in [11]. □
Lemma 5. Let the labeled samples $z_i = (x_i, y_i) \in X \times Y$, $i \in \mathbb{N}_m$, be independent random variables drawn from the probability space $(X \times Y, P)$ and the unlabeled samples $x_{m+i} \in X$, $i \in \mathbb{N}_u$, be independent random variables drawn from the probability space $(X, P_X)$. Then inequality (35) holds, where $\boldsymbol{\sigma}$ denotes the vector $(\sigma_1, \sigma_2, \ldots, \sigma_m)$, $\sigma_i$, $i \in \mathbb{N}_m$, are Rademacher variables, and $O$ is an orthogonal matrix.
Proof. As defined in Equation (17), we can rewrite the left-hand side of inequality (35) in terms of the class $\mathcal{H}_{\mathcal{K},\mathcal{K}'}$ defined in Equation (16). The assumptions, together with the reproducing property and Lemma 1 applied for any $K_1 \in \mathcal{K}$, $K_2 \in \mathcal{K}'$, and $i \in \mathbb{N}_{m+u}$, give a sequence of relations; combining Equations (39), (40), (41), and (42) yields an intermediate bound. By Lemma 4, for any $K_1 \in \mathcal{K}$ and $K_2 \in \mathcal{K}'$, the problem can be converted to Euclidean space (this is similar to Section 5.2.1 in [11]), which leads to Equation (47), with the matrix $\Lambda$ given in Equation (48). Note that the matrix $\Lambda$ is not full rank; by using steps similar to those in [11], we can rewrite Equation (47). In the rewritten form, the matrices $E_{K_1}$ and $E_{K_2}$ are diagonal matrices containing the nonzero eigenvalues, and we write the projections of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ onto the column spaces of $K_1$ and $K_2$ as $V \cdot a$ and $W \cdot b$, respectively.
Next, we relate Equation (47) to the Rademacher chaos complexity through two equations, the first of which can be easily obtained from the discussion in Section 5.2.4 of [11].
Combining Corollary 1 with Equations (60) and (61), we recover the main result in [2]. Therefore, the main result in [2] is a special case of Corollary 1.

Remark 5.
For the discussion in Section 3, if we set $u = 0$ and $c_1 = c_2 = \lambda$ and combine this with Equation (17), then Equation (14) reduces to Equation (13). Furthermore, letting $|\mathcal{K}| = |\mathcal{K}'| = 1$ and $\mathcal{K} = \mathcal{K}'$ gives inequality (63). Substituting inequality (63) into inequality (62), we obtain the generalization bound for single kernel learning in the framework of supervised learning.

Figure 1: Semisupervised learning is regarded as supervised learning when $u = 0$, and a class $\mathcal{K}$ containing a single kernel is regarded as a special case of multiple kernel learning. The scope of the discussion in [2] is the intersection of the green and blue ellipses, the scope of the discussion in [11] is the yellow ellipse, and the scope of the discussion in this paper is the blue ellipse.

(2) For coregularized kernel learning in semisupervised learning.
If we let $K_1 \in \mathcal{K}$, $|\mathcal{K}| = 1$, $K_2 \in \mathcal{K}'$, and $|\mathcal{K}'| = 1$, we obtain Equation (65). Note that Equation (65) is the same as the supremum evaluation in Section 5.2.2 of [11]. So the main result in [11] is recovered from Theorem 1; that is, it can be regarded as a special case of Theorem 1.
In Figure 1, we show the relations between the main results in this paper and the results in [2] and [11].

Conclusion
In this paper, based on semisupervised two-view learning, we study the generalization bound of coregularized multiple kernel learning. The main research tool is Rademacher chaos complexity, which we use to control the performance of the candidate class of coregularized multiple kernels. We mainly combine the work in [2] and [11] to discuss the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework. First, we discuss the differences between the learning problem proposed here and the learning problems in [2] and [11]. Then, we analyze the generalization bound of coregularized multiple kernel learning in the semisupervised multiview learning framework. Finally, we show that the existing results in [2] and [11] can be regarded as special cases of our main results.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.