Generalization Bounds Derived IPM-Based Regularization for Domain Adaptation

Domain adaptation has received much attention as a major form of transfer learning. One issue that must be considered in domain adaptation is the gap between the source domain and the target domain. In order to improve the generalization ability of domain adaptation methods, we propose a framework for domain adaptation combining source and target data, with a new regularizer that takes generalization bounds into account. This regularization term uses the integral probability metric (IPM) as the distance between the source domain and the target domain and can thus bound the testing error of an existing predictor. Since the computation of the IPM involves only the two distributions, this regularization term is independent of specific classifiers. With popular learning models, the empirical risk minimization is expressed as a general convex optimization problem and can thus be solved effectively by existing tools. Empirical studies on synthetic data for regression and real-world data for classification show the effectiveness of this method.


Introduction
Generalization ability is a main concern of statistical learning theory [1]. Improving prediction accuracy under the empirical risk minimization (ERM) principle is of practical importance, since ERM-based learning is widely used. As an important technique for improving generalization ability and avoiding overfitting, regularization plays a crucial role in maintaining the trade-off between the empirical loss and the expected risk. Different regularizers may yield different performance, and the choice depends on the specific purpose.
Traditional supervised learning needs a large amount of labeled data to train a precise model, and annotating large amounts of unlabeled data is well known to be both labour and time consuming. Another underlying assumption is that training data and testing data, although provided separately, are drawn from the same distribution, so that a model trained on the former can predict the labels of the latter. In real situations, however, the available labeled data often come from different sources and differ from the data we need to predict. In other words, labeled data from the target domain are not always accessible or sufficient. As a consequence, the provided labeled data cannot be used directly to train predictors for the target data.
As an efficient way to utilize a small number of labeled data, or even unlabeled data from other sources, domain adaptation has attracted more attention in recent years [2][3][4]. Patterns from the source domain and the target domain are utilized to acquire better predictive ability on target data. Learning from multiple source domains [5] and combining source and target domains [6] are popular methods proposed in recent years. Along with some successful applications of domain adaptation, several works have focused on the learning ability of this paradigm. Specifically, [7] studies the generalization bounds of domain adaptation, in which the integral probability metric (IPM) [8] is chosen to measure the distance between the source domain and the target domain. A natural question is how to combine these theoretical results with practical algorithm design so as to create more efficient learning algorithms.
In this paper, we propose a framework for domain adaptation combining source and target data, taking the IPM as the regularization term. Since the IPM is defined as a supremum, over a function class, of the gap between two distributions (source domain and target domain), the regularization term is independent of specific predictors. In other words, many popular learning models can be used under such a framework. In many cases, the empirical risk minimization problems can be solved efficiently as convex optimization problems in reasonable time.
The remainder of this paper is organized as follows. Section 2 reviews related work on the theoretical analysis of domain adaptation problems and on a regularized domain adaptation framework. Section 3 introduces the problem setup and the derived IPM-based generalization bounds. We propose the framework in Section 4 and report the experimental results on regression and classification in Section 5. Section 6 concludes this paper.

Related Works
Many works have focused on the theoretical analysis of domain adaptation. Generally speaking, the generalization performance is measured by the size of the training set, the complexity of the function class, and several constants. Specifically for domain adaptation, one also needs to measure the divergence between different distributions. For the complexity measurement of the function class, the VC-dimension is widely used in traditional learning models as well as in domain adaptation [4,6,9]. Besides the VC-dimension, the covering number and the Rademacher complexity are also used to measure the function class in generalization bounds of domain adaptation [5,7]. In terms of the measurement of different distributions, the H-divergence is used in [4,6]; the same concept is called the A-distance in [9] and is derived from [10]. It is defined as an upper bound on the difference between two probability distributions, which is straightforward for classification. Both [5] and [7] introduce different quantities for more general tasks including regression, while the latter further takes the labeling function into consideration.
One significant purpose of theoretical analysis is to provide guidance for designing new algorithms. Most of the above works, however, derive generalization bounds of domain adaptation in order to establish important properties of the learning process, such as convergence rate, effectiveness, and correctness.
In terms of regularized domain adaptation, a framework called the domain adaptation machine (DAM) [11,12] describes a data-dependent regularizer, which is based on a smoothness assumption and on the relevance between the source domain and the target domain. The framework is similar to our method in some respects, but its definition and optimization are different. DAM mainly addresses domain adaptation from multiple sources, whereas we consider domain adaptation combining source (including multiple sources) and target data, which leads to a different empirical loss as well as a different regularizer. Nevertheless, the regularizer in DAM is closely connected with ours, and the details can be found in the later discussion.

Domain Adaptation
Traditional supervised learning aims to learn a function $f: \mathcal{X}^{(T)} \to \mathcal{Y}^{(T)}$ for labeling unseen samples in $\mathcal{D}^{(T)}$. In the domain adaptation set-up, $\mathcal{D}^{(T)}$ is hard to estimate directly because $\mathcal{X}^{(T)}$ is insufficient. With considerable amounts of $\mathcal{X}^{(S)}$ and $\mathcal{Y}^{(S)}$, the empirical risk minimization over the loss function $\ell$ with parameter vector $w$, using the $N_S$ labeled source samples, can be expressed as follows:
$$\min_{w} \; \mathrm{E}^{(S)}_{N_S} \ell(w) = \min_{w} \; \frac{1}{N_S} \sum_{n=1}^{N_S} \ell\bigl(w; x^{(S)}_n, y^{(S)}_n\bigr), \tag{1}$$
where $\mathrm{E}^{(S)}_{N_S}$ is the empirical estimate of the expectation $\mathrm{E}^{(S)}$ taken with respect to the distribution of $\mathcal{Z}^{(S)}$. In order to utilize more information from the target domain, the available target samples should also be used. Given $\tau \in [0, 1)$, domain adaptation combining source and target data is defined to minimize the empirical risk [4]:
$$\min_{w} \; \tau\, \mathrm{E}^{(T)}_{N_T} \ell(w) + (1 - \tau)\, \mathrm{E}^{(S)}_{N_S} \ell(w), \tag{2}$$
where $\tau$ controls the trade-off between learning from source data and target data.
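As a minimal sketch of the weighted objective described above, the following Python snippet computes a convex combination of the target and source empirical risks for a linear model. The squared loss, the function names, and the toy data are assumptions made only for illustration; the weighting (trade-off parameter on the target risk, its complement on the source risk) is chosen so that a value of zero reduces to learning from the source alone, as the text states later.

```python
def squared_loss(w, x, y):
    """Squared loss of the linear predictor w on one sample (x, y)."""
    pred = sum(wj * xj for wj, xj in zip(w, x))
    return (pred - y) ** 2


def empirical_risk(w, X, y):
    """Average loss of w over a sample set."""
    return sum(squared_loss(w, xi, yi) for xi, yi in zip(X, y)) / len(X)


def combined_risk(w, Xs, ys, Xt, yt, tau):
    """tau * target risk + (1 - tau) * source risk, with tau in [0, 1)."""
    if not 0.0 <= tau < 1.0:
        raise ValueError("tau must lie in [0, 1)")
    target = empirical_risk(w, Xt, yt)
    source = empirical_risk(w, Xs, ys)
    return tau * target + (1.0 - tau) * source


# Hypothetical toy data: several source samples, few target samples.
Xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.0, 1.0, 2.0, 3.0]   # source follows y = x
Xt = [[1.0], [2.0]]
yt = [1.5, 3.0]             # target follows y = 1.5 x
w = [1.0]

source_only = combined_risk(w, Xs, ys, Xt, yt, tau=0.0)
mixed = combined_risk(w, Xs, ys, Xt, yt, tau=0.5)
```

With `tau=0.0` the target samples have no influence, so a predictor that fits the source perfectly has zero risk even though it is biased on the target; raising `tau` exposes that bias.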

Integral Probability Metric.
In domain adaptation, it is important to find a quantity measuring the difference of the distributions between the source and the target domains.
In this paper, we use the integral probability metric (IPM) to measure the difference between two distributions. This quantity is defined as the distance between the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$ under a function class $\mathcal{F} \subset \mathbb{R}^{\mathcal{Z}}$:
$$D_{\mathcal{F}}(S, T) := \sup_{f \in \mathcal{F}} \bigl| \mathrm{E}^{(S)} f - \mathrm{E}^{(T)} f \bigr|. \tag{3}$$
The quantity $D_{\mathcal{F}}(S, T)$ measures the difference between the two probability distributions. If the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$ have the same probability distribution, $D_{\mathcal{F}}(S, T)$ is equal to zero. Assuming there are $N_S$ samples drawn from the source domain and $N_T$ samples from the target domain, the expectations $\mathrm{E}^{(S)} f$ and $\mathrm{E}^{(T)} f$ can be roughly estimated from these samples; thus $D_{\mathcal{F}}(S, T)$ can be approximated by empirical averages over the given data. However, the target samples are not enough to learn a predictor, that is, $N_T \ll N_S$; domain adaptation therefore minimizes the convex combination of the source and the target empirical risks, for $\tau \in [0, 1)$,
$$\mathrm{E}_{\tau} f := \tau\, \mathrm{E}^{(T)}_{N_T} f + (1 - \tau)\, \mathrm{E}^{(S)}_{N_S} f. \tag{4}$$
When $\tau = 0$, it provides the learning process of basic domain adaptation with a single source.
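To illustrate how the IPM can be approximated from samples, the sketch below replaces the expectations by sample means and takes the largest absolute gap over a small, finite function class. The finite class, the function names, and the toy samples are assumptions for illustration only; in practice the function class is usually infinite and the supremum is obtained in closed form or by optimization.

```python
def empirical_ipm(functions, source, target):
    """Largest absolute gap between the sample means of f on source and target."""
    def mean(f, samples):
        return sum(f(z) for z in samples) / len(samples)

    return max(abs(mean(f, source) - mean(f, target)) for f in functions)


# Hypothetical function class on real-valued samples.
functions = [
    lambda z: z,                        # first moment
    lambda z: z * z,                    # second moment
    lambda z: 1.0 if z > 0 else 0.0,    # indicator of the positive half-line
]

source = [-1.0, 0.0, 1.0, 2.0]
target_same = list(source)                    # identical sample
target_shifted = [z + 1.0 for z in source]    # shifted sample

ipm_same = empirical_ipm(functions, source, target_same)
ipm_shifted = empirical_ipm(functions, source, target_shifted)
```

Identical samples give an estimate of zero, matching the property stated above, while the shifted sample gives a strictly positive value dominated by the second-moment function.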

Generalization Bounds.
The generalization bounds of a learning process need to consider three essential aspects: complexity measure of function class, Hoeffding-type deviation inequality, and symmetrization inequality.
Different from the classical VC-dimension form, Zhang et al. [7] chose the uniform entropy number, which is derived from the concept of the covering number [13], to measure the complexity. The covering number is denoted by $\mathcal{N}(\mathcal{F}, \xi, d)$, where $\mathcal{F}$ is the function class, $d$ is a metric on $\mathcal{F}$, and the covering number of $\mathcal{F}$ at radius $\xi$ with respect to $d$ is the minimum size of a cover of radius $\xi$. The covering number itself is not suitable for domain adaptation. As a variant of the covering number, obtained by setting the metric to $\ell_1(\mathbf{Z})$ on a sample set $\mathbf{Z}$ of size $2(N_S + N_T)$, the uniform entropy number is defined as follows:
$$\ln \mathcal{N}_1\bigl(\mathcal{F}, \xi, 2(N_S + N_T)\bigr) := \sup_{\mathbf{Z}} \ln \mathcal{N}\bigl(\mathcal{F}, \xi, \ell_1(\mathbf{Z})\bigr). \tag{5}$$
The uniform entropy number is distribution-free and can be chosen as the complexity measure of the function class to derive the generalization bounds for domain adaptation.
The Hoeffding-type deviation inequality for domain adaptation is an extension of the classical Hoeffding-type deviation inequality that allows the random variables to take values from different domains. It is assumed that $\mathcal{F}$ is a function class consisting of bounded functions with the range $[a, b]$. A function of the source and target samples is defined, and the inequality holds for any $\tau \in [0, 1)$ and any $\xi > 0$, where the expectation $\mathrm{E}^{(*)}$ is taken on both the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$. The symmetrization inequality for domain adaptation has a discrepancy term $(1 - \tau) D_{\mathcal{F}}(S, T)$ compared to the classical symmetrization result under the assumption of the same distribution. For any $\xi > (1 - \tau) D_{\mathcal{F}}(S, T)$, the probability of the deviation event can be bounded by using the probability of the corresponding symmetrized event, where $\xi' = \xi - (1 - \tau) D_{\mathcal{F}}(S, T)$.
Based on the uniform entropy number, using a specific Hoeffding-type deviation inequality and symmetrization inequality, the generalization bounds of domain adaptation combining source and target data are derived as follows.
Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range $[a, b]$. The derived bound (10) contains the discrepancy term $(1 - \tau) D_{\mathcal{F}}(S, T)$.

IPM-Based Regularization Framework
From formula (10), we can see that the generalization bound of domain adaptation consists of two parts: the integral probability metric (IPM) and an extension of the covering number (the uniform entropy number). Since the IPM is relatively easy to compute when source data and target data are available, it is straightforward to take this term as a regularizer to reduce the generalization error. Besides, it is also intuitive to make full use of target information to construct predictors. For a single source, given data $\mathbf{X} \in \mathbb{R}^{N \times d}$ and the corresponding labels (or target values for regression) $\mathbf{y} \in \mathbb{R}^{N}$, take $w \in \mathbb{R}^{d}$ as the parameters of the model and $\ell(w; \mathbf{x}, y)$ as the loss on a single sample. The general objective function for supervised learning can be written as the following risk minimization problem:
$$\min_{w} \; \frac{1}{N} \sum_{n=1}^{N} \ell(w; \mathbf{x}_n, y_n) + \lambda\, \Omega(w), \tag{11}$$
where $\Omega(w)$ is the regularizer and $\lambda$ is the balancing parameter. Based on the definition of the IPM (3), the empirical risk (4), and the learning principle (11), we formally propose the framework of domain adaptation combining the source and the target data by replacing the regularizer:
$$\min_{w} \; \tau\, \mathrm{E}^{(T)}_{N_T} \ell + (1 - \tau)\, \mathrm{E}^{(S)}_{N_S} \ell + \lambda\, D_{\mathcal{F}}(S, T), \tag{12}$$
where $\mathrm{E}_{N} \ell = (1/N) \sum_{n=1}^{N} \ell(w; \mathbf{x}_n, y_n)$. As shown in [14], the IPM can be empirically estimated by various popular distance metrics by appropriately choosing $\mathcal{F}$. Specifically, in a reproducing kernel Hilbert space (RKHS), the IPM is called the kernel distance or maximum mean discrepancy (MMD) [15]. The empirical estimator of MMD is straightforward:
$$\mathrm{MMD}[\mathcal{F}, S, T] = \Bigl\| \frac{1}{N_S} \sum_{i=1}^{N_S} \phi\bigl(\mathbf{x}^{(S)}_i\bigr) - \frac{1}{N_T} \sum_{j=1}^{N_T} \phi\bigl(\mathbf{x}^{(T)}_j\bigr) \Bigr\|_{\mathcal{H}}, \tag{13}$$
where $\phi: \mathcal{X} \to \mathcal{H}$ is called a feature space mapping function and the inner product of two feature maps defines the kernel, $k(\mathbf{x}^{(S)}, \mathbf{x}^{(T)}) = \langle \phi(\mathbf{x}^{(S)}), \phi(\mathbf{x}^{(T)}) \rangle$.
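The empirical MMD estimator described above can be sketched in a few lines of Python. The sketch assumes the common biased estimator, in which the squared RKHS distance between the two sample means is expanded into averages of kernel evaluations so that the feature map never has to be computed explicitly. The function names, the RBF bandwidth parameter `gamma`, and the toy data are hypothetical.

```python
import math


def linear_kernel(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma is an assumed bandwidth."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_kernel(A, B, kernel):
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))


def mmd_squared(Xs, Xt, kernel):
    """Biased estimate of MMD^2: mean k(S,S) - 2 mean k(S,T) + mean k(T,T)."""
    return (mean_kernel(Xs, Xs, kernel)
            - 2.0 * mean_kernel(Xs, Xt, kernel)
            + mean_kernel(Xt, Xt, kernel))


def mmd(Xs, Xt, kernel):
    """Square root of the estimate, clipped at zero against rounding errors."""
    return math.sqrt(max(mmd_squared(Xs, Xt, kernel), 0.0))


# Hypothetical toy samples: source mean (1, 0), target mean (1, 4).
Xs = [[0.0, 0.0], [2.0, 0.0]]
Xt = [[1.0, 3.0], [1.0, 5.0]]

linear_value = mmd(Xs, Xt, linear_kernel)
rbf_self = mmd(Xs, Xs, rbf_kernel)
```

With the linear kernel the estimate coincides with the Euclidean distance between the two sample means, and comparing a sample with itself gives zero for any kernel.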

The DAM framework [12] constructs a domain-dependent regularizer for domain adaptation from multiple sources, which is defined as
$$\Omega(f^{T}) = \frac{1}{2} \sum_{s=1}^{P} \gamma_s \bigl\| f^{T} - f^{s} \bigr\|^2, \tag{14}$$
where $P$ is the number of source domains, and $f^{T}$ and $f^{s}$ are the decision values of the target classifier and of the $s$th source classifier on the unlabeled instances in the target domain. Here the coefficient $\gamma_s$ is set as $\exp(-\beta \times \mathrm{MMD}[\mathcal{F}, S, T]^2)$, with the MMD computed between the $s$th source domain and the target domain. From the definition we can see that the regularizer we use in (12) is much simpler than that in DAM. Moreover, the objective function in DAM consists of three parts; the other two are the regularizer that controls the complexity of the target classifier and the loss of the target classifier, whereas the objective function we use in (12) considers a combination of the losses over the source domain and the target domain [4].
The proposed framework is also suitable for domain adaptation combining multiple sources, where the source empirical risk and the regularization term $D_{\mathcal{F}}(S, T)$ in (12) are each defined as a linear combination of several terms, one per source; see (15) and (16). The generalization bound of domain adaptation from multiple sources has a form similar to (10), where the first term on the right side is a linear combination of several IPMs instead of one; see (16).

Experiments
We first carry out experiments on both simple regression and classification problems to verify the effectiveness of (12). For ease of optimization, we use the least squares loss $\ell(w; \mathbf{x}, y) = (w^{\top} \mathbf{x} - y)^2$. This choice is natural in regression since the target value is continuous, while for binary classification only a few articles have discussed this loss: reference [16] employed it in text classification, and [17] pointed out the rationality of the least squares loss compared with SVM. Since the loss is quadratic while the IPM is expressed as an absolute value under this setting, it is necessary to convert the regularizer into the squared form of the original value to balance the two terms; it can be approximated by the squared gap between the losses on the target domain and the source domain, that is, $(\mathrm{E}^{(T)} \ell - \mathrm{E}^{(S)} \ell)^2$ estimated on the available samples. These choices make the whole objective function, consisting of both the loss function and the regularizer, much easier to optimize. We use the limited-memory BFGS implementation provided by the package yagtom (https://code.google.com/p/yagtom/) in the experiments.
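The objective described above can be sketched as follows, under several stated assumptions: a linear model with least squares loss, the regularizer taken as the squared gap between the target and source empirical risks, and a trade-off weight set to the fraction of target samples. The experiments use limited-memory BFGS; to keep the example dependency-free, this sketch substitutes plain gradient descent with a backtracking line search. All names and the toy data are hypothetical.

```python
def risk_and_grad(w, X, y):
    """Mean squared error of the linear model w and its gradient."""
    n, d = len(X), len(w)
    risk, grad = 0.0, [0.0] * d
    for xi, yi in zip(X, y):
        r = sum(wj * xj for wj, xj in zip(w, xi)) - yi
        risk += r * r / n
        for j in range(d):
            grad[j] += 2.0 * r * xi[j] / n
    return risk, grad


def objective(w, Xs, ys, Xt, yt, tau, lam):
    """tau*R_T + (1-tau)*R_S + lam*(R_T - R_S)^2 together with its gradient."""
    rs, gs = risk_and_grad(w, Xs, ys)
    rt, gt = risk_and_grad(w, Xt, yt)
    gap = rt - rs
    value = tau * rt + (1.0 - tau) * rs + lam * gap * gap
    grad = [tau * a + (1.0 - tau) * b + 2.0 * lam * gap * (a - b)
            for a, b in zip(gt, gs)]
    return value, grad


def minimize(w, args, iters=200):
    """Gradient descent with backtracking (a simple stand-in for L-BFGS)."""
    value, grad = objective(w, *args)
    for _ in range(iters):
        sq_norm = sum(g * g for g in grad)
        if sq_norm < 1e-18:
            break
        step = 1.0
        while step > 1e-12:
            cand = [wj - step * gj for wj, gj in zip(w, grad)]
            cand_value, cand_grad = objective(cand, *args)
            # Accept the step only if it gives a sufficient decrease.
            if cand_value <= value - 1e-4 * step * sq_norm:
                w, value, grad = cand, cand_value, cand_grad
                break
            step *= 0.5
        else:
            break  # no acceptable step was found
    return w, value


# Hypothetical toy problem: source slope 1.0, target slope 1.5.
Xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.0, 1.0, 2.0, 3.0]
Xt = [[1.0], [2.0]]
yt = [1.5, 3.0]
tau = len(Xt) / (len(Xt) + len(Xs))
args = (Xs, ys, Xt, yt, tau, 1.0)

initial_value, _ = objective([0.0], *args)
w_opt, final_value = minimize([0.0], args)
```

On this toy problem the learned slope settles between the source slope and the target slope. Because the regularizer squares a difference of two quadratics, the objective is a quartic in the parameters, so a line search is a safer choice than a fixed step size.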
In the last part of the experiments, we apply the least squares support vector machine (LS-SVM) [18] as the classifier; the loss function is expressed as $\ell(w; \mathbf{x}, y) = (w^{\top} \phi(\mathbf{x}) - y)^2$, where $\phi(\cdot)$ is the feature mapping induced by the kernel function. The usual regularization for LS-SVM is used, $\Omega(w) = \|w\|^2$, where the parameter $\lambda$ controls the balance.
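With the root mean squared error (RMSE) of the fit as the criterion, we compare four settings in the experiments. As the later discussion of the results indicates, setting 1 trains on the small target sample alone, setting 2 uses the source data, setting 3 combines the source and target risks, and setting 4 adds the IPM regularizer to setting 3. We search the parameter $\lambda$ over the range $[2^{-10}, 2^{-9}, \ldots, 2^{10}]$ in setting 4 and set $\tau = N_1 / (N_1 + N)$ in settings 3 and 4, where $N_1$ and $N$ denote the numbers of target and source training samples, following the similar numerical experiments that evaluate the asymptotic convergence in [7]. Each problem is run for 10 rounds, and the average RMSE is recorded as the result. All the results are shown in Table 1.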
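The evaluation protocol above can be sketched with a few helper functions. This is only an illustration: the function names are hypothetical, and the weight is assumed to be the fraction of target samples among all training samples, which is one reading of the formula in the text.

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between predictions and true values."""
    n = len(targets)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n)


def target_weight(n_target, n_source):
    """Assumed trade-off weight: share of target samples in the training set."""
    return n_target / (n_target + n_source)


def average(values):
    """Mean of the per-round scores, e.g. RMSE over 10 rounds."""
    return sum(values) / len(values)


# Candidate regularization weights 2^-10, 2^-9, ..., 2^10.
lambda_grid = [2.0 ** k for k in range(-10, 11)]

example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_tau = target_weight(20, 80)
```

In a full experiment, each value in `lambda_grid` would be used to train a model, and the value with the lowest average RMSE over the rounds would be kept.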
We can see that, in all cases, the RMSE in setting 4 is the smallest. This indicates that domain adaptation with the IPM regularizer obtains better performance than without it.

Classification.
When adopting the square loss function in binary classification, we require the label of a sample $\mathbf{x}$ to satisfy $y \in \{-1, 1\}$. Assume the output for $\mathbf{x}$ is $\hat{y} = w^{\top} \mathbf{x}$; the prediction is correct if $\hat{y} \cdot y > 0$.
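The decision rule above translates directly into an accuracy computation. The sketch below assumes a linear predictor; the function names and the toy data are hypothetical. A score of exactly zero is counted as an error, because the rule requires a strictly positive product.

```python
def predict_score(w, x):
    """Real-valued output of the linear predictor for one sample."""
    return sum(wj * xj for wj, xj in zip(w, x))


def accuracy(w, X, y):
    """Fraction of samples whose score has the same sign as the label."""
    correct = sum(1 for xi, yi in zip(X, y) if predict_score(w, xi) * yi > 0)
    return correct / len(X)


# Hypothetical toy data with labels in {-1, 1}.
w = [1.0, -1.0]
X = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [3.0, 1.0]]
y = [1, -1, 1, -1]

acc = accuracy(w, X, y)
```

Here the first two samples are classified correctly, the third has a zero score, and the fourth has the wrong sign.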
The binary classification tests are carried out on two text datasets: email spam (available at http://www.ecmlpkdd2006.org/challenge.html) and parts of the 20 newsgroups datasets (http://vc.sce.ntu.edu.sg/transfer_learning_domain_adaptation/). The email spam dataset contains a set of 4000 public labeled emails, used here as the target domain data, and three other sets, each of which has 2500 emails annotated by different users and is used as source domain data. In these four datasets, samples are labeled as nonspam ($y = 1$) or spam ($y = -1$). The 20 newsgroups datasets recollected by Duan et al. [12] contain three groups, each of which has a target set with three sources. Details of the datasets used in classification are shown in Table 2.
We therefore have 12 source-target pairs in total for the experiments. In each pair we randomly choose $N_1 = 20$ samples from the target domain to participate in domain adaptation, and the classification accuracy on the rest of the target set is chosen as the evaluation criterion. The parameters $\lambda$ and $\tau$ are picked in the same way as in the regression experiment, and the result for each pair is averaged over 10 runs. The comparison of classification accuracy is listed in Table 3.
As we can see, the domain adaptation with the IPM regularizer can obtain better performance than without it and is even better than just training on small target domain samples in most cases.

Classification with LS-SVM.
In order to improve the classification ability on real datasets, we adopt LS-SVM with a kernel as the predictor. The square of MMD is easily obtained by (19), by expanding the original definition. In the experiments we use the linear kernel for convenience of computing MMD (13); that is, $k(\mathbf{x}^{(S)}, \mathbf{x}^{(T)}) = (\mathbf{x}^{(S)})^{\top} \mathbf{x}^{(T)}$. In this part, we adopt a paradigm of domain adaptation combining multiple sources. As a consequence, in settings 2, 3, and 4, the risk on the source domain is computed by (15), and in setting 4 the IPM regularization term is computed by (16) and (19). In each problem, there are three sources. First of all, we search the regularization parameter $\lambda$ of the single LS-SVM predictor, that is, the weight of $\Omega(w) = \|w\|^2$ in (11), over the range [0.01, 0.1, 1, 10, 100] on the 20 newsgroups datasets. We can see from Figure 1 that the proposed method tends to achieve the best testing accuracy and low standard deviations. On all datasets and for any value of $\lambda$, setting 1 has the lowest testing accuracy and a relatively high standard deviation, due to insufficient training with small amounts of labeled data. Since $\lambda = 0.1$ has the best performance in most cases, we use this value in the following experiments.
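The expansion of the squared MMD mentioned above can be written out as follows. This is the standard kernel expansion of the squared RKHS distance between two sample means, given here as a sketch; the symbols $N_S$ and $N_T$ for the sample sizes, the index names, and the notation $\bar{\mathbf{x}}$ for a sample mean are notational assumptions.

```latex
\begin{aligned}
\mathrm{MMD}^2[\mathcal{F}, S, T]
&= \Bigl\| \frac{1}{N_S} \sum_{i=1}^{N_S} \phi\bigl(\mathbf{x}^{(S)}_i\bigr)
   - \frac{1}{N_T} \sum_{j=1}^{N_T} \phi\bigl(\mathbf{x}^{(T)}_j\bigr) \Bigr\|_{\mathcal{H}}^2 \\
&= \frac{1}{N_S^2} \sum_{i=1}^{N_S} \sum_{i'=1}^{N_S} k\bigl(\mathbf{x}^{(S)}_i, \mathbf{x}^{(S)}_{i'}\bigr)
   - \frac{2}{N_S N_T} \sum_{i=1}^{N_S} \sum_{j=1}^{N_T} k\bigl(\mathbf{x}^{(S)}_i, \mathbf{x}^{(T)}_j\bigr)
   + \frac{1}{N_T^2} \sum_{j=1}^{N_T} \sum_{j'=1}^{N_T} k\bigl(\mathbf{x}^{(T)}_j, \mathbf{x}^{(T)}_{j'}\bigr).
\end{aligned}
```

For the linear kernel, $\phi$ is the identity map, so the expression reduces to $\|\bar{\mathbf{x}}^{(S)} - \bar{\mathbf{x}}^{(T)}\|^2$, the squared Euclidean distance between the two sample means, which can be computed without forming any kernel matrix.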
All results on the datasets listed in Table 2 are shown in Table 4. We can see that, in most cases, the proposed algorithm outperforms the other methods from a statistical perspective. Setting 1 has the worst accuracy, which means that training on small amounts of target data is not sufficient. The accuracy in setting 1 increases as more labeled data become available, which fits the experience of ERM learning. It seems that the performance of setting 2 is even slightly better than that of setting 3 in most cases; thus simply combining the risks over the source and target domains may not work in practice. On the other hand, the IPM regularization term does help to bridge this gap.