Domain transfer learning aims to learn common data representations from a source domain and a target domain so that the source domain data can help the classification of the target domain. Conventional transfer representation learning forces the distributions of the source and target domain representations to be similar, which relies heavily on how the domain distributions are characterized and on the distribution matching criteria. In this paper, we propose a novel framework for domain transfer representation learning. Our motivation is to make the learned representations of data points independent of the domains to which they belong. In other words, given an optimal cross-domain representation of a data point, it should be difficult to tell which domain the point comes from; such representations can then be generalized across domains. To measure the dependency between the representations and the domains to which the data points belong, we propose to use the mutual information between the representations and the domain-belonging indicators. By minimizing this mutual information, we learn representations that are independent of domains. We build a classwise deep convolutional network as the representation model and maximize the margin of each data point of the corresponding class, defined over its intraclass and interclass neighborhoods. To learn the parameters of the model, we construct a unified minimization problem in which the margins are maximized while the representation-domain mutual information is minimized. In this way, we learn representations that are not only discriminative but also independent of domains. An iterative algorithm based on the Adam optimization method is proposed to solve this problem and learn the classwise deep model parameters and the cross-domain representations simultaneously. Extensive experiments over benchmark datasets show the effectiveness of our method and its advantage over existing domain transfer learning methods.
Transfer learning is a machine learning problem which deals with data from two domains [
In this case, it is necessary to map the data points of both domains to a common data space so that they lie in the same distribution, and we can directly train a model for the target domain using the representations of data points from both domains. Another solution is to learn a model for the source domain first and then adapt it to the target domain. In this paper, we focus on the first solution, where the data points are mapped to a common space. This solution aims to learn domain-transferable representations for data points in different domains. Different representation learning methods have been applied to domain-transferable representation learning, including multikernel learning [
In this paper, we study the problem of learning domain transfer representations. However, instead of matching the distributions of the two domains, we learn representations that can be directly generalized to both domains.
In this section, we briefly introduce the state-of-the-art methods for transferable representation learning. The authors in [ The authors in [ The authors in [ The authors in [
All the above methods are based on matching the two domains' distributions of the data representations. The two key components of such methods are the characterization of the distributions and the metric for the mismatch between them. In this paper, we abandon this framework and propose a completely different one for domain transfer learning. We observe that for an ideal representation model across two different domains, one cannot tell from the model's output for a data point which domain the point comes from. At the same time, the output of the cross-domain representation model should still separate the point's true class from the other classes. In other words, the representation of a data point is independent of its domain but closely relevant to its class. Thus, instead of measuring the mismatch between the source and target domain distributions, we measure the dependence between the representations and the domain-belonging indicators of the data points. To measure this dependency, we employ mutual information: by minimizing the mutual information between the representations and the domain indicators, we learn domain-independent representations. Meanwhile, we also propose to maximize the margin of each data point so that it is separated from data points of other classes and kept close to data points of the same class.
Motivated by the above ideas, we propose a novel deep learning model for the representation of data points in transfer learning problems. Firstly, to enhance the ability to discriminate data points of different classes, we propose to learn a unique deep convolutional network for each class, named the classwise convolutional representation model. This differs from traditional domain transfer representation models, which learn a common model for all classes. To make the outputs of this model independent of the domain indicators, we propose to minimize the mutual information between the representation model outputs and the domain indicators. The mutual information estimation is based on the probability of the representations and the conditional probability of the domain indicators given the representations. We develop a novel estimator for the conditional probability of a domain indicator given a representation: it is defined over the neighborhood of the data point and calculates the normalized summation of the soft weights of the neighboring data points that come from the given domain. To make the outputs of the model discriminative, we propose to maximize the margin of each data point in the corresponding class. The margin is defined as the difference between the interclass dissimilarity and the intraclass dissimilarity. The intraclass dissimilarity is defined over an intraclass neighborhood, which contains a set of neighboring data points from the same class, while the interclass dissimilarity is defined over an interclass neighborhood, which contains a set of neighboring data points from the other classes. To learn the representation model parameters, we build a unified learning framework. The objective function is defined by combining the margins, the mutual information, and a squared norm regularization term.
The overall diagram of the proposed learning framework for each class is given in Figure
Our contributions are threefold:
(1) For the first time, the idea of learning cross-domain representations that are independent of domains is proposed for transfer learning. Instead of learning representations and forcing the distributions of the two domains' representations to match each other, we directly learn representations that are independent of their domain-belonging indicators. Mutual information is used to measure the dependency between representations and domain indicators, and it is minimized to seek domain-independent representations.
(2) We develop a novel and practical representation learning method to minimize the mutual information between the data points' representations and their domain indicators. The mutual information is estimated from the probability of a representation and the conditional probability of a domain indicator given a representation. We estimate the conditional probability of a data point's domain indicator given its representation over its neighborhood; it is calculated as a summation of the normalized Gaussian-kernel-based similarities of the data points in the neighborhood that belong to the domain under consideration.
(3) We propose a novel transfer learning framework for learning domain transfer deep representation models. It is a classwise model, and we learn its parameters by simultaneously maximizing the margin of each data point of the class and minimizing the mutual information between the data points' representations and their domain indicators. An iterative algorithm is developed to learn the optimal representations and the parameters of the model that outputs these representations.
Overall diagram of the proposed learning framework for minimum mutual information domain transfer representation.
The paper is organized as follows: in Section
In this section, we give a list of detailed definitions of the symbols used in the following sections.
We assume we have a set of training data points collected from two different domains.
We consider a classification problem over multiple classes defined on these data points.
For the
We choose to learn convolutional representations for the following two reasons: (1) The CNN model is good at extracting local patterns by utilizing a large number of sliding local filters, and in most of the domain transfer applications discussed in this paper, local patterns play the most important role. For example, in the cross-domain image categorization task, for two images from different domains that contain the same object, the CNN model can capture the local region of the object with some local filters while ignoring the context, which may vary across domains. Another example is text-related tasks: long sentences on the same topic may have different linguistic styles in different domains but still contain short phrases, which the CNN model can capture effectively with its sliding local filters. (2) The CNN model, compared to other deep learning models such as the recurrent neural network (RNN), has a more efficient training process. The CNN model has a parallel structure, and the responses of a sliding filter are calculated independently of each other; thus, its computation can easily be parallelized on a GPU. This differs from the RNN model, which has a sequential structure where the response of a node is calculated based on the responses of the previous nodes, making its computation slower than that of the CNN model.
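To make the classwise model concrete, the following is a minimal PyTorch sketch of one class's convolutional representation network. The architecture (channel counts, kernel sizes, representation dimension) is an illustrative assumption on our part and not the paper's specified architecture.

```python
# A minimal sketch of one classwise convolutional representation model.
# Layer sizes are illustrative placeholders, not the paper's settings.
import torch
import torch.nn as nn

class ClasswiseConvNet(nn.Module):
    def __init__(self, in_channels=3, rep_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # pool the local filter responses
        )
        self.proj = nn.Linear(64, rep_dim)  # cross-domain representation

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.proj(h)

# One independent network per class, as in the classwise model.
num_classes = 31  # e.g., Office-31
models = [ClasswiseConvNet() for _ in range(num_classes)]
```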
Naturally, we hope the class-specific convolutional representations can separate the data points of the corresponding class from those of the other classes.
Note that to search for the interclass neighbors of
Similarly, we also calculate the interclass affinity between
The local margin of
We propose to maximize the local margin to improve the ability to separate the data points of the corresponding class from those of the other classes.
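As a concrete reading of the local margin, the following NumPy sketch computes, for one data point, the difference between the average interclass and intraclass dissimilarities over its k nearest neighbors. The squared Euclidean dissimilarity, uniform neighbor weights, and neighborhood size k are our illustrative assumptions; the paper weights neighbors by the intra- and interclass affinities instead.

```python
# Illustrative sketch of the local margin of one data point, assuming
# squared Euclidean dissimilarity and uniform weights over the k
# nearest intraclass and interclass neighbors.
import numpy as np

def local_margin(Z, y, i, k=5):
    d = np.sum((Z - Z[i]) ** 2, axis=1)   # dissimilarities to z_i
    d[i] = np.inf                         # exclude the point itself
    intra = np.argsort(np.where(y == y[i], d, np.inf))[:k]  # same-class neighbors
    inter = np.argsort(np.where(y != y[i], d, np.inf))[:k]  # other-class neighbors
    # Large margin: interclass neighbors are far, intraclass neighbors close.
    return d[inter].mean() - d[intra].mean()
```

Maximizing this quantity pushes the interclass neighbors away while pulling the intraclass neighbors together.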
Since we are considering the problem of domain transfer learning, the training data points are from a source domain and a target domain, and we denote the source domain training set as
According to the probability theory and information theory, the mutual information between two variables is a measure of the mutual dependence between them. For two variables,
According to the mutual information’s relation to Kullback–Leibler divergence,
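For reference, with $z$ denoting the representation and $d$ the domain indicator, the standard definition and its Kullback–Leibler form can be written as:

```latex
% Mutual information between the representation z and the domain
% indicator d, and its expression as the KL divergence between the
% joint distribution and the product of the marginals.
\begin{align}
I(z;d) &= \sum_{d} \int p(z,d)\,\log\frac{p(z,d)}{p(z)\,p(d)}\,\mathrm{d}z
        = \mathrm{KL}\!\left(p(z,d)\,\middle\|\,p(z)\,p(d)\right) \\
       &= \sum_{d} \int p(z)\,p(d\mid z)\,\log\frac{p(d\mid z)}{p(d)}\,\mathrm{d}z .
\end{align}
```

The last form shows why estimating $p(z)$ and $p(d\mid z)$, as done below, suffices to estimate the mutual information.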
To estimate the mutual information over the training set, we propose to recalculate
In the following, we discuss how to estimate the conditional probability of the domain indicator given the convolutional representation.
To estimate the probability of
The weights satisfy the constraints of nonnegativity and normalization (they sum to one).
We assume the convolutional representations are evenly distributed; thus, we use a simple empirical distribution function to calculate the probability of each representation.
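A minimal sketch of this estimator follows, assuming a k-nearest-neighbor neighborhood and a Gaussian kernel with bandwidth sigma; both are illustrative choices on our part.

```python
# Sketch of the neighborhood-based estimator of p(d | z_i): Gaussian
# soft weights over the k nearest neighbors of z_i, normalized to sum
# to one, then summed over the neighbors that belong to domain d.
import numpy as np

def p_domain_given_rep(Z, dom, i, d, k=10, sigma=1.0):
    dist = np.sum((Z - Z[i]) ** 2, axis=1)    # squared distances to z_i
    dist[i] = np.inf                          # exclude the point itself
    nbrs = np.argsort(dist)[:k]               # neighborhood of z_i
    w = np.exp(-dist[nbrs] / (2 * sigma**2))  # Gaussian soft weights
    w /= w.sum()                              # normalize to sum to one
    return w[dom[nbrs] == d].sum()            # weight mass of domain-d neighbors
```

With the uniform empirical distribution over representations, these estimates plug directly into the mutual information formula above.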
Substituting equations (
To simplify the equations, we introduce the following variables:
We rewrite equation (
To learn a cross-domain representation that maps the data of both domains to a common space, we reduce the dependency between the domain indicator and the convolutional representation variables as much as possible. Since the mutual information measures this dependency, we minimize it.
In this way, we hope that the learned representations are as independent of the domains as possible so that they can be generalized to both domains.
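Under the uniform empirical distribution over the training representations, the plug-in estimate implied by the estimators above can be written as follows. This is our rendering, with $\hat{p}(d\mid z_i)$ the neighborhood estimator; the paper's exact regrouped form may differ.

```latex
% Plug-in estimate of the representation-domain mutual information:
% hat-p(d | z_i) is the neighborhood-based estimator and hat-p(d) its
% average over the training set; s and t index the source and target
% domains.
\begin{equation}
\widehat{I}(z;d) = \frac{1}{n}\sum_{i=1}^{n}\sum_{d\in\{s,t\}}
\hat{p}(d\mid z_i)\,\log\frac{\hat{p}(d\mid z_i)}{\hat{p}(d)},
\qquad
\hat{p}(d) = \frac{1}{n}\sum_{i=1}^{n}\hat{p}(d\mid z_i).
\end{equation}
```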
To construct the learning framework for the domain adaptation problem based on the classwise deep CNN representation model, we combine the objectives of equations (
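A plausible form of this combined objective for one class's model, written in the notation of the sketches above, is given below; the tradeoff weights $\lambda_1,\lambda_2$ and the exact grouping of terms are assumptions on our part.

```latex
% Assumed unified objective: maximize the local margins m_i, minimize
% the mutual information estimate, and penalize the CNN parameters
% Theta_c of the c-th class model with a squared norm; lambda_1 and
% lambda_2 are tradeoff weights.
\begin{equation}
\min_{\Theta_c,\,\{z_i\}} \; -\sum_{i=1}^{n} m_i
\;+\; \lambda_1\, \widehat{I}(z;d)
\;+\; \lambda_2\, \lVert \Theta_c \rVert_2^2 .
\end{equation}
```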
It is difficult to solve the problem of equation (
To solve this problem, we use the alternating direction method of multipliers (ADMM). Following ADMM, we have the following optimization problem:
The updating of
Updating of
We also use the backpropagation algorithm to solve this problem, based on the chain rule:
The dual variable is updated by gradient ascent:
In this section, we give the overall iterative learning algorithm of the proposed minimum mutual information transfer representation (MMITR) method. The algorithm has an updating strategy similar to the expectation-maximization (EM) algorithm. In each iteration, we first fix the domain transfer representations to update the inter- and intraclass affinity measures in an E-step, and then fix the inter- and intraclass affinity measures to update the CNN parameters and the representations in an M-step. The iterations stop when a maximum iteration number is reached or the objective value falls below a threshold. The overall algorithm is described in Algorithm
Initialize the iteration indicator.
Initialize the model parameters and the objective value.
Update the domain transfer representations by iterating the gradient descent steps of equation (
Update the CNN model parameters by iterating the backpropagation steps of equation (
Update the dual variables by iterating the gradient ascent steps of equation (
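The loop structure of this algorithm can be sketched as follows for one class's model. The names `train_mmitr`, `compute_affinities`, `total_margin`, and `mi_estimate` are our placeholders standing in for the paper's equations, and the learning rate, penalty weight rho, and tolerance are illustrative; only the EM-like alternation and the use of Adam follow the description above.

```python
# Structural sketch of the per-class training loop: an EM-like
# alternation between affinity updates (E-step) and representation /
# parameter / dual updates (M-step), using scaled-dual ADMM with the
# constraint Z = model(X).
import torch

def train_mmitr(model, X, y, dom, compute_affinities, total_margin,
                mi_estimate, max_iter=50, tol=1e-4, rho=1.0, lr=1e-3):
    Z = model(X).detach().requires_grad_(True)  # domain transfer representations
    U = torch.zeros_like(Z)                     # scaled ADMM dual variables
    opt_z = torch.optim.Adam([Z], lr=lr)
    opt_w = torch.optim.Adam(model.parameters(), lr=lr)
    prev = float("inf")
    for _ in range(max_iter):
        # E-step: recompute intra-/interclass affinities with Z fixed.
        aff = compute_affinities(Z.detach(), y)
        # M-step (1): gradient descent on the representations Z.
        opt_z.zero_grad()
        loss = (-total_margin(Z, y, aff) + mi_estimate(Z, dom)
                + (rho / 2) * ((Z - model(X) + U) ** 2).sum())
        loss.backward()
        opt_z.step()
        # M-step (2): backpropagation on the CNN parameters.
        opt_w.zero_grad()
        fit = (rho / 2) * ((Z.detach() - model(X) + U) ** 2).sum()
        fit.backward()
        opt_w.step()
        # M-step (3): gradient ascent on the dual variables.
        U = U + (Z.detach() - model(X).detach())
        if abs(prev - loss.item()) < tol:  # objective-based stopping
            break
        prev = loss.item()
    return model, Z.detach()
```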
When we have a new data point,
In our experiments, we use the following benchmark datasets:
- Office-31: this dataset has 4,652 images of 31 classes. The images are from three different domains: Amazon (images downloaded from amazon.com), Webcam, and DSLR.
- ImageCLEF-DA: this dataset is composed of images of 12 classes from four domains. Each domain is a distinct database: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012, and Bing.
- Email spam: this dataset contains spam and nonspam email texts. The data were collected from three users, each with 2,500 emails; each user is treated as a domain.
- Extended Cohn-Kanade (CK+): this dataset is an image set for facial expression recognition. It contains images of 123 subjects, and each subject is treated as a domain. The dataset has 593 videos in total, with about 20 frames per video. Each face image belongs to one of 7 expression classes.
- Amazon: this is a text classification dataset. The texts are from three domains, corresponding to reviews of books, DVDs, and music; each domain has 2,000 positive and 2,000 negative review texts.
In this experiment, we use each domain of a dataset as the target domain in turn, with the remaining domains as source domains. For each target domain, we use the leave-one-out protocol to split it into a training set and a test set: each data point of the target domain is used as a test data point in turn, and the remaining data points form the training set. The training set of the target domain is randomly split into an unlabeled set and a labeled set of equal size. The data points of the source domains are always treated as labeled in our setting. Our algorithm is performed over the training set to learn the parameters of the classwise representation models and the domain-independent representations, which are then used to classify the test data points. Classification accuracy is used to evaluate the performance of the algorithm.
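A sketch of this protocol follows, assuming in-memory arrays; the `run_mmitr` callable (train on the given splits, then classify one held-out point) is a placeholder for the full method, and the function name is ours.

```python
# Sketch of the leave-one-out evaluation protocol: hold out each target
# point in turn, split the remaining target points half labeled / half
# unlabeled, and keep all source points labeled.
import numpy as np

def leave_one_out_accuracy(X_src, y_src, X_tgt, y_tgt, run_mmitr, seed=0):
    rng = np.random.default_rng(seed)
    correct, n = 0, len(X_tgt)
    for i in range(n):
        rest = np.delete(np.arange(n), i)  # remaining target points
        rng.shuffle(rest)
        half = len(rest) // 2
        labeled, unlabeled = rest[:half], rest[half:]  # equal-size split
        pred = run_mmitr(X_src, y_src,                 # labeled source data
                         X_tgt[labeled], y_tgt[labeled],
                         X_tgt[unlabeled],             # unlabeled target data
                         X_tgt[i])                     # held-out test point
        correct += int(pred == y_tgt[i])
    return correct / n
```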
In our experiments, we first study the properties of the algorithm empirically, including its sensitivity to the tradeoff parameters and its convergence with respect to the number of iterations. Then, we compare its performance against state-of-the-art domain transfer learning algorithms.
Tradeoff parameter sensitivity study of
The accuracy curves of classification with different values of
Tradeoff parameter sensitivity study of
Convergence analysis. (a) Office-31. (b) ImageCLEF-DA. (c) Email spam. (d) CK+. (e) Amazon.
To solve the minimization problem in our algorithm, we employ the ADMM algorithm. To verify that the ADMM algorithm solves the minimization problem effectively, we plot the objective values of the learning problem against increasing numbers of iterations in Figure
Objective value curve of ADMM algorithm. (a) Office-31. (b) ImageCLEF-DA. (c) Email spam. (d) CK+. (e) Amazon.
Definitions of symbols.

Symbol | Definition |
---|---|
| Set of data points of the |
| Set of unlabeled data points |
| Domain transfer representation of a data point |
| Parameters of the CNN model of the corresponding class |
| Intraclass neighborhood of a data point |
| Interclass neighborhood of a data point |
| Intraclass affinity between two data points |
| Interclass affinity between two data points |
| Domain indicator of a data point |
| Soft weight of a data point |
We compare our algorithm, MMITR, to several state-of-the-art transfer learning algorithms, including the Deep Adaptation Network (DAN) [
Compared methods.
Method | Data representation | Domain matching criterion |
---|---|---|
MMITR | CNN | Mutual information minimization |
DAN | CNN | Maximum mean discrepancy (MMD) |
STM | — | Kernel mean-matching (KMM) |
SSKMDA | Multikernel learning | Hilbert-Schmidt independence criterion (HSIC) |
DTMKL | Multikernel learning | MMD |
The comparison of accuracy results is given in Figure
Comparison of accuracy of state of the arts. (a) Office-31. (b) ImageCLEF-DA. (c) Email spam. (d) CK+. (e) Amazon.
The settings of the methods whose results are reported in Figure 6 are described in detail as follows. The DAN algorithm has two hyperparameters: the MMD penalty
In this paper, we have proposed a novel framework for transfer learning. Unlike traditional transfer learning, which tries to match the representations of the source domain and the target domain, we propose to learn domain-independent representations. We measure the dependency between the learned deep representations and the domain by mutual information and learn domain-independent deep representations by minimizing this mutual information. We also propose a practical estimation method for the mutual information between the domain and the deep representations. A classwise deep representation neural network is trained under this framework and used to classify new data points. Experiments over benchmark datasets for transfer learning verify the effectiveness of the proposed method.
The new concept proposed in this paper is a domain transfer learning framework that minimizes the mutual information between the domain transfer representations and the domain indicators, so that the gaps among domains can be effectively bridged and a common representation space can be learned. The new method developed in this paper is a novel iterative learning algorithm that learns the domain transfer representations based on CNN models.
All the datasets used in this paper are publicly accessible.
The authors declare that they have no conflicts of interest.