Low-Rank Deep Convolutional Neural Network for Multitask Learning

In this paper, we propose a novel multitask learning method based on the deep convolutional network. The proposed deep network has four convolutional layers, three max-pooling layers, and two parallel fully connected layers. To adjust the deep network to multitask learning problem, we propose to learn a low-rank deep network so that the relation among different tasks can be explored. We proposed to minimize the number of independent parameter rows of one fully connected layer to explore the relations among different tasks, which is measured by the nuclear norm of the parameter of one fully connected layer, and seek a low-rank parameter matrix. Meanwhile, we also propose to regularize another fully connected layer by sparsity penalty so that the useful features learned by the lower layers can be selected. The learning problem is solved by an iterative algorithm based on gradient descent and back-propagation algorithms. The proposed algorithm is evaluated over benchmark datasets of multiple face attribute prediction, multitask natural language processing, and joint economics index predictions. The evaluation results show the advantage of the low-rank deep CNN model over multitask problems.


Backgrounds
In machine learning applications, multi-task learning has been a popular topic (Argyriou, Evgeniou and Pontil, 2007;Baniata, Park and Park, 2018;Caruana, 1997;Doulamis and Voulodimos, 2016;Evgeniou and Pontil, 2004;Jacob, Vert and Bach, 2009;Mao, Mu, Zheng and Yan, 2014;Ruder, 2017;Xue, Liao, Carin and Krishnapuram, 2007;Zhou, Wang, Jiang and Li).It tries to solve multiple related machine learning problems simultaneously.The motive is that for many situations, multiple tasks are closely related, and the prediction results of different tasks should be consistent.Accordingly, borrowing the prediction of other tasks to help the prediction of a given task is natural.For example, in the face attribute prediction problem, given an image, the prediction of female gender and wearing long hair is usually related (Cho and Wong, 2008;Owusu, Zhan and Mao, 2014;Wong and Cho, 2010;Yao, Chen, Jia and Liu, 2018;Zhong, Sullivan and Li, 2016).
Moreover, in the problem of natural language processing, it is also natural to leverage the problems of part-of-speech (POS) tagging and noun chuck prediction, since a word with a POS of a noun usually appears in a noun chunk (Collobert and Weston, 2008;Herath, Ikeda, Ishizaki, Anzai and Aiso, 1992;Jabbar, Iqbal, Akhunzada and Abbas, 2018;Lyon and Frank, 1997;Shams and Mercer, 2016).Multi-task learning aims to build a joint model for multiple tasks from the same input data.
In recent years, deep learning has been proven to be the most powerful data representation method (Chao, Zhi, Dong and Liu, 2018;Chu, Huang, Xie, Tan, Kamal and Xiong, 2018;Geng, Zhang, Li, Gu, Liang, Liang, Wang, Wu, Patil and Wang, 2017;Glorot, Bordes and Bengio, 2011;Guo, Liu, Oerlemans, Lao, Wu and Lew, 2016;Hu, Wang, Peng, Qiu, Shi and Liu, 2018;Längkvist, Karlsson and Loutfi, 2014;LeCun, Bengio and Hinton, 2015;Ngiam, Khosla, Kim, Nam, Lee and Ng, 2011;Sadouk, Gadi and Essoufi, 2018;Schmidhuber, 2015;Voulodimos, Doulamis, Bebis and Stathaki, 2018a;Voulodimos, Doulamis, Doulamis and Protopapadakis, 2018b;Wu, Zhai, Li, Cui, Wang and Patil;Zhang, Liang, Li, Fang, Wang, Geng and Wang, 2017;Zhang, Liang, Su, Qu and Wang, 2018a).Deep learning methods learn a neural network of multiple layers to extract the hierarchical patterns from the original data, and provide high-level and abstractive features for the learning problems.For example, for the face recognition problems, a deep learning model learn simple patterns by the low-level layers, such as lines, edges, circles, and squares.In the median-level layers, parts of faces are learned, such as eyes, noses, mouths, etc.In the high-level layers, patterns of entire faces of different users are obtained.With the deep learning model, we can explore the hidden but effective patterns from the original data directly with multiple layers, even without domain knowledge and hand-coded features.This is a critical advantage compared to traditional shallow learning paradigms models.
Remark: If shallow learning paradigms is applied in this case, the model structure will not be sufficient to extract complex hierarchical features.The users of these shallow learning models have to code all these complex hierarchical features manually in the feature extraction process, which is difficult and some times impossible.
Remark: If other non-neural networks learning models is used, such as spectral clustering, the hidden pattern of input data features cannot be directly explored.For example, spectral clustering treats each data point as a node in a graph, and sperate them by cutting the graph.However, it still needs a powerful data representation method to build the graph, and itself cannot work well with a high-quality graph.Meanwhile, neural network models, especially deep neural network models, have the ability to represent the hidden patterns of input data points and build the high-quality graph accordingly.Thus the non-neural network models and neural network models are complementary.Most recently, deep learning has been found very effective for multi-task learning problems.For example, the following works have discussed the usage of deep learning for multi-task prediction.
• Zhang et al. (Zhang, Luo, Loy and Tang, 2014) formulated a deep learning model constrained by multiple tasks, so that the early stopping can be applied to different tasks to obtain good learning convergence.Furthermore, different tasks regarding face images, including facial landmark detection, head pose estimation, and facial attribute detection are considered together by using a common deep convolutional neural network.• Liu et al. (Liu, Gao, He, Deng, Duh and Wang, 2015a) proposed a deep neural network learning method for multi-task learning problems, especially for learning representations across multiple tasks.The proposed method can combine cross-task data, and also regularize the neural network to make it generalized to new tasks.It can be used for both multiple domain classification problems and information retrieval problems.• Collobert et al. (Collobert and Weston, 2008) proposed a convolutional neural network for multi-task learning problem in natural language processing applications.The targeted multiple tasks include POS tagging, noun chunk prediction, named entity recognition, etc.The proposed network is not only applied to multi-task learning, but also applied to semi-supervised learning, where only a part of the training set is labeled.• Seltzer et al. (Seltzer and Droppo, 2013) proposed to learn a deep neural network for multiple tasks which shares the same data representations.This model is used to the applications of acoustic models with a primary task and one or more additional tasks.The tasks include phone labeling, phone context prediction, and state context prediction.
However, the relation among different tasks is not explored explicitly.Although the deep neural model can learn effective high-level abstractive features, without explicitly explore the relation of different tasks, different groups of level features may be used to different tasks.Thus the deep features are separated for different tasks, and the relationships among different tasks are ignored during the learning process of the deep network.
To solve this problem, we propose a novel deep learning method by regularizing the parameters of the neural network regarding multiple tasks by low-rank.

Our Contributions
The proposed deep neural network is composed of four convolutional layers, three maxpooling layers, and two parallel fully connected layers.The convolutional layers are used to extract useful patterns from the local region of the input data, and the max-pooling layers are used to reduce the size of the intermediate outputs of convolutional layers while keeping the significant responses.The last two fully connected layers are used to map the outputs of convolutional and max-pooling layers to the labels of multiple tasks.
The rows of the transformation matrices of the full connection layers are corresponding to the mapping of different tasks.We assume that the tasks under consideration are closely related, thus the rows of the transformation matrices are not completely independent to each other, thus we seek such a transformation matrix with a minimum number of independent rows.We use the rank of the transformation matrix to measure the number of the independent rows and measure it by the nuclear norm.During the learning process, we propose to minimize the nuclear norm of one fully connected layer's transformation matrix.Meanwhile, we also assume that for a group of related tasks, only all the high-level features generated by the convolutional layers and max-pooling layers are useful.Thus it is necessary to select useful features.To this end, we propose to seek sparse rows for the second fully connected layer.The sparsity of the second transformation matrix is measured by its ℓ 1 norm, and we also minimize it in the learning process.Of course, we hope the predictions of the two fully connected layers could be low-rank and sparse simultaneously, and also consistent with each other.Thus we propose to minimize the squared ℓ 2 norm distance between the prediction vectors of the two fully connected layers.Meanwhile, we also reduce the prediction error and the complexity of the filters of the convolutional layers measured by the squared ℓ 2 norms.The objective function is the linear combination of these terms.
We developed an iterative algorithm to minimize the objective function.In each it-eration, the transformation matrices and the filters are updated alternately.The transformation matrices are optimized by the gradient descent algorithm, and the filters are optimized by the back-propagation algorithms.

Paper Origination
The rest parts of this paper are organized as follows.In section 2, we introduce the proposed method by modeling the problem as a minimization problem and develop an iterative algorithm to solve it.In section 4, we conclude the paper.

Problem Modeling
Suppose we have a set of n data points for the training process, denoted as {x 1 , • • • , x n }, where x i is the i-th data point.x i could be an image (presented as a matrix of pixels) or text (a sequence of embedding vectors of words).The problem of multi-task learning is to predict the label vectors of m tasks.For x i , the label vector is denoted as , where y ij = 1 if x i is a positive sample for the j-th task, and To this end, we build a deep convolutional network to map the input data point to an output label vector.The network is composed of 4 convolutional layers, 3 max-pooling layers, and 2 parallel fully-connected layers.The structure of the deep network is given in Figure 1.Please note that for different types of input data, the convolutional and maxpooling layers are adjustable.For matrix inputs such as images, the layers perform 2-D convolution and 2-D max-pooling, while for sequences such as text, the layers conduct 1-D convolution and 1-D max-pooling.
We denote the intermediate output vector of the first 7 layers as φ(x) ∈ R p , where x is the input, and p is the number of pools of the last max-pooling layer.The set of filters in the convolutional layers of φ(x) are denoted as Φ.The outputs of the two parallel fully connected layers are denoted as where W 1 ∈ R m×p and W 2 ∈ R m× are the transformation matrix of the two layers.The two fully connection layers map the a p-dimensional vector φ(x) to two vectors of m scores for m tasks.Each score measures the degree of the given data point belonging to the positive class.The two fully connected layers are corresponding to the low-rank and sparse prediction results of the network.By fusing their results, we can explore both the low-rank structure of the prediction scores of multi-tasks, and also the sparse structure of the deep features learned from the network.In our model, the first fully connected layer f 1 (x) is responsible for the low-rank structure, while the second fully connected layer f 1 (x) is responsible for the sparse structure.
The final outputs of the network are the summation of the outputs of the two fully connected layers, (2) To learn the parameters of the deep network of g 1 (x), we consider the following four problems.
• Low-Rank Regularization As we discussed earlier, the tasks are not completely independent from each other, but they are closely related to each other.To explore the relationships between different tasks, we learn a deep and shared representation φ(x) for the input data x.Based on this shared representation, we also request the transformation matrix W 1 of one of the last fully connected layer to be of lowrank.The motive is that the m columns of W 1 actually map the representation φ(x) to the m scores of m-tasks.The rank of W 1 measures the maximum number of linearly independent columns of W 1 .Thus by minimizing the rank of W 1 , we can impose the mapping functions of different tasks to be dependent on each other and minimize the number of independent tasks.To measure the rank the matrix W 1 , rank(W 1 ), we use the nuclear norm of W 1 , denoted as ||W 1 || * .||W 1 || * is calculated as the summation of its singular values, where ̺ l is its l-th singular value.We propose to learn W 1 by regularizing its rank as follows, min • Sparse Regularization We further regularize the mapping transformation matrix of the second fully connected layer by sparsity.The motive of the sparsity is that the effective deep features for different tasks might be different, and for each task, not all the feature are needed.Although we learn a group of deep features in φ(x) and share it with all the tasks, for a specific task and its relevant tasks, only a small number of deep features are necessary, and feature selection is a critical step.For the purpose of features selection, we impose the sparsity penalty to the transformation matrix of the second fully connected layer, W 2 , since it maps the deep features to the prediction scores of m tasks.To measure the sparsity of W 2 , we use the ℓ 1 norm of W 2 , which is the summation of the absolute values of all the elements of the matrix, We minimize the ℓ • Prediction Consistency The outputs of the two fully connected layers of lowrank and sparsity may give different results.However, we how they can be consistent with each other so that the prediction results can be low-rank and sparse simultaneously.To this end, we impose to minimize the squared ℓ 2 norm distance between the prediction results of the two layers over all the training data points, min • Prediction Error Minimization We also propose to learn an effective multi-task predictor by minimizing the prediction error.To measure the prediction error of a data point, x, we calculate the squared ℓ 2 norm distance between its prediction result g(x) and its true label vector y, We learn the parameters of the deep network by minimizing the errors over all the training data points, min • Complexity Reduction Finally, we regularize the filters of the convolutional layers, Φ, by the squared ℓ 2 norms to prevent the network from being over complex, min The overall optimization problem is the weighted combination of the problems above, min where C 1 , C 2 , C 3 and C 4 are the weights of different objective terms, and g is the overall objective function.By optimizing this problem, we can obtain a deep convolutional network with a low-rank and sparse deep features for the problem of multi-task learning.

Optimization
To solve the problem in ( 12), we use the alternate optimization method.The parameters are updated iteratively in an iterative algorithm.When one parameter is updated, others are fixed.In the following sections, we will discuss how to solve them separately.

Updating W 1
When we update W 1 , we fix W 2 and Φ, remove the terms irrelevant to W 1 from (12), and obtain the following optimization problem, min where g 1 is the objective function of this problem.To solve this problem, we use the gradient descent algorithm.W 1 is descended to the direction of the gradient of g 1 (W 1 ), where ∇g 1 (W 1 ) is the gradient function of g 1 (W 1 ), and ς is the descent step size.To calculate the gradient function ∇g 1 (W 1 ), we first split the objective into two terms, g 1 (W 1 ) = g 11 (W 1 ) + g 12 (W 1 ), where and The first term g 11 (W 1 ) is a quadratic term while g 12 (W 1 ) is a unclear term.Thus the gradient function of g 1 (W 1 ) is the sum of the gradient functions of the two terms, where ∇g 11 (W 1 ) can be easily obtained as To obtain the gradient function of g 12 (W 1 ) = C 2 W 1 * , we first decompose W 1 by singular value decomposition (SVD), where U and V are the two orthogonal matrices, Σ is a diagonal matrix containing all the singular values.According to the Proposition 1 of (Zhen, Yu, He and Li, 2017), the gradient of

Updating W 2
To update W 2 , we also fix other parameters and remove the irrelevant terms, where is a quadratic term, and is a ℓ 1 norm term.We also use the gradient descent algorithm to update W 2 , To obtain the gradient function of g 21 (W 2 ), we rewrite W 2 and W 2 1 as follows, w 2i 1 , and and The gradient function of g 22 (W 2 ) regarding W 2 , we decompose the problem to the gradients of g 22 regarding to differen rows of W 2 , since in the problem the rows are independent to each other, where ∇g 22 (w 2i ) is the gradient of g 22 regarding w 2i , and according to (23), we have the sub-gradient of g 22 as follows,

Updating Φ
To optimize the filters of the deep network, we fix both W 1 and W 2 and use the backpropagation algorithm based on the chain rule.The corresponding problem is given as follows, is a data point-wise term.Back propagation is based on gradient descent algorithm, and according to the chain rule, where

Experiments
In this section, we test the proposed method over several multi-task learning problems and compare it to the state-of-the-art deep learning methods for the multi-task learning problem.

Experiment Setting
We test the proposed method over the following benchmark data sets.
• Large-scale CelebFaces Attributes (CelebA) Dataset The first dataset we used is a face image data set, named CelebA Dataset (Liu, Luo, Wang and Tang, 2015b).This data set has 202,599 images and each image has 40 binary attributes, such as wearing eyeglasses, wearing hats, having a pointy nose, smiling, etc.The prediction of each attribute is treated as a task, thus this is 40-task multi-task learning problem.The input data is image pixels.The downloading URL for this data set is at http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.• Annotated Corpus for Named Entity Recognition The second data set we used is a data set for named entity recognition.It contains 47,959 sentences, which contains 1,048,576 words.Each word is tagged by a named entity type, such as Geographical Entity, Organization, Person, etc, or a non-named entity.Moreover, each work is also tagged by a part-of-speech (POS) type, such as noun, pronoun, adjective, determiner, verb, adverb, etc.Meanwhile, we also have the labels of noun chunk.We have three tasks for each work, named entity recognition (NER), POS labeling, and noun clunking.For each word, we use a window of size 7 to extract the context, and the embedding vectors of the words in the window are used as the input.This dataset can be downloaded from https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus.• Economics The third data set we used is a data set for tasks of property price trend and stock price trend prediction.The input data is the wave of historical data of property prices and stock prices, and each data point is the data of three months of both prices of properties and stocks, and the label of each data point is the trend of stock price and property price.We collect the data of last 20 years of USA and China, and generate a total number of 480 data points.
In the experiments, we split an entire data set to a training set and a test set of the equal sizes.The training set is used to learn the parameters of the deep network, and then we use the test set to evaluate the performance of the proposed learning method.To measure the performance, we use the average accuracy for different tasks.

Comparison of prediction accuracy of different methods
We compare the proposed method against several deep learning-based multi-task methods, including the methods proposed by Zhang et al. (Zhang et al., 2014), Liu et al. (Liu et al., 2015a), Collobert et al. (Collobert andWeston, 2008), and Seltzer et al. (Seltzer and Droppo, 2013).The results are reported in Fig. 2. According to the results, the proposed methods always achieves the best prediction performances, over three multitask learning tasks, especially in the NER and Economics.For the Economics benchmark dataset, our method is the only method which obtains an average prediction accuracy higher than 0.80, while the other methods only obtain accuracies lower than 0.75.This is not surprising since our method has the ability to explore the inner relation between different tasks by the low-rank regularization of the weights of the CNN model for different tasks.In the Economics benchmark data set, the number of training examples is small, thus it is even more necessary to borrow the data representation of different tasks.For the CelebFaces data set, the improvement of the proposed method over the other methods are slight.Moreover, we also observe that the methods Zhang et al. (Zhang et al., 2014) and Liu et al. (Liu et al., 2015a) outperforms the methods of Collobert et al. (Collobert and Weston, 2008), and Seltzer et al. (Seltzer and Droppo, 2013) in most cases.

Comparison of running time of different methods
We also report the running time of the training processes of the compared methods in

Influence of tradeoff parameters
In our method, there are four important of tradeoff parameters, which controls the weights of the terms of classification errors, the rank of the weight matrix, and the ℓ 1 norm sparsity of the weight matrix, and the consistency of predictions of the sparse model and low-rank model.The four tradeoff parameters are C 1 , C 2 , C 3 and C 4 .We study the influences of the changes of their values to the prediction accuracy and report the results of our method with varying values of these parameters in Fig. 4. We have the following observations as follows.
• According to the results in Fig. 4, when the values of C 1 increase from 0.01 to 100, the prediction accuracy keeps growing.This is due to the fact that this parameter is the weight of the classification error term, and when its value is increasing the classification error over the training set plays a more and more important role in the learning process, thus it boosts the classification performance accordingly.But when its value is larger than 100, the performance improvement is not significant anymore.
• When the values of C 2 increases, the performance of the proposed keeps improving.This is due to the importance of the low-rank regularization of the proposed method.C 2 controls the weight of the low-rank regularization term, and it is the key to explore the relationships among different tasks of multi-task problem.This is even more obvious for the Economics data set, where the data size is small, and cross-task information plays a more important role.• The proposed algorithm seems stable to the changes in the values of C 3 , which is the weight of the sparsity term of the objective.This term plays the role of feature selection over the convolutional representation of the input data.The stability over the changes of C 3 implies that the convolutional features extracted by our model already give good performances, thus the feature selection does not significantly improve the performances.• For the parameter C 4 , the average accuracy improves slightly when its value increases until it reaches 100, then the performances seem to decrease slightly.This suggests that the consistency between sparsity and low-rank somehow improves the performance, but it does not always help.For forcing the consistency with a large weight for the consistency term, the performance will not be improved.

Conclusion
In this paper, we proposed a novel deep learning method for the multi-task learning problem.The proposed deep network has convolutional, max-pooling, and fully connected layers.The parameters of the network are regularized by low-rank to explore the relationships among different tasks.Meanwhile, it also has the function of deep feature selection by imposing sparsity regularization.The learning of the parameters are modeled as a joint minimization problem and solved by an iterative algorithm.The experiments over the benchmark data sets show its advantage over the state-of-the-art deep learning-based multi-task models.In the future, we will apply the proposed method to other fields, such as global forest product prediction (Jiang, Carter, Fu, Jacobson, Zipp, Jin and Yang, 2019;Liu, Gao, Chen, Fu, Jiang, Wang and Kou, 2019;Liu, Wang and Fu, 2018), biomaterials science (Zhang, Kramer, Smith, Allen, Leeper, Li, Morton, Gallazzi and Ulery,  2018), bio-informatics, (Gui, Ma, Wang and Wilkins, 2016;Gui et al., 2016), wireless networks (Liu, Chen and Zhan, 2013;Liu, Zhan and Chen, 2014;Yang and Zhao, 2015;Yang, Zhao, Hai, Liu, Hoi and Li, 2016), information communication (Wang, Millet and Smith, 2014, 2016), etc. learning, in: European Conference on Computer Vision, Springer. pp. 94-108. Zhen, X., Yu, M., He, X., Li, S., 2017. Multi-target

Figure 2 .
Figure 2. Prediction performance of compared methods over benchmark data sets.

Figure 3 .
Figure 3. Running time of compared methods over benchmark data sets.

Fig. 3 .
According to the results reported in the figure, the training process of Seltzer et al. (Seltzer and Droppo, 2013)'s method is the longest, and the most efficient method is Collobert et al. (Collobert and Weston, 2008)'s algorithm.Our method's running time of the training process is longer than Zhang et al. (Zhang et al., 2014) and Collobert et al. (Collobert and Weston, 2008)'s methods, but still acceptable for the data sets of CelebFaces and NER.While for the training process over the Economics benchmark data set, the running time is very short compared to the other two data sets, since its size is relatively small.

Figure 4 .
Figure 4. Influences of tradeoff parameters over benchmark data set.