A Multitask Learning Model with Multiperspective Attention and Its Application in Recommendation

Training models to predict click and order targets at the same time, i.e., multitask learning, is one of the most important methods in e-commerce for improving user satisfaction and business effectiveness. Some existing studies model user representation based on the historical behaviour sequence to capture user interests. However, it is often the case that user interests drift away from past routines. Multi-perspective attention, in contrast, has a broad horizon, which covers different characteristics of human reasoning, emotion, perception, attention, and memory. In this paper, we introduce multi-perspective attention and sequence behaviour into multitask learning. Our proposed method offers a better understanding of user interests and decisions. To achieve more flexible parameter sharing while maintaining the specific feature advantage of each task, we improve the attention mechanism from the view of expert interaction. To the best of our knowledge, we are the first to propose the implicit interaction mode, the explicit hard interaction mode, the explicit soft interaction mode, and the data fusion mode in multitask learning. We conduct experiments on public data and lab medical data. The results show that our model consistently achieves remarkable improvements over the state-of-the-art methods.


Introduction
In the real world, there are many scenarios for multitask learning. In the e-commerce field, we need to increase the click-through rate (CTR) and the order conversion rate (CVR) at the same time. In the music field, we need to improve the song opening rate and the effective playback rate. In Chinese medical case recommendation, we need to improve both the click rate of medical records and the user satisfaction. To improve recommendation accuracy, Chen et al. [1] propose an improved collaborative filtering algorithm, which introduces the Bhattacharyya similarity calculation into the traditional calculation formula. However, single-task learning cannot take multiple indicators into account at the same time. In this context, the study of multitask learning emerges. On the basis of the shared bottom, the multi-gate mixture of experts (MMOE) [2] designs different gate networks for different tasks. By updating the weights of the experts, it better describes the characteristics of all tasks, and it brings an improved effect on tasks that are not very related to each other. In video recommendation, in order to improve user engagement and user satisfaction, Zhao et al. [3] propose a shallow subnetwork, which also solves the online and offline problem of sample bias. As is known to all, order behaviour occurs after the click action. The model training process is performed in the click sample subspace but applied in the entire space online, which causes sample deviation. Wen et al. [4] add these intermediate behaviours to the model by improving the loss function. Previous multitask learning manually tunes hyperparameters, which cannot balance network flexibility and performance cost.
Subnetwork routing (SNR) [5] is not sensitive to the strength of the correlation between tasks; it can learn a good structure and realize flexible parameter sharing. Qin et al. [6] propose a model that combines MMOE and Long Short-Term Memory (LSTM), applying the user behaviour sequence feature in multitask learning scenarios. Real application scenarios always face the challenges of data sparsity, data heterogeneity, and complex multiobjectives, which the MMOE and LSTM try to solve. The Progressive Layered Extraction (PLE) [7] network is proposed to alleviate the seesaw phenomenon in multitask learning. To solve the negative transfer problem, on the one hand, the PLE model splits the experts into shared experts and private experts; on the other hand, it divides the sample space by the loss function. Wang et al. [8] propose a Multitask-Aware Fairness (MTAF) method to improve fairness in multitask learning. Xi et al. [9] propose an Adaptive Information Transfer Multitask (AITM) framework, which constructs the sequential dependence among multistep conversions by the Adaptive Information Transfer (AIT) module. The low-rank decomposed self-attention network (LightSAN) [10] learns the context-aware representation via users' history items and mines sequential relations among items efficiently. Gating-Enhanced Multitask Neural Networks (GemNN) [11] design a gating mechanism between the embedding layer and the MLP, which learns feature interaction and manages information flow. The Multiple-Level Sparse Sharing Model (MSSM) [12] includes a field-level sparse connection module (FSCM) and a cell-level sparse sharing module (CSSM); the FSCM can learn features selectively and the CSSM can share knowledge across all tasks efficiently.
To resolve the selection bias and data sparsity issues, Hierarchically Modelling both Micro and Macro behaviour (HM^3) [13] is proposed for CVR prediction, which employs micro and macro post-click behaviours in a multitask learning mode. Zhao et al. [14] propose a multiple relational attention network, which employs the attention mechanism to improve prediction accuracy. The model structure comes from three perspectives: the first is task and feature, the second is feature and feature, and the third is task and task. In recommendation systems, the Pareto algorithm is applied to multiobjective learning, which can make at least one objective better without harming the other objectives. The loss function refers to the KKT condition and the relaxed constraints, and the model updates the weights at each batch. With the idea of knowledge distillation, Tang et al. [15] propose a novel model, which employs dominant features to guide multitask learning. The feature matching algorithm combines the original features and the dominant features, mapping them to a new hidden space and improving the efficiency of multitask information sharing. Wang et al. [16] propose a new model to improve the relation extraction algorithm.
The embedding layer represents sharing information, which uses the Bidirectional Encoder Representations from Transformers (BERT) pretrained model as the initial computing part. The model introduces knowledge distillation to better use the information of auxiliary tasks. Based on the multitask learning framework, Shao et al. [17] introduce an attention map convolutional layer to mine the bilateral high-order feature graph from users and commodities. The model can dynamically capture the users' implicit interest in commodities. Yao et al. [18] propose a strong aggregation multitask learning method, which can group tasks by learning representation vectors. This method assumes that one task is a linear combination of other tasks, and the correlation between tasks is calculated through a statistical coefficient. Based on the knowledge graph, Yu et al. [19] propose a multitask feature learning method, which uses the knowledge graph to calculate the embedding vectors that finally assist the recommendation task. Conversation recommendation is becoming an important part of e-commerce. In order to improve the prediction effect via mining sequence features, Chen et al. [20] employ the graph structure cascade and node sequence diffusion. The model proposes a shared representation layer, which helps to understand the cascading relationship of tasks. The sequence knowledge is learned from the shared representation layer, which can encode the cascade structure and sequence nodes well. Most multitask models build networks through multilayer feature sharing.
However, the above studies in multitask learning are based on feature engineering and knowledge representation, without introducing multi-perspective attention. We integrate coarse-grained attention, fine-grained attention, the boosting expert mode, and expert-level self-attention, so that different task experts can interact better. The rest of this paper is organized as follows. Section 2 introduces the application of recommendation systems in academia and industry. Section 3 discusses the recall stage, the ranking stage, and the diversity stage in the recommender system, and describes our specific improvement methods. Section 4 presents experiments on a public data set and compares against the baselines. Section 5 draws conclusions and proposes prospects.
The main contributions of our proposed model are summarized as follows: (1) We introduce coarse-grained attention and fine-grained attention in the gate network. Each task layer learns a query vector for each expert; the inner product of the query vector and the expert output is taken, and the result is regarded as the attention weight. The gate attention methods achieve better performance than the base MMOE. (2) Inspired by the fact that the gradient boosting tree is better than the random forest, we design the gradient boosting expert network, which enhances the interaction among different experts. (3) To the best of our knowledge, we are the first to introduce expert-level multi-head self-attention into multitask learning, and we obtain better effectiveness.

Related Work
2.1. Multitask Learning Architecture. In the deep neural network, the click task and the order task are weighted in different proportions, and then they are processed as positive samples. With the idea of a single-task model, it is difficult to find a trade-off between the click and order tasks: the model pays more attention to a certain part, so the learned information may deviate from the original sample distribution. In addition, single-task processing ignores some information, which contains rich correlation among tasks. Multitask learning optimizes multiple targets at the same time: the shared parameters learn the correlation, while each subtask learns the differences of the sample distribution. In this way, we improve the generalization ability of the model.
As is known to all, most multitask learning networks have a feature parameter sharing module, which is divided into hard sharing and soft sharing. The hard sharing feature is constructed at the bottom layer and completely shared; the upper layer introduces different networks so as to predict their respective tasks. When the tasks are more relevant, hard sharing is much more effective. Negative transfer occurs when tasks are not very relevant: if the effect of one task increases, the effect of another task decreases. To solve this problem, Google proposes the MMOE model, which constructs a gate control mechanism for each task and brings better effects. Tencent proposes the PLE model, which introduces multiple layers of shared experts and private experts to further resolve the heterogeneous relationship between tasks. The structure of the MMOE model is shown in Figure 1.
The output of task k is calculated as follows:

y_k = h^k( Σ_{i=1}^{n} g^k(x)_i f_i(x) ),  (1)

where Σ_{i=1}^{n} g^k(x)_i = 1, and g^k(x)_i represents the output logit of the gating network at the ith expert, which is used to calculate the weight of the ith expert. f_i(x) denotes the ith expert network, and h^k(·) means the hidden layer of task k. Furthermore, the gate network equation is as follows:

g^k(x) = softmax(W_{g,k} x),  (2)

where W_{g,k} is the trainable parameter matrix of the gate network for task k.

Expert Network Part
Step 1. Construct a neural network for each expert and get the output y.
where X means the input features, whose shape is [batch size, feature size], and hidden 1 and hidden 2 denote the two hidden layers of the expert network.

Step 2. Build a list of expert outputs, which is used to store the output of each expert.
Step 3. In the last dimension of the expert outputs, we use the flatten operation to stack the y; then we store it as a tensor. The tensor shape is [batch size, units of the second hidden layer, the number of experts].

Gate Network Part
Step 1. Construct a neural network for each gate and get the gate output y.
where X means the input features, whose shape is [batch size, feature size], and hidden 1 and hidden 2 denote the two hidden layers of the gate network.

Step 2. Construct a gate dictionary named gates output, whose key is the task name and whose value is the output y of the last gate network layer.
Step 3. Convert the gate output into weights: y is expanded on axis index 1 and then copied along the number of neurons in the last layer of the expert, giving the weights matrix. The shape of the weights matrix is [batch size, units of the second expert hidden layer, units of the second gate hidden layer].
Step 4. Using the expert output and the gating weights, we calculate the tensor that is connected to the tower. The expert output after stacking and the weights after dimension expansion have the same shape. Taking the element-wise product, we get a tensor with shape [batch size, units of the second expert hidden layer, units of the second gate hidden layer]. We then do a reduce-sum operation in the last dimension, which calculates the final expert gate output. The shape is [batch size, units of the second expert hidden layer].
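The steps above can be sketched in numpy as follows. This is a minimal sketch: the single-layer experts and gates, the ReLU activation, and the layer sizes are illustrative assumptions, not the exact configuration of the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmoe_forward(X, expert_ws, gate_ws):
    """Combine expert outputs with per-task softmax gates (MMOE forward pass)."""
    # Each expert maps [batch, features] -> [batch, units]; stack on a new last axis.
    experts = np.stack([np.maximum(X @ W, 0.0) for W in expert_ws], axis=-1)
    outputs = {}
    for task, Wg in gate_ws.items():
        gate = softmax(X @ Wg)                 # [batch, n_experts], rows sum to 1
        weighted = experts * gate[:, None, :]  # broadcast gate over the units axis
        outputs[task] = weighted.sum(axis=-1)  # reduce-sum -> [batch, units]
    return outputs
```

Each task's tower then consumes its `[batch, units]` tensor.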

Multitask Learning in Recommendation.
In the recommendation scenario, the parameters that can be tuned for multitask learning mainly include the following: (1) Label weight: similar to the class weight in the deep neural network configuration, it controls the sample ratio of each label. (2) Loss weight: the weight of the loss function for each task. This parameter needs to be adjusted over multiple rounds, and then the optimal combination is selected. (3) Export weight: the weight for the predicted score of each task, which can be set higher for the better-performing task based on the test result.

The Proposed Scheme
We think there are two parts where MMOE can be improved. The first is how experts share parameters with each other and how to add attention mechanisms effectively. The second is the design of the loss function and how to balance the learning of different tasks.

Coarse-Grained Attention Gate
Network. In the MMOE model, the gate network is a linear transformation, which learns parameters from the original features. The expression ability of the gate is insufficient. We use the attention mechanism to calculate the model weights, which are updated as the model trains. We improve the calculation of the original gate network from a linear transformation to an inner product operator.
Guided by the experts, the model weights are constructed, so the design of the gate network introduces the prior knowledge of the experts. From the view of the expert neuron dimension, the output of each neuron is different, so attention is added in the neuron dimension. We add weights from the gate control perspective and change the gate attention mechanism. We make the improvements on the basis of MMOE, as shown in Figure 4.
The gate improvement part is shown in Figure 4. MMOE calculates the weights of different experts by fusing the original feature and the gate net output. Inspired by the attention mechanism, each task layer learns a query vector for each expert network. We take the inner product between the query vector and the expert network output, and then regard the result of the inner product as the attention weight of the task's corresponding expert. The improvement scheme is expressed in the following formulas:

e_i = f_e(X; w_{e1}, w_{e2}, b_e),

where X represents the original input, w_{e1} and w_{e2} denote the matrix parameters of the expert network, b_e is the bias of the expert network, and f_e(·) represents the transformation function from the original input to the expert vector.

e_g = σ(w_g X + b_g),

where w_g is the parameter of the gate network, e_g is the query vector of the initialized gate network, b_g is the bias of the gate network, and σ represents the mapping operator. The attention weight of the ith expert is then

a_i = h(e_g) ⊙ t(e_i),

where h and t denote transformation functions and ⊙ means the inner product operation.
The improved gate attention is more closely associated with expert matching and more specific to the task representation.

The Expert Part
Step 1. Build a neural network for each expert and get the output y.
Step 2. Build a list of expert outputs, which stores the result of each expert.
Step 3. Stack the experts output in the last dimension, and the tensor shape is [N, 128, 8].

The Gate Network Improvement Part
Step 1. Build a neural network for each gate. The gate has one layer with a shape of [1, 128], in which 128 is the number of neurons in the last layer of the MMOE expert units.

Step 2. Store the gate outputs of each task in the dictionary named gates output.
Step 4. We combine the gates and the experts by the attention mechanism. We obtain the initial query vector with the shape of [N, 8, 128], and then aggregate it using the reduce-sum function in the last dimension. We get the attention dot tensor with the shape of [N, 8].
Step 5. By expanding and copying the attention dot tensor, we calculate the expert weights with the shape of [N, 128, 8]. The shape of the weights and the shape of the experts are the same.

Step 6. We apply the weights to the experts and calculate the final output with the shape of [N, 128].
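The six steps above can be condensed into a short numpy sketch. The shapes [N, 128, 8] and [8, 128] follow the steps; the softmax normalization of the attention dots is an assumption of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def coarse_attention_gate(experts, queries):
    """experts: [N, units, n_experts]; queries: [n_experts, units],
    one learned query vector per expert ([N, 128, 8] and [8, 128] in the paper)."""
    dots = np.einsum('nue,eu->ne', experts, queries)  # inner products -> [N, n_experts]
    att = softmax(dots)                               # attention weights over experts
    return (experts * att[:, None, :]).sum(axis=-1)   # weighted sum -> [N, units]
```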
Our main improvement is using the expert information to design a query vector for each gate via the attention mechanism. The fine-grained attention, built on the coarse-grained attention, assigns different weight values in the embedding dimension. The description of the fine-grained attention is given in the following part.

Fine-Grained Attention Gate Network.
In the dimension of the expert neuron and the dimension of the embedding, we employ attention together. In this way, the gate control network is no longer a simple two-layer fully connected network, but the result of combining the initial gate with the experts by the attention mechanism. The model learns a fine-grained query vector for each task.

Expert Network Part.
It is the same as the expert network part of MMOE coarse-grained attention.

Attention Gate Network Part.
The coarse-grained attention constructs the neural network for each gate with the shape of [1, 128], and then the gate network and the experts with the shape of [N, 8, 128] perform a multiplication operation. We design a query network for each task with the shape of [N, 8, 128], in which the 128 dimensions are different while the 8 dimensions are the same. The fine-grained attention is different in both the 128 dimensions and the 8 dimensions, which can better adapt to tasks with different correlations.
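A sketch of the fine-grained variant, assuming an element-wise score between each expert and its per-dimension query, followed by a softmax over the expert axis; the exact normalization is not specified in the text and is an assumption here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fine_attention_gate(experts, queries):
    """experts and queries both [N, units, n_experts]; the attention weight
    differs per embedding dimension as well as per expert."""
    att = softmax(experts * queries, axis=-1)  # element-wise scores, softmax over experts
    return (experts * att).sum(axis=-1)        # [N, units]
```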

Gradient Boosting Expert
Network. In the MMOE model, the experts can be regarded as a random forest. In order to make different experts interact better, we improve the expert mode from random forest to gradient boosting decision tree. We construct an expert list named hub-list, which is used to store the output of each expert. When the hub-list is traversed, the information is appended at the end of the list. If there is no element in the expert hub center, we feed the previously extracted features into the neural network. If there are elements in the expert hub center, we feed the last layer of the expert hub joined with the previously extracted features into the neural network. The idea of improving the random forest to the gradient boosting tree mainly occurs in the expert part.

Expert Network Improvement Part.
We set up an expert output list, which is used to store the prediction score of each expert. If it is the first expert, the received input is the original features. If it is a latter expert, the received input is the original features and the prediction value of the former expert. In this way, it is equivalent to increasing the number of feature columns, which, given the construction of the neural network, has no effect on the shape of the final expert outputs.
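The hub-list mechanism can be sketched as follows; the expert function passed in is hypothetical, and in practice each expert would be a trainable subnetwork that accepts the widened input.

```python
import numpy as np

def boosting_experts(X, expert_fns):
    """Chain experts in a gradient-boosting style: each later expert receives the
    raw features concatenated with the previous expert's output (hub-list idea)."""
    hub = []                                             # stores every expert's output
    for f in expert_fns:
        if not hub:
            inp = X                                      # first expert: raw features only
        else:
            inp = np.concatenate([X, hub[-1]], axis=1)   # raw features + previous output
        hub.append(f(inp))
    return hub
```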

Explicit Self-Attention Expert
Interaction. In the paper [21], the method of self-attention is used to interact among different features. Drawing lessons from this idea, we regard the outputs of different experts as abstract high-level features and design an interactive network layer. As shown in Figure 5, on the basis of MMOE, we add an expert interaction layer using a multi-head attention mechanism. The output after the interaction is used as a high-order feature. We employ an inner product operation between the expert output and the high-order feature, and feed the results into the tower network of each task. By automatic interaction, knowledge can be learned from the experts to mine user interests better.
Specifically, we adopt the key-value attention mechanism to capture the combination among different experts. Taking expert m as an example, we define the correlation between expert m and expert k under a specific attention head h as follows:

α^(h)_{m,k} = exp(f_h(e_m, e_k)) / Σ_{l} exp(f_h(e_m, e_l)),

where f_h(·) is an attention function, e_m denotes expert m, and e_k denotes expert k. In this work, we employ the inner product as the attention function:

f_h(e_m, e_k) = ⟨W^(h)_query e_m, W^(h)_key e_k⟩,

where W^(h)_query and W^(h)_key are transformation matrices that map the original expert space into a new space. The representation of expert m under head h is then

e^(h)_m = Σ_{k} α^(h)_{m,k} (W^(h)_value e_k),

where W^(h)_value is the value-space matrix. Furthermore, we concatenate the h heads as the expert output.
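A single-head sketch of the expert-level self-attention described above; the scaling factor and multi-head concatenation are omitted for brevity, and the projection sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def expert_self_attention(E, Wq, Wk, Wv):
    """Key-value attention among expert outputs E: [N, n_experts, d].
    Multi-head is obtained by repeating with separate matrices and concatenating."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv   # project experts into query/key/value spaces
    scores = Q @ K.transpose(0, 2, 1)  # pairwise expert correlations f_h(e_m, e_k)
    alpha = softmax(scores, axis=-1)   # normalize over the interacting experts
    return alpha @ V                   # high-order expert representations
```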
We also tried introducing feature-level multi-head self-attention into the feature engineering and then feeding the result to the expert network. The result is worse than the expert-level mode, so we choose the better one.

Deep Interest Sequence Feature Applied into Multitask
Learning. The improved MMOE_DIN model introduces the sequence feature into the bottom layer. The sequence feature can better capture the correlation of user behaviour. The underlying features are processed in the way of the deep interest network. On the basis of the user sequence features, we design the embeddings, which represent spatial information and time information. The spatial information embedding method is shown in Figure 6. The time information embedding method is shown in Figure 7.
We normalize the timestamp as days and apply several mathematical operations, including the exponential function, sine function, cosine function, root, square, and logarithm. We then concatenate the results into a large embedding vector.
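A sketch of the time-embedding transforms; the normalization constants and the bounded (decaying) form of the exponential are assumptions, since the exact scaling is not given in the text.

```python
import numpy as np

def time_embedding(timestamp_seconds):
    """Normalize a timestamp to days and apply the listed transforms
    (exp, sin, cos, root, square, log), concatenated into one vector."""
    d = np.atleast_1d(np.asarray(timestamp_seconds, dtype=float)) / 86400.0
    feats = [np.exp(-d / 365.0),   # bounded exponential decay (assumed scale)
             np.sin(d), np.cos(d),
             np.sqrt(d), d ** 2, np.log1p(d)]
    return np.concatenate([f[..., None] for f in feats], axis=-1)
```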

Improve Loss Function with Multitask Learning.
Recently, artificial intelligence is gradually developing from perceptual intelligence to cognitive intelligence. Deep learning is the mainstream technology in the ranking stage of recommendation systems. More and more scholars [22,23] try to introduce cognitive intelligence into recommendation. Recommendation systems have multiple scenarios, and the data is heterogeneous. Traditional multitask learning joint training requires the data features to be aligned. To combine heterogeneous data from multiple scenarios to train the model, we propose a feature space mapping operator, which can project the heterogeneous data into the same feature space via multiple network layers. From the perspective of cognitive intelligence, it is easier for multiple experts to share collective wisdom in the same feature space. The data cognition fusion scheme is shown in Figure 8. For the cognitive learning of multitask shared parameters, we design a custom loss function. In the learning process, the features extracted from the current data source are regarded as real data, and their labels are set as real labels. The features extracted from the other data sources are regarded as fake data, and the corresponding labels are set as fake labels. In this way, with the multisource feature iteration training in multitask learning, the discriminator finds it difficult to distinguish the shared data sources, so as to achieve the shared cognitive effect. The multitask learning model performs feature space mapping for the data from different sources so that the multisource data lie in the same feature space. We construct the following cognitive loss function, where c_i^k is the real or fake label, and add it to the basic loss function.
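A sketch of adding the source-discrimination (real/fake) cross-entropy term to the basic task loss, assuming the discriminator outputs a probability per sample; the weighting factor `lam` and the averaging are hypothetical choices of this sketch, not the paper's exact formulation.

```python
import math

def cognitive_loss(base_loss, source_probs, source_labels, lam=0.1):
    """base_loss: the basic task loss; source_probs: discriminator outputs;
    source_labels: c_i^k, 1 for the current ("real") source, 0 for other
    ("fake") sources. Returns the combined cognitive loss."""
    disc = 0.0
    for p, c in zip(source_probs, source_labels):
        disc -= c * math.log(p) + (1 - c) * math.log(1 - p)  # cross-entropy term
    return base_loss + lam * disc / len(source_labels)
```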

Experiment
In this section, we evaluate the performance of our proposed novel model on the public Ali-CCP data. Experimental comparison shows the effectiveness of our model, which outperforms the state-of-the-art methods for multitask learning. The labels consist of a click label and a conversion label. The features consist of the feature field id, feature id, and feature value. Features include user features, item features, combination features, and context features. The detailed data instructions are on the page below (https://tianchi.aliyun.com/dataset/dataDetail?dataId=408&userId=1). We randomly select 10% of the train dataset as the validation dataset to test the evaluation indexes of all models.

Baseline Models.
We compare our proposed model with the following baseline and mainstream models. MLP [24]: we use the Multi-Layer Perceptron structure as our baseline, which is a single-task model. Shared Bottom [25]: the model with the Expert-Bottom pattern shares several low-level network layers for all the tasks, and each task has its own tower. ESMM [4,26]: the model with the Probability-Transfer pattern is used to predict the post-click conversion rate, which can relieve the sample selection bias problem via training on the entire space. OMOE [2]: the model with the Expert-Bottom pattern integrates experts by sharing one gate among all tasks. MMOE [2]: the model with the Expert-Bottom pattern integrates experts by multiple gates among all tasks. CGC [7]: the model with the Expert-Bottom pattern separates task-shared experts and task-specific experts, which is designed to solve the multitask negative transfer problem. PLE [7]: the Progressive Layered Extraction model with the Expert-Bottom pattern, made up of multilayer CGC.

Figure 9: The AUC of different embedding dimensions.

Using the Ali-CCP dataset, we adopt a two-layer MLP network with the DICE activation and hidden layers for each task in the MTL models. The tuned hyperparameters are shown in Table 1.

Experiment Setting
Hyperparameter Study. In order to study the effectiveness of hyperparameters, we try the random search, grid search, and annealing methods.
(1) Considering the category embedding dimension, we experiment by varying the embedding dimension over [8, 16, 32, 64, 128, 256, 512, 1024], and the results are shown in Figure 9. We can see that the effects of the model are slightly affected by the embedding dimension. The embedding dimension is related to the model complexity and volume: a smaller embedding dimension fits the data distribution insufficiently, while a larger embedding dimension increases the model complexity; a proper embedding dimension produces the best effect. Making a trade-off between fitting ability and complexity, we finally select embedding dimension = 32 in all the experiments. (2) We study the impact of the export weight; there is a seesaw phenomenon between the two tasks, but the export weight improves the overall performance. We finally set the export weight of the click task as 0.8 and that of the order task as 0.2. (3) We study the impact of the epoch number and report the AUC performance on the entire test dataset, as shown in Figure 10. We finally set the epoch number as 5 in all the experiments. (4) We study the number of layers in our proposed model; as the number of neural network layers increases, the AUC first increases and then decreases, and the log-loss shows the opposite trend. Therefore, we finally choose 3 layers in all the experiments, as shown in Figure 11.

Experiment Results.
Compared with the baselines MMOE, ESMM, and CGC, we demonstrate the effectiveness of our approach on the Ali-CCP public dataset. We show that the proposed method improves the accuracy of multitask models, and the offline evaluation of our model brings significant improvement. In order to obtain accurate prediction results, we repeat the experiments 5 times for each model, among which the best offline effect is shown in Table 2.
To evaluate the effectiveness of our proposed model, we adopt four widely used metrics in experiments, i.e., AUC, Log-loss, CLICK@2, and ORDER@2.
AUC: area under the curve, which reflects the ranking ability. The score ranges from 0 to 1, and the higher the better. The AUC formula is as follows:

AUC = (1 / (|D^+| |D^-|)) Σ_{x^+ ∈ D^+} Σ_{x^- ∈ D^-} I(f(x^+) > f(x^-)),

where D^+ and D^- denote the sets of positive and negative samples, |D^+| and |D^-| mean the numbers of samples in D^+ and D^-, f(·) is the prediction function, and I(·) is the indicator function.
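The AUC formula can be computed directly by pairwise comparison; counting ties as 0.5 is a common convention added here.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positive and negative samples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    hits = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return hits / (len(pos) * len(neg))
```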
Log-Loss. In multitask learning, a common equation of the joint log-loss is the weighted sum of the individual task log-losses:
L(θ) = Σ_{k=1}^{K} w_k L_k(θ_k),

where K is the number of tasks, L_k(·) is the loss function of task k, w_k is the loss weight, and θ_k denotes the task parameters. For each task, we use the sigmoid cross-entropy loss:

L_k = -( y_k log(sigmoid(ŷ_k)) + (1 - y_k) log(1 - sigmoid(ŷ_k)) ),

where y_k denotes the real label, ŷ_k denotes the predicted value, and sigmoid is the activation function.
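A direct implementation of the weighted joint loss with per-task sigmoid cross-entropy, for a single sample:

```python
import math

def joint_log_loss(y_true, logits, loss_weights):
    """Weighted sum of per-task sigmoid cross-entropy losses."""
    total = 0.0
    for y, z, w in zip(y_true, logits, loss_weights):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid activation
        total += w * -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total
```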
CLICK@2. It is the proportion of actual clicks among the top N predicted scores.

ORDER@2. It is the proportion of actual purchases among the top N predicted scores:

ORDER@2 = n / N,

where n denotes the number of real click/purchase samples in the top N scores, and N equals 2 in our paper. In order to reduce the accidental error of the experiment, we repeat the training process of each improved model 5 times. Table 3 shows the average increase over the 5 runs for each model.
Custom evaluation indicators: in order to compare model effects more fairly, we evaluate models from multiple perspectives. Besides AUC, we customize two categories of offline evaluation indicators: CLICK@N and ORDER@N.

CLICK@N: among the top N commodities recommended by the model, the proportion of the commodities which the users click.

ORDER@N: among the top N commodities recommended by the model, the proportion of the commodities which the users purchase.
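Both indicators reduce to the same top-N computation, differing only in which label (click or purchase) is passed in:

```python
def top_n_rate(labels, scores, n=2):
    """Proportion of positives among the top-n scored items (CLICK@N / ORDER@N)."""
    top = sorted(zip(scores, labels), reverse=True)[:n]  # highest-scored n items
    return sum(y for _, y in top) / n
```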
In order to reduce the accidental error of the experiment, we repeated the training process of each improved model 5 times. Table 4 shows the average of the 5 custom evaluations for each model. From all the above tables, we can see that our methods bring positive improvements. From Tables 2-4, compared with the base MMOE, every proposed point shows improvement. The sequence feature brings +3.65% in AUC owing to the feature engineering improvement. Coarse-grained attention brings +3.41% in AUC, and fine-grained attention brings +2.11% in AUC. Coarse-grained and fine-grained are two patterns of attention methods; we choose the coarse-grained one, considering that the fine-grained attention may lead to overfitting. The boosting expert mode and the auto interaction layer mode both describe the expert interaction, and we select the auto interaction layer because it performs better. Furthermore, we improve the loss function to better support multisource dataset feeding, making the model structure more generic. Finally, we integrate the above four methods, and the prediction effect is significantly improved. The CLICK@2 and ORDER@2 of each model are shown in Figure 12. The experiment is repeated 5 times and the error fluctuation is small. It can be seen that our new integrated model has the best effect.

Conclusions
In this paper, we propose five improvement methods for multitask learning, which focus on the expert interaction and the gate attention mechanism. On the public data set, there is a significant improvement compared with the MMOE model. We optimize the gate network by introducing the coarse-grained and fine-grained attention mechanisms. With only a linear transformation on the original input, the gate network of the native MMOE has insufficient expression ability. We calculate the weights of the gate using the attention mechanism, upgrading the calculation of the gate network from a linear transformation to multiple matrix inner product operations. We introduce the gradient boosting idea into the MMOE experts, which improves both the knowledge representation and the efficiency of mutual communication and reasoning. Multi-head attention is applied on the expert feature extraction layer, which can represent high-order features better. In addition, we fuse the sequence DIN and MMOE, which makes the multitask learning consider the relevance of features.
In future work, we will further introduce cognitive intelligence into multitask learning. Cognitive intelligence can give full play to the wisdom of experts. Expert systems based on frames and expert systems based on models can be regarded as different experts in the multitask learning algorithm. We will build a broader recommendation system, which uses multiple experts and multiple tasks to work collaboratively.

Data Availability
The Ali-CCP public dataset has been used in the experiments. Ali-CCP is a public dataset containing 84 million samples extracted from Taobao's recommender system. CTR and CVR (conversion rate) are two tasks modeling the actions of click and purchase in the dataset. The dataset URL is https://tianchi.aliyun.com/dataset/dataDetail?dataId=408.

Conflicts of Interest
There are no conflicts of interest regarding the publication of this paper.
