Multifeedback Behavior-Based Interest Modeling Network for Adaptive Click-Through Rate Prediction

With the rapid development of the Internet, recommendation systems are becoming increasingly important in people's lives. Click-through rate (CTR) prediction is a crucial task in a recommendation system and directly determines its effect. Recently, researchers have found that considering the user behavior sequence can greatly improve the accuracy of click-through rate prediction models. However, existing prediction models usually use only the user click behavior sequence as input, which makes it difficult for the model to obtain a comprehensive user interest representation. In this paper, a unified multitype user behavior sequence modeling framework named MBIN (multifeedback behavior-based interest modeling network) is proposed to cope with uncertainties in the noisy data. The proposed adaptive model uses deep learning technology: it obtains user interest representations through multihead attention, denoises them using the vector projection method, and fuses them using adaptive dropout. First, an interest denoising layer is proposed in MBIN, which can effectively mitigate the noise problem in user behavior sequences to obtain more accurate user interests. Second, an interest fusion layer is introduced to effectively model and fuse the various types of user interest representations and achieve personalized interest fusion. Then, we use auxiliary losses based on behavior sequences to enhance the effect of behavior sequence modeling and improve the effectiveness of user interest characterization. Finally, we conduct extensive experiments on a real-world, large-scale dataset to validate the effectiveness of our approach in CTR prediction tasks.


Introduction
With the gradual development of the Internet, users are often confronted with a huge amount of information. In order to reduce information overload and meet users' diverse online service needs (e.g., e-commerce, short videos, and news), personalized recommendation systems have become increasingly important. After decades of development, industrial recommendation systems have now basically formed a two-stage funnel-type structure consisting of a recall phase and a ranking phase [1][2][3][4]. Click-through rate prediction, as the most central module in the ranking phase, usually directly determines the recommendation results and plays a significant role in user satisfaction and user dwell time [4,5]. In addition, online advertising is the core product of major Internet companies, and their advertising revenue is directly related to the accuracy of click-through rate estimation [1,2]. Therefore, click-through rate prediction, as a common task in recommendation systems, has been widely studied in industry and has also attracted the attention of many scholars in academia.
In recent years, inspired by the application of deep learning in fields such as computer vision and natural language processing, more and more work has introduced deep learning into click-through rate prediction models with remarkable results. At present, click-through rate prediction models based on deep learning can be broadly classified into the following categories: (1) feature interactions: interactions between features are realized with the help of deep learning techniques [6][7][8][9], as in Wide&Deep [10], AutoInt [11], and other models that use deep learning methods to extract feature interactions and thereby improve prediction accuracy; (2) user behavior sequence modeling: temporal modeling modules such as RNNs and the Transformer are used to model user behavior sequences [1,2,4,12], from which more accurate representations of user interests are extracted, thus improving the effectiveness of the model. In this paper, we focus on user behavior sequence modeling. The difficulty lies in how to extract more accurate user interests from real user behavior data and how to improve the model prediction based on user interest representations.
User behavior sequences usually reflect users' interests [4,12,13]. Mining and using these sequences is usually the most important module in click-through rate prediction and even in the whole recommendation system, which makes modeling based on users' historical behaviors a very important research direction. In recent years, many studies have attempted to model user behavior sequences [1,2,4], and some progress has been made. However, the existing works have only used historical user click behavior sequences for modeling [1,2,4], without considering other types of user behavior sequences. Recently, researchers have pointed out that user click behavior sequences alone are not sufficient to accurately extract user interest representations, and that the behavior sequences of user disinterested feedback also represent user preferences to some extent [13,14]. In most cases, user interaction data contain multiple feedback types, which generally include implicit positive feedback (e.g., click), implicit negative feedback (e.g., no click), and explicit positive feedback (e.g., like and buy), all of which are important for characterizing users' unbiased interests [13,14]. Therefore, how to model the behavior sequences of multiple feedback types is an important concern for the next generation of click-through rate prediction models based on user behavior sequence modeling.
In this study, the feedback data of users are divided into two categories: (1) implicit feedback: feedback behaviors that reflect users' interest or disinterest only to some extent, with more but less accurate data, including click and no-click feedback; (2) explicit feedback: feedback behaviors that reflect users' interest or disinterest very accurately, with less data, including rating, marking (likes and dislikes), and other behaviors.
Implicit feedback usually contains a large amount of noise; e.g., the user may accidentally click on an item that is not of interest, or may not click on an item of interest because of scrolling too fast. Therefore, how to denoise implicit feedback data is a key issue in improving the effectiveness of implicit feedback behaviors. Existing works on multitype behavior sequences [13,14] only consider the modeling of multitype behavior sequences. They consider neither the noise problem in user behavior sequences nor the fusion problem of multitype behavior sequences, which prevents the model from achieving the best results. Based on the above observations, this paper proposes a new model that uses multitype user behavior feedback, thus improving the accuracy of the click-through rate prediction task. Specifically, in order to mitigate the impact of noisy behaviors in the user implicit feedback behavior sequences, this paper introduces the user explicit behavior sequences into the click-through rate prediction model and uses the explicit feedback behavior sequences to denoise the implicit feedback behavior sequences, thus improving the effectiveness of user interest modeling. In addition, this paper designs a user interest interaction module to realize the adaptive fusion of user interest representations. The effectiveness of the proposed method is demonstrated by extensive experiments on a large-scale real dataset, and it is verified that the method can be extended to other models that require user behavior sequence modeling.
The main contributions of this work are as follows:
(i) A new framework for modeling user cancel-click behavior, MBIN, is proposed, which effectively combines the respective types of user behavior and enriches the modeling of user interest representation.
(ii) We propose an interest denoising layer that can effectively mitigate the noise problem in user behavior sequences and extract accurate user interests from implicit negative feedback.
(iii) We propose an interest fusion layer that can effectively model and fuse various types of interest representations of users to achieve personalized interest fusion.
(iv) We have conducted numerous experiments on real-world and large-scale datasets to validate the effectiveness of our approach in CTR prediction tasks.

Click-Through Rate Prediction Model for User Behavior Sequence Modeling
The goal of click-through rate prediction models is to predict the real click probability of users on items, usually using deep neural network models based on binary classification. Specifically, such models pass user attribute features (e.g., user ID, gender, and age), item attribute features (e.g., item ID and category), and user behavior features (e.g., user behavior sequences) through embedding layers, model them using structures such as a multilayer perceptron (MLP) with nonlinear activation functions, and finally obtain a prediction score between 0 and 1, which is the model's predicted click-through rate for the item.
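The pipeline described above (embedded features, an MLP with nonlinear activations, and a score between 0 and 1) can be illustrated with a minimal sketch. It is not the implementation used in this paper: the layer sizes, the single hidden layer, the random initialization, and the names `init_params` and `predict_ctr` are all assumptions made for illustration, and the inputs are taken to be already-embedded vectors.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z):
    return max(0.0, z)


def dense(x, weights, bias, activation=None):
    """Fully connected layer; `weights` holds one row per output unit."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [activation(o) for o in out] if activation else out


def init_params(in_dim, hidden_dim, seed=0):
    """Small random weights for a one-hidden-layer MLP (hypothetical sizes)."""
    rng = random.Random(seed)

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": rand_matrix(hidden_dim, in_dim), "b1": [0.0] * hidden_dim,
        "w2": rand_matrix(1, hidden_dim), "b2": [0.0],
    }


def predict_ctr(user_vec, item_vec, behavior_vec, params):
    """Concatenate the embedded feature groups, apply the MLP, squash to (0, 1)."""
    x = list(user_vec) + list(item_vec) + list(behavior_vec)
    hidden = dense(x, params["w1"], params["b1"], relu)
    logit = dense(hidden, params["w2"], params["b2"])[0]
    return sigmoid(logit)


params = init_params(in_dim=6, hidden_dim=4)
p = predict_ctr([0.1, -0.2], [0.3, 0.0], [0.05, 0.4], params)
print(f"predicted CTR: {p:.4f}")
```

In practice the embeddings and MLP weights would be trained jointly; the sketch only shows the forward pass.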

User Behavior Modeling
The basic idea of user behavior sequence modeling is to use deep neural networks to model user historical behavior sequences in order to extract more accurate representations of user interests and thus improve the accuracy of various prediction models, including click-through rate prediction models. Click-through rate prediction models based on user behavior sequence modeling are among the most commonly used models in industry and have also received attention from academia. Commonly used models of this kind include the deep interest network [1] (DIN), the deep interest evolution network [2] (DIEN), and the deep matching and ranking network [15] (DMR). All such models use user historical click behavior sequences for modeling. Most of them add user historical behavior sequence features to the original click-through rate prediction model, use sequence modeling modules such as recurrent neural networks (RNNs) and attention mechanisms (as in the Transformer) to model user historical behavior, concatenate the extracted user interest representation vector with the other feature vectors, and feed the result into the subsequent modules to obtain the final estimated click-through rate.
Usually, user behavior sequence modeling focuses on implicit positive feedback behaviors (clicks), while ignoring the rich but noisy implicit negative feedback behaviors (unclicks), the precise but sparse explicit positive feedback behaviors (likes), and explicit negative feedback (dislikes), which are equally important for a comprehensive and accurate understanding of user interests. At present, very few research works consider users' multiple feedback types in click-through rate prediction models. For example, Xie et al. proposed a novel deep feedback network [16] (DFN) to capture users' unbiased interest through their click, unclick, and dislike feedback. However, the existing methods fail to consider the effect of noise in users' implicit negative feedback on user interest extraction, which makes it difficult for them to achieve the best results. Therefore, this paper proposes the multifeedback behavior-based interest modeling network (MBIN) framework, which can effectively model multiple types of user behavior sequences and improve model prediction by removing the noisy data from users' implicit feedback through the user interest denoising module.

Methodology
We describe the technical details of MBIN in this section. MBIN jointly utilizes multiple types of user historical behavior sequences for CTR estimation, including user implicit positive/negative feedback and explicit positive/negative feedback. Similar to the multifeedback setting in most other scenarios, we set click, unclick, and like behavior sequences as implicit positive, implicit negative, and explicit positive feedback, respectively. Figure 1 shows the overall architecture of MBIN. First, we present the interest denoising layer and the explicit interest contrastive layer, which are designed to handle the challenge of removing noise from implicit negative feedback (unclick) and obtaining adequate explicit positive feedback supervisory signals for long-tail items. Furthermore, we construct a user interest fusion layer to get user interests from behavior sequences and realize the adaptive fusion of various feedback embeddings. The details of each component are as follows.

Embedding Layer.
The input of MBIN can be divided into three groups: user profile, target item, and three categories of user behavior sequences. The user profile is represented by P and includes features such as user id, age, and gender. We denote the target item by I, which refers to the candidate item for prediction and includes item id and brand id. In this paper, we model user multifeedback historical behavior sequences, which are lists of items and contain the click sequence C, the unclick sequence U, and the like sequence L, where T is the maximum sequence length input to the model.
Since the input features above are categorical, we use the embedding layer to transform all features into low-dimensional dense features. We denote $e_p$ and $e_i$ as the embedding vectors of the user profile and target item, respectively. Simultaneously, we project C, U, and L into the embedding space and obtain the corresponding low-dimensional embeddings $e_c$, $e_u$, and $e_l$. It is worth mentioning that the target item and the three different behavior sequences use independent embedding layers, which usually achieves better performance [1].
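The embedding step can be sketched as a set of independent lookup tables, one for the target item and one per behavior sequence. This is a rough illustration only: the vocabulary size, the embedding dimension, zero-padding up to T, and the assumption that each id list is ordered from oldest to newest are choices made here, not details given in the text.

```python
import random


def make_embedding_table(vocab_size, dim, seed):
    """Randomly initialised lookup table: one dense vector per categorical id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed_sequence(item_ids, table, max_len):
    """Keep the most recent `max_len` ids, embed them, and zero-pad to `max_len`.

    Returns the padded list of vectors and the true (unpadded) length.
    Assumes `item_ids` is ordered from oldest to newest and `max_len` >= 1.
    """
    recent = item_ids[-max_len:]
    dim = len(table[0])
    vectors = [list(table[i]) for i in recent]
    vectors += [[0.0] * dim for _ in range(max_len - len(recent))]
    return vectors, len(recent)


DIM, VOCAB, T = 4, 10, 5  # hypothetical sizes for the example

# Independent tables: the same item id maps to a different vector in each table.
tables = {name: make_embedding_table(VOCAB, DIM, seed)
          for seed, name in enumerate(["target", "click", "unclick", "like"])}

e_i = tables["target"][3]
e_c, len_c = embed_sequence([1, 3, 7], tables["click"], T)
e_u, len_u = embed_sequence([2, 3, 4, 5, 6, 8, 9], tables["unclick"], T)
print(len(e_c), len_c, len_u)
```

Returning the true length alongside the padded sequence is convenient because later stages (masking, length-dependent dropout) need it.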

Interest Extraction Layer.
User historical behavior sequences can reflect users' interest tendencies, and modeling the behavior sequences can effectively capture users' interest representations, thus improving the accuracy of model prediction.
Specifically, this module mainly uses multihead target attention (MHTA) [3] to extract interests from four different user behavior representations. The clicked, unclicked, liked, and disliked behavior representations are used as inputs to MHTA, which yields the clicked interest representation $I_c$, the unclicked interest representation $I_u$, the liked interest representation $I_l$, and the disliked interest representation $I_d$. Taking the clicked behavior as an example, the target item embedding acts as the attention query over the clicked behavior embeddings, which produces $I_c$. Similarly, the unclicked interest representation $I_u$, the liked interest representation $I_l$, and the disliked interest representation $I_d$ can be obtained.
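To make the idea of target attention concrete, the following is a simplified sketch in which the target item embedding is the query and the behavior embeddings are both keys and values. It is an assumption-laden illustration: the learned query/key/value projection matrices of a real multihead attention layer are omitted, padding masks are ignored, and heads are formed by simply slicing the embedding dimension.

```python
import math


def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_attention(query, keys, values):
    """Scaled dot-product attention with a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def multihead_target_attention(target, behaviors, num_heads):
    """Split the embedding into `num_heads` slices, attend per slice, concatenate."""
    dim = len(target)
    if dim % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    interest = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_behaviors = [b[lo:hi] for b in behaviors]
        interest.extend(target_attention(target[lo:hi], head_behaviors, head_behaviors))
    return interest


target_item = [0.5, -0.1, 0.3, 0.2]
clicked = [[0.4, 0.0, 0.2, 0.1], [-0.3, 0.2, -0.1, 0.0], [0.6, -0.2, 0.4, 0.3]]
I_c = multihead_target_attention(target_item, clicked, num_heads=2)
print([round(x, 4) for x in I_c])
```

The same function would be applied to the unclicked, liked, and disliked sequences to obtain the other interest representations.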

Interest Denoising Layer.
The clicked interest representation $I_c$ and the unclicked interest representation $I_u$ derived from implicit user feedback contain a lot of noise. The purpose of the denoising layer is to denoise these two representations to obtain cleaner user interest representation vectors.
In this paper, we assume that the noise in the click behavior sequence is caused by the user's accidental clicks on uninteresting items, while the noise in the unclick behavior sequence is caused by the user's failure to click on items of interest. Based on this view, the implicit feedback can be denoised using the explicit feedback from users, i.e., the unclicked interest representation is denoised using the liked interest representation, and the clicked interest representation is denoised using the disliked interest representation. Therefore, two sets of orthogonal mapping pairs, ⟨unclick, like⟩ and ⟨click, dislike⟩, are constructed in this paper, and their corresponding pairs of sequence representations are ⟨$I_u$, $I_l$⟩ and ⟨$I_c$, $I_d$⟩, respectively.
Specifically, the projection of $I_u$ onto the direction of $I_l$ mixes the user's unclicked preference with the liked preference. It is the noise in the unclicked interest representation, because this component matches the user's interest yet appears in the unclicked behavior representation. In other words, removing this projected component yields the denoised unclicked interest representation. Similarly, the denoised clicked interest representation can be obtained by removing from $I_c$ its projection onto $I_d$. The specific formula is as follows:
$$\mathrm{project}(I_u, I_l) = \frac{I_u \cdot I_l}{I_l \cdot I_l}\, I_l, \qquad \hat{I}_u = I_u - \mathrm{project}(I_u, I_l),$$
where $\hat{I}_u$ denotes the denoised unclicked interest representation.
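The projection-based denoising described above amounts to removing from one vector its component along another. A minimal sketch follows, using the standard vector projection; the function names `project` and `denoise` and the handling of an all-zero explicit-feedback vector are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(u, v):
    """Vector projection of u onto the direction of v."""
    denom = dot(v, v)
    if denom == 0.0:
        # No explicit feedback signal: nothing to project onto (assumed behavior).
        return [0.0] * len(u)
    coef = dot(u, v) / denom
    return [coef * x for x in v]


def denoise(implicit, explicit):
    """Remove from the implicit-feedback interest its component along the
    explicit-feedback interest, leaving a vector orthogonal to `explicit`."""
    noise = project(implicit, explicit)
    return [a - b for a, b in zip(implicit, noise)]


I_u = [3.0, 4.0]   # unclicked interest (toy values)
I_l = [1.0, 0.0]   # liked interest (toy values)
print(denoise(I_u, I_l))  # [0.0, 4.0]
```

After denoising, the result has zero dot product with the explicit-feedback vector, which is what makes the pair orthogonal.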

Interest Fusion Layer.
After the interest extraction layer and the interest denoising layer, clean representations of the user's click interest, unclick interest, like interest, and dislike interest are obtained. User behavior is unevenly distributed: some users often mark items as like or dislike, some users rarely mark them, and some users even have very sparse click behavior. Based on this observation, this paper proposes an interest fusion layer based on dynamic dropout [17]. This module can capture the differences in user behavior distribution, effectively fuse user interests, and avoid overfitting the model to a certain kind of behavior representation. The core idea of dropout is to randomly discard a certain proportion of neurons during model training, so as to reduce the risk of overfitting. Because a large difference in user behavior distribution may lead the model to overfit to a certain kind of behavior representation, the interest fusion layer is designed around dynamic dropout. The model feeds the true lengths of the four behavior sequences into a monotonic function to obtain four dropout ratios. The specific implementation of the monotonic function is as follows: where S is the true length of the sequence, $\theta_1$ and $\theta_2$ are the hyperparameters controlling monotonicity and slope, and p(S) is the resulting dropout ratio. Then, the four dropout ratios are taken as the actual dropout probabilities of the four behavior sequences, so as to realize the adaptive fusion of four different types of behavior sequences in the training stage and reduce the impact of overfitting on the model. Specifically, the four clean user interest representations are concatenated after adaptive dropout to obtain the fused user interest representation $I_f$. The calculation formula is as follows:
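The length-dependent dropout and concatenation can be sketched as follows. The exact monotonic function is not reproduced in the text here, so the logistic form in `dropout_ratio` is purely a stand-in assumption: it was chosen only because it is monotonically increasing in the sequence length, with $\theta_1$ acting as a shift and $\theta_2$ as a slope. The inverted-dropout scaling and the function names are likewise illustrative choices.

```python
import math
import random


def dropout_ratio(length, theta1, theta2):
    """ASSUMED monotonic function: logistic curve shifted by theta1, slope theta2.
    The paper's actual p(S) may differ."""
    return 1.0 / (1.0 + math.exp(-theta2 * (length - theta1)))


def dropout(vec, ratio, rng, training=True):
    """Inverted dropout: zero each element with probability `ratio` and rescale
    the survivors; identity at inference time."""
    if not training or ratio <= 0.0:
        return list(vec)
    if ratio >= 1.0:
        return [0.0] * len(vec)
    keep = 1.0 - ratio
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def fuse_interests(interests, lengths, theta1, theta2, rng, training=True):
    """Apply length-dependent dropout to each interest vector, then concatenate."""
    fused = []
    for vec, length in zip(interests, lengths):
        ratio = dropout_ratio(length, theta1, theta2)
        fused.extend(dropout(vec, ratio, rng, training))
    return fused


rng = random.Random(0)
interests = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]  # click, unclick, like, dislike
lengths = [50, 30, 5, 2]                                      # true sequence lengths
I_f = fuse_interests(interests, lengths, theta1=20.0, theta2=0.2, rng=rng)
print(len(I_f), [round(dropout_ratio(s, 20.0, 0.2), 3) for s in lengths])
```

Under this assumed form, longer sequences receive a higher dropout ratio, so a user's dominant behavior type is regularized more strongly than sparse ones.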

Output Layer.
After passing through the interest fusion layer, the unbiased fused interest representation $I_f$ of the four user behavior sequences is obtained. Then, the fused user interest representation, the user attribute representation, and the item representation are concatenated and passed through multiple fully connected layers with activation functions, followed by a sigmoid function, to obtain the user's predicted click-through rate for the item, as shown in the following formula:
$$\hat{y} = \sigma\left(\mathrm{MLP}\left([I_f; e_p; e_i]\right)\right),$$
where $\hat{y}$ is the click-through rate predicted by the model, $\sigma$ is the sigmoid function, and $[\cdot\,;\cdot]$ denotes concatenation.

Loss Function.
When training the MBIN model, there are two optimization objectives: (1) when estimating the click-through rate of the target item, the estimated value of the model should be as close as possible to the real value; and (2) the interest representations obtained from the four behavior sequences should be as close as possible to the user's real preferences for click, no click, like, and dislike. For objective (1), this paper minimizes the cross-entropy loss function:
$$L_{ctr} = -\frac{1}{N} \sum_{(x, y) \in D} \left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right),$$
where D represents the training dataset of size N, x represents the input sample of the MBIN model, $\hat{y}$ represents the predicted value of the model, and $y \in \{0, 1\}$ represents whether the user has clicked on the item or not.
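Objective (1) is the standard binary cross-entropy averaged over the training samples. A small sketch is given below; the clipping constant `eps` is an implementation assumption added to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("labels and preds must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


clicks = [1, 0, 1, 0]
predicted = [0.9, 0.2, 0.6, 0.4]
print(round(binary_cross_entropy(clicks, predicted), 4))
```

The loss is zero only when every prediction matches its label exactly, and equals log 2 when every prediction is 0.5.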
For objective (2), it is difficult for backpropagation to learn the desired representations when the only supervision signal is whether the user clicked the target item. In order to enable the model to learn representations that are more consistent with the real interest distribution of the user, this paper proposes an auxiliary loss-based method to help learn the user behavior sequences.
Specifically, for a sequence of real behaviors of length S, MBIN uses the first S − 1 user behaviors to predict the user's S-th behavior. This is a very large multiclass classification task in which the number of classes equals the number of candidate items. Taking the user click behavior sequence as an example, the behavior representations produced by multihead target attention in the interest extraction layer are averaged over the first S − 1 behaviors to obtain the interest representation $I_c^{S-1}$ of the user's first S − 1 clicks. The probability that the user's next click after these S − 1 behaviors falls on item i can be defined by the softmax function as follows:
$$p_i^S = \frac{\exp\left(I_c^{S-1} \cdot v_i\right)}{\sum_{k=1}^{K} \exp\left(I_c^{S-1} \cdot v_k\right)},$$
where $v_i$ is the representation of item i, and $v_c^S$ denotes the representation of the S-th clicked item. Taking the cross-entropy as the loss function gives the following loss:
$$L^C = -\sum_{i=1}^{K} y_i^S \log p_i^S,$$
where $y_i^S \in \{0, 1\}$ is the label of item i for the S-th position in the sequence of user click behaviors, $p_i^S$ is the corresponding prediction result, and K is the number of classes, i.e., the number of items. $y_i^S = 1$ if and only if item i is the S-th behavior in the sequence of user click behaviors. Considering that the number of items is in the order of millions, which makes the softmax classification too expensive to compute, the negative sampling technique [18] is used to simplify the computation. With $\sigma$ denoting the sigmoid function, the corresponding loss becomes:
$$L_{NS}^C = -\left( \log \sigma\left(I_c^{S-1} \cdot v_c^S\right) + \sum_{h=1}^{H} \log \sigma\left(-I_c^{S-1} \cdot v_C^h\right) \right),$$

Mobile Information Systems
where $v_C^h$ is the representation of the h-th negative sample obtained from negative sampling, and H is the number of negative samples, whose value is much smaller than the total number of items K.
Similarly, the auxiliary losses $L_{NS}^U$, $L_{NS}^L$, and $L_{NS}^D$ for unclicked behavior, liked behavior, and disliked behavior are obtained.
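The auxiliary objective with negative sampling can be sketched as below. This is an illustration under stated assumptions rather than the paper's exact loss: it uses the standard sampled form (log-sigmoid of the positive score plus log-sigmoid of the negated negative scores), averages the first S − 1 behavior vectors to form the interest, and draws negatives uniformly at random; the helper names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n, dim = len(vectors), len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


def sample_negative_ids(num_items, positive_id, count, rng):
    """Uniformly sample `count` distinct item ids other than the positive one
    (uniform sampling is an assumption)."""
    candidates = [i for i in range(num_items) if i != positive_id]
    return rng.sample(candidates, count)


def sampled_auxiliary_loss(history, positive, negatives):
    """Negative-sampling loss for predicting the S-th behavior from the first S-1."""
    interest = mean_pool(history)
    loss = -log_sigmoid(dot(interest, positive))
    for neg in negatives:
        loss -= log_sigmoid(-dot(interest, neg))
    return loss


rng = random.Random(0)
item_vecs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(20)]
click_ids = [3, 7, 11, 5]                      # toy click sequence, S = 4
history = [item_vecs[i] for i in click_ids[:-1]]
neg_ids = sample_negative_ids(len(item_vecs), click_ids[-1], count=3, rng=rng)
loss = sampled_auxiliary_loss(history, item_vecs[click_ids[-1]],
                              [item_vecs[i] for i in neg_ids])
print(round(loss, 4))
```

Each call touches only 1 + H item vectors instead of all K, which is the point of replacing the full softmax.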
Finally, the overall loss of the MBIN model is defined as follows:

Experiments
In order to verify the validity of the proposed model, extensive experiments are conducted on a real large-scale dataset, and the effectiveness of the MBIN model is evaluated by comparison with several state-of-the-art (SOTA) methods.

Data Description.
In this paper, we conducted extensive experiments based on the Alimama display ad dataset, the details of which are listed in Table 1. The Alimama dataset (https://tianchi.aliyun.com/dataset/dataDetail?dataId=56) consists of randomly selected ad display and click logs from Taobao over an 8-day period. It contains about 26 million logs, 1.15 million users, and 850,000 items. The logs of the first 7 days are used as training data, and the logs of the last day are used as test data.
This dataset contains data on multiple types of user feedback behaviors. It includes purchases and favorites as explicit positive feedback (likes) and multiple impressions without clicks as explicit negative feedback (dislikes). Since the Alimama dataset is collected from real advertising scenarios with large-scale and complete features, it is widely used to study various CTR prediction models [1,2,12].

Evaluation Metrics.
We evaluate the CTR prediction task on the dataset mentioned above. Specifically, the evaluation metric of our model is the area under the receiver operating characteristic (ROC) curve (AUC), which reflects the ranking ability of the model and is widely used in binary classification problems. It is worth noting that in real large-scale datasets, an AUC improvement of 0.1 is often already substantial and usually leads to large business gains. Furthermore, similar to DFN [16], we introduce RelaImpr to measure the improvement relative to the base model, which is calculated as follows:
$$\mathrm{RelaImpr} = \left( \frac{\mathrm{AUC}(\text{measured model}) - 0.5}{\mathrm{AUC}(\text{base model}) - 0.5} - 1 \right) \times 100\%.$$
All models are run at least 10 times to select the optimal hyperparameters, and the best experimental results are reported.
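Both metrics are simple to compute. The sketch below uses the pairwise definition of AUC (the fraction of positive/negative pairs ranked correctly, with ties counted as half) and the usual RelaImpr definition, in which 0.5 is the AUC of a random ranker. The quadratic pairwise loop is chosen for clarity and would be too slow for logs of this scale; the AUC values in the usage line are made up for the example.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rela_impr(auc_model, auc_base):
    """Relative improvement over a base model, in percent."""
    return ((auc_model - 0.5) / (auc_base - 0.5) - 1.0) * 100.0


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1]
print(auc(labels, scores))                 # 0.75
print(round(rela_impr(0.60, 0.55), 2))     # 100.0 (hypothetical AUC values)
```

Because 0.5 is subtracted from both AUCs, a seemingly small absolute AUC gain can correspond to a large RelaImpr.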

Competitors.
In this paper, several state-of-the-art (SOTA) comparison schemes are implemented on the above real dataset, as follows:
(1) Wide&Deep. It [10] consists of a wide part and a deep part, allowing the wide part to perform low-order feature interactions and the deep part to perform high-order feature interactions, thus enabling the model to learn both low-order and high-order feature interactions.
(3) DIN. It [1] uses the target attention mechanism to model the relationship between candidate items and the user's historical behavior sequence, and is widely used in behavior sequence modeling scenarios.
(4) DSIN. It [12] divides the user's historical click behavior sequence into multiple sessions according to the temporal relationship of the click behaviors, and then uses Bi-LSTM to model these sessions to obtain the evolution of the user's interest.
(5) DIEN. It [2] uses a two-layer GRU to model the evolution of user interest and extract the user's latent temporal interest.
(6) DMR. It [15] brings the idea of collaborative filtering from matching methods into the ranking task of CTR prediction, using a user-to-item network and an item-to-item network, respectively.
DMT. In order to effectively model multiple types of user behavior sequences, DMT [14] uses multiple transformers to model the user's click sequences, collection sequences, and purchase sequences, and extracts the interest representations of multiple user behaviors.

Parameter Settings.
In this paper, all models were implemented in TensorFlow and evaluated on exactly the same training and test datasets. The maximum length T of each behavior sequence is set to 50. In the embedding layer, the embedding vector dimension is uniformly set to 64. The batch size is set to 512, and the learning rate is set to 0.0005. The hyperparameters of each method are heavily tuned, and the results under the best hyperparameters are reported as the experimental results.

Overall Performances.
As shown in Table 2, the following conclusions can be obtained. First, the proposed method achieves the best results, with a relative improvement of 10.62% compared with Wide&Deep, which is a very significant improvement for real industrial-grade datasets. Second, comparing the DIN, DMR, DFN, and DMT models shows that using more types of user feedback behaviors can effectively improve the effectiveness of the models. In particular, both DFN and DMT use different types of user behavior sequences, and their effectiveness is significantly higher than that of DIN and DMR, which only use user click behavior sequences.
These experimental results also support the starting point of this paper for modeling multiple different user behaviors: richer user behaviors help the models obtain more accurate user interest representations.

Effectiveness of the Interest Denoising Layer.
The interest denoising layer is designed to purify the noisy representation of implicit negative feedback. In order to explore the effectiveness of using the like representation to denoise the unclick representation, we design three variants for comparison, namely, MBIN-IDN, which removes the interest denoising layer; MBIN_DUMN [19], which uses a vector projection layer to obtain the denoised vector representation; and MBIN_XDM [20], which obtains the purified representation through a confidence layer and triplet loss.
As shown in Table 3, the effect is significantly reduced after removing the interest denoising layer, indicating that the interest denoising layer in the MBIN model can effectively remove the noise in the implicit user feedback and improve the prediction accuracy of the model.

Effectiveness of the User Interest Fusion Layer.
In most scenarios, user interaction data have multiple feedback types, usually including implicit positive feedback (e.g., click), implicit negative feedback (e.g., no-click), explicit positive feedback (e.g., like and buy), and explicit negative feedback (e.g., multiple no-clicks), which are important for characterizing users' unbiased interests. From a modeling perspective, how the four different types of user interest representations are fused will have a meaningful impact on the effectiveness of click-through prediction tasks.
To evaluate the proposed interest fusion layer, we design several comparison methods: MBIN_CONCAT, which concatenates the implicit positive and negative feedback representations; and MBIN_ATTENTION, which uses an attention mechanism to adaptively fuse two types of user interest representations. According to the results shown in Table 4, the performance of our MBIN is significantly improved. Since the behavior distributions of implicit positive feedback and implicit negative feedback are quite different, it is difficult to obtain a better user interest representation by concatenating those heterogeneous interest representations, and so attention mechanisms are used.

Effectiveness of the Auxiliary Loss.
In order to verify the effectiveness of the auxiliary loss function, it is removed from the MBIN model in an ablation experiment. We define MBIN-L as the MBIN model with the auxiliary loss functions of all behavior sequences removed, to verify their overall effect on the model. Similarly, MBIN-LC, MBIN-LU, MBIN-LL, and MBIN-LD are the MBIN models after removing the auxiliary loss functions of the clicked, unclicked, liked, and disliked behavior sequences, respectively. The experimental results of the above models are shown in Table 5.
The above experiments show that the effect of the model decreases to different degrees after removing the auxiliary loss functions of the four behavior sequences, among which removing the auxiliary loss function of the click behavior sequence has the most obvious effect. Therefore, the auxiliary loss function designed in the MBIN model can effectively improve the model's interest modeling ability and thus improve its prediction accuracy.

Parameter Analysis.
In order to analyze the effects of some important parameters in the MBIN model, parameter sensitivity experiments were conducted for $\theta_1$ and $\theta_2$ in the interest fusion layer, and the experimental results are shown in Figures 2 and 3. From the figures, it can be found that the performance of the model rises and then falls as the values of $\theta_1$ and $\theta_2$ are varied. The hyperparameter $\theta_1$ determines the intercept of the monotonic function p(S) and reaches the best result when it takes the value of 20. The hyperparameter $\theta_2$ determines the slope of the monotonic function p(S) and affects the rate at which the dropout ratio changes with the length of the behavior sequence, reaching the best result when it takes the value of 0.2. Intuitively, after choosing the optimal hyperparameters $\theta_1$ and $\theta_2$, the dropout ratio increases as the actual length S of the user behavior sequence increases; i.e., the feature is more likely to be "dropped", which reduces the risk of overfitting the model to that feature and thus makes the model more robust.

Conclusions
In this paper, by introducing multitype behavior sequences of users into the click-through rate prediction model, the model is able to better capture user interests, thus improving the effectiveness of click-through rate prediction and ultimately the accuracy of the recommendation system. Then, by designing the interest denoising layer, the influence of noisy data in the implicit user feedback on the user interest extraction process is mitigated, thus further improving the effectiveness of the model. In addition, this paper designs an adaptive dropout-based user interest fusion layer, which can adapt to the variability of user behavior distribution and thus effectively fuse multiple types of user interest representations. After that, an auxiliary loss function based on behavior sequences is designed to help the model obtain more accurate interest representations. Finally, this paper demonstrates the rationality of the proposed method through extensive experiments and verifies the effectiveness of each layer through ablation experiments. Our method achieves a relative improvement of 10.62% compared with Wide&Deep and can provide users with more accurate recommendation results.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no competing interests.

Figure 2: Impact of the parameter $\theta_1$.