XGBDeepFM for CTR Predictions in Mobile Advertising Benefits from Ad Context

The click-through rate (CTR) is one of the most informative metrics used in mobile business activities, such as profit evaluation and resource management. In mobile advertising, CTR prediction is essential but challenging because of data sparsity. Moreover, existing methods often have difficulty capturing feature interactions of different orders simultaneously. In this study, we develop a method for accurate CTR prediction that incorporates contextual features and feature interactions. We first use extreme gradient boosting (XGBoost) in a feature engineering phase to select highly significant features. The selected features are mobile contextual attributes, including temporal, geographical, and other contextual attributes (e.g., weather conditions), taken from actual mobile advertising situations. Our model, the XGBoost deep factorization machine (FM)-supported neural network (XGBDeepFM), combines XGBoost for feature selection, FM for two-order feature interactions, and a deep neural network for high-order feature learning in a unified architecture. In a mobile advertising setting, our method yields accurate CTR prediction within a "wide and deep" type of model. Experiments on commercial datasets show that, compared with existing models, XGBDeepFM achieves a higher area under the curve (AUC) and improves the effectiveness and efficiency of CTR prediction for mobile advertising.


Introduction
The task of click-through rate (CTR) prediction is crucial in advertising and recommendation; its main goal is to maximize clicks in order to improve advertising revenue or user satisfaction [1][2][3]. In advertising, CTR is an important indicator of the effectiveness of ad displays [4]. Advertisers' revenue relies heavily on the quality of CTR prediction. In recommendation, the items returned to users can be ranked by their predicted CTR [5].
This predicted probability helps recommendation systems estimate users' interest in specific items such as news [6,7], movies [8], tags [9], or commercial items [10], which influences subsequent decision-making [10]. Recommendation solutions can be classified as collaborative, content-based, knowledge-based, demographic, and hybrid [11]. Each strategy can benefit from CTR prediction [12].
One of the core problems that mobile advertising strives to solve is providing the right ads to the right people at the right time and in the right context. Users' attention spans have shrunk considerably; thus, no one has the time to watch useless and intrusive advertisements. Nevertheless, the answer may be in the hands of marketers, especially in a dynamic mobile world. Mobile contextual advertising is not only about finding the right users in the right context, including time, geography, and weather, but also about connecting the advertisement with the user in the ad context and providing a pleasant experience. Accurate CTR prediction based on contextual features is therefore vital to marketers.
Another key challenge for CTR prediction is learning the low- and high-order feature interactions behind user behavior in a given context. Some feature interactions are easy to capture. Low-order feature interactions (up to two orders) can be designed from experts' prior experience. However, high-order feature interactions can be difficult to understand. These deep feature interactions can be learned by deep neural networks (DNNs). Researchers have proposed different methods of CTR prediction. Some approaches address two-order feature interactions. For example, a logistic regression (LR) model has been used to predict the CTR on Google Ads [13]. Factorization machines (FMs) have been used to model two-order feature interactions [14]. In recent years, DNNs have become popular because of their capability to learn high-order feature interactions. For example, Zhang et al. studied feature representations and proposed the FM-supported neural network (FNN) [15]. Qu et al. proposed the product-based neural network (PNN), which learns high-order feature interactions by introducing a product layer [16]. Cheng et al. combined "wide" and "deep" (W&D) components in a W&D model for low- and high-order feature interactions [2].
The remainder of this paper is organized as follows. In Section 2, the extreme gradient boosting (XGBoost) deep FM-supported neural network (XGBDeepFM) is proposed, considering the problems in contemporary research and the characteristics of the mobile advertising dataset. In Section 3, the experiment is designed, a comparison experiment is performed for CTR prediction, and the effectiveness and efficiency of the XGBDeepFM model are analyzed. Finally, Section 4 concludes the study.

Materials and Methods
According to the core challenge of computational advertising proposed by Broder [3], the best match between a given user in a given context and a suitable advertisement should be determined. $((a_i, u_i, c_i), y_i)$ denotes instance $i$ of the dataset, where $a_i$ denotes the advertisement, $u_i$ represents the user, $c_i$ denotes the context, and $y_i \in \{0, 1\}$ is the click label.
We propose a unified approach, namely, the XGBDeepFM model, which benefits from prior information from the context and from high-order feature interactions, as shown in Figure 1.
Our approach consists of three components, namely, the XGBoost, FM, and deep components. By using these three components, the proposed XGBDeepFM model can realize full interactive modeling of the bilateral features (i.e., ad and user features) and the contextual features.
$\hat{y}$ denotes the predicted CTR, and $y_{\mathrm{XGBoost}}$, $y_{\mathrm{FM}}$, and $y_{\mathrm{DNN}}$ are the outputs of the XGBoost, FM, and deep components, respectively.
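As a hedged illustration of how three component outputs could be fused into a single click probability, the sketch below assumes a DeepFM-style rule: a sigmoid applied to the sum of the raw component scores. The function name `combine_outputs` and the summation rule are assumptions made for illustration only; the exact combination used by XGBDeepFM may differ.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def combine_outputs(y_xgboost: float, y_fm: float, y_dnn: float) -> float:
    """Assumed fusion rule: sigmoid of the summed raw component scores."""
    return sigmoid(y_xgboost + y_fm + y_dnn)


if __name__ == "__main__":
    # Hypothetical raw scores, chosen only to exercise the function.
    print(combine_outputs(0.2, -0.5, 0.1))
```

Summing the raw scores before the sigmoid lets each component contribute additively on the logit scale, which is the usual design in "wide and deep" style models.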

XGBoost Component.
The XGBoost component is a scalable machine learning system for tree boosting [2]. In XGBoost, feature selection and combination are performed automatically to generate new discrete feature vectors as the input of the LR model. The depth of a decision tree determines the order of the feature interactions it captures. For example, if a decision tree has four levels, each root-to-leaf path passes through three splits, so each leaf encodes a three-order feature interaction. We use XGBoost to capture three-order feature interactions and to perform feature selection. XGBoost minimizes the objective function of equation (2) and builds the model with a forward stagewise algorithm, $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, where $f_t$ is the tree added at iteration $t$, $\hat{y}_i^{(t)}$ is the predicted value at iteration $t$, that is, the prediction for sample $x_i$ by $t$ trees, and $\hat{y}_i^{(t-1)}$ is the predicted value after the $(t-1)$th iteration. Thus, when the model is initialized, it has no tree, and the predicted result is a constant. Each iteration adds a new tree to the model, and the loss function changes correspondingly. In addition, the first $t-1$ trees are already trained and held fixed when the $t$th tree is added.
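For reference, the regularized objective and additive update of XGBoost, as commonly stated in the tree-boosting literature, can be written as below. The symbols $l$ (a differentiable loss), $f_k$ (the $k$th tree), $K$ (the number of trees), $T$ (the number of leaves of a tree), $w$ (its leaf weights), the penalties $\gamma$ and $\lambda$, and the initial constant $c$ are assumed notation for this sketch and are not taken from the source.

```latex
\begin{aligned}
\mathrm{Obj} &= \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}, \\
\hat{y}_i^{(0)} &= c, \qquad
\hat{y}_i^{(t)} = c + \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i).
\end{aligned}
```

At iteration $t$, only $f_t$ is optimized; the contribution $\hat{y}_i^{(t-1)}$ of the earlier trees enters the loss as a fixed quantity.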

FM Component.
The FM component is used to learn feature interactions [1]. FM models capture two-order feature interactions as the inner product of the respective feature latent vectors.
Here, $v_i$ and $v_j$ denote the latent vectors of features $i$ and $j$, and each cross-term parameter $w_{ij}$ is expressed by the inner product $\langle v_i, v_j \rangle$ of these latent vectors. The objective function of FM is as follows:
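The standard second-order FM model from the literature is $y_{\mathrm{FM}} = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$; this textbook form is assumed here and may differ in detail from the variant used in XGBDeepFM. The minimal sketch below computes the pairwise term both naively and with the well-known linear-time identity $\tfrac{1}{2}\sum_f \left[(\sum_i v_{if} x_i)^2 - \sum_i v_{if}^2 x_i^2\right]$. All function names are hypothetical.

```python
from typing import List


def fm_pairwise_naive(x: List[float], v: List[List[float]]) -> float:
    """Sum of <v_i, v_j> * x_i * x_j over all pairs i < j (quadratic time)."""
    total = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x: List[float], v: List[List[float]]) -> float:
    """Same quantity, computed in time linear in the number of features."""
    k = len(v[0])  # dimension of the latent vectors
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total


def fm_score(x: List[float], w0: float, w: List[float],
             v: List[List[float]]) -> float:
    """Assumed FM output: bias + linear term + pairwise interaction term."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return linear + fm_pairwise_fast(x, v)
```

Because each $w_{ij}$ is factorized through the latent vectors, an interaction between two features can be estimated even when that exact pair rarely co-occurs, which is what makes FM attractive for sparse CTR data.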

Deep Component.
The deep component is used to learn high-order (more than three orders) feature interactions. The original features are first embedded so that the features of different fields are mapped to an embedding space of the same dimension. As with the FM latent vectors, the dimension of the embedding vectors is $k$. Here, we set two layers for the deep component. In the computation of the DNN component, $e_m$ denotes the embedding of the discrete features and $y_{\mathrm{DNN}}$ is the DNN prediction of the CTR of mobile advertising. After feature selection by XGBoost, the FM and deep components share the same feature embedding.
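A minimal sketch of such a forward pass is given below, assuming that the field embeddings are concatenated, passed through fully connected layers with ReLU activations, and reduced to a scalar by a linear output unit. The activation function, the output unit, and all function and parameter names are assumptions for illustration and are not specified in the text above.

```python
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def relu(vec: List[float]) -> List[float]:
    """Element-wise rectified linear unit (assumed activation)."""
    return [max(0.0, a) for a in vec]


def dense(vec: List[float], weights: List[List[float]],
          bias: List[float]) -> List[float]:
    """Fully connected layer; weights[j] is the weight row of output unit j."""
    return [sum(w * a for w, a in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def deep_component(embeddings: List[List[float]], layers: List[Layer],
                   out_w: List[float], out_b: float) -> float:
    """Concatenate field embeddings, apply the hidden layers, return a scalar."""
    a = [val for e in embeddings for val in e]  # concatenation of all e_m
    for weights, bias in layers:
        a = relu(dense(a, weights, bias))
    return sum(w * h for w, h in zip(out_w, a)) + out_b
```

Passing a two-element `layers` list corresponds to the two-layer setting described above. Sharing the embeddings with the FM component means the same vectors serve as FM latent vectors and as DNN inputs, so no separate pre-training step is needed.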

Datasets.
The dataset used in this study contains O2O mobile ad data from a mobile Internet platform that provides users with local life service information. The dataset covers offline scenes, such as catering, supermarkets, convenience stores, takeout, beauty salons, and cinemas. The platform not only provides rich ad information but also offers users' explicit and implicit behavior information. Such an abundance of data is a great convenience for CTR prediction. The original experimental dataset contains attributes such as shop information, users' payment logs, and users' browsing logs in 2016 (see Tables 1 and 2).
Mobile contextual ad CTR is closely related to weather. Thus, we also crawl weather information from a weather platform named http://WunderGround.com. The platform is a reliable source of historical weather information on a global scale. In this study, 4,369,918 historical weather records for 122 cities are crawled at the day and hour levels, as shown in Tables 3 and 4, respectively.

Evaluation Metrics.
We use the area under the ROC curve (AUC) as our evaluation metric because it is not biased by the size of the test or evaluation data. AUC measures the likelihood that, given two random points, one from the positive class and one from the negative class, the classifier will rank the point from the positive class higher than the one from the negative class. The larger the AUC is, the more accurate the CTR prediction of mobile advertising will be.
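The pairwise interpretation of AUC given above can be computed directly, as in the following sketch. It counts, over all positive-negative pairs, how often the positive instance receives the higher score, with ties counted as one half. The function name `auc_pairwise` is hypothetical, and the quadratic loop is meant for illustration rather than for large datasets.

```python
from typing import List


def auc_pairwise(labels: List[int], scores: List[float]) -> float:
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative.")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))
```

Because the value depends only on the relative ordering of the scores, it is unaffected by the overall click rate or by any monotonic rescaling of the predictions.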

Feature Selection by XGBoost.
A benefit of using XGBoost is that, after the boosted trees are constructed, importance scores that indicate how useful each feature is in the construction of the boosted decision trees can be easily obtained. Thus, we choose XGBoost because this model is easily interpretable by human experts. Moreover, the depth of the decision tree decides the order of feature interaction, which complements the FM and DNN components. We plot the feature importance calculated by the XGBoost model, as shown in Figure 2. The ranking of feature importance shows that contextual features are of high importance. For example, among the temporal features, the week ranks fourth in importance in the model; among the geographical features, geographical location ranks tenth; and among the weather contextual features, pressure and body temperature are important features, ranking second. This feature selection focuses on the integration of mobile ad bilateral factors (i.e., ad and user factors) and contextual factors, as shown in Table 5.
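The selection step can be sketched as ranking features by importance score and keeping the top $k$. The helper below is a hypothetical illustration in plain Python; the feature names and scores in the usage example are placeholders and are not the values behind Figure 2. With the xgboost Python package, a dictionary of this shape could be obtained from a trained booster via `Booster.get_score(importance_type="gain")`, although the importance type used in this study is not stated.

```python
from typing import Dict, List


def select_top_features(importance: Dict[str, float], k: int) -> List[str]:
    """Return the k feature names with the highest importance scores.

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(importance.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    scores = {"feat_a": 0.30, "feat_b": 0.05, "feat_c": 0.45, "feat_d": 0.20}
    print(select_top_features(scores, 2))
```

Only the selected features would then be embedded and passed to the FM and deep components.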

Model Comparison.
We first compare the performance of each component of the model (i.e., the first-order linear, second-order FM, DNN, and XGB components) and their combinations under the optimal settings. Then, we compare our proposed method with other models, namely, W&D, FNN, PNN, and XDeepFM. Figure 3 presents the AUC results of the in-model comparisons. We compare the predictive performance of the different models and observe the following. As shown in Figure 4, XGBDeepFM is superior to all the deep CTR models in terms of AUC: its AUC is 0.0015, 0.0003, 0.0003, and 0.0001 higher than those of W&D, FNN, PNN, and XDeepFM, respectively. Overall, the following results are obtained (Figures 3 and 4):
(1) Learning feature interactions, instead of learning only linear features, improves the performance of CTR prediction.
(2) Learning low- and high-order feature interactions simultaneously contributes to CTR prediction.
(3) Learning the more important features selected by the XGBoost model can improve the performance of a CTR prediction model.
Figure 5 compares the convergence time of the different models. The results show that XGBDeepFM converges faster than PNN and FNN and is second only to XDeepFM; its loss is the lowest after the 10th round of training.

Conclusions
In CTR prediction, the contextual features and the interactions among ad, user, and contextual features are key factors that affect prediction performance. In this study, we propose the XGBDeepFM model. We first include information on contextual features to improve the prediction accuracy from the perspectives of time, geography, and weather. Then, a feature selection process is conducted to obtain important features. Low- and high-order feature interactions are learned using the proposed XGBDeepFM model. We conduct extensive experiments to compare the effectiveness and efficiency of XGBDeepFM with those of other methods. Our experimental results demonstrate that (1) XGBDeepFM outperforms the state-of-the-art models in terms of AUC, and (2) XGBDeepFM is more efficient than most deep neural network models.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.