A New Approach for Mobile Advertising Click-Through Rate Estimation Based on Deep Belief Nets

In recent years, with the rapid development of the mobile Internet and its business applications, mobile advertising Click-Through Rate (CTR) estimation has become a hot research direction in the field of computational advertising. It is used to achieve accurate advertisement delivery for the best benefits in the three-side game between media, advertisers, and audiences. Current research on CTR estimation mainly uses machine learning methods and models, such as linear models or recommendation algorithms. However, most of these methods extract data features insufficiently and cannot reflect the nonlinear relationships between different features. To solve these problems, we propose a new model based on Deep Belief Nets to predict the CTR of mobile advertising, which combines the powerful data representation and feature extraction capability of Deep Belief Nets with the simplicity of traditional Logistic Regression models. Based on a training dataset with the information of over 40 million mobile advertisements during a period of 10 days, our experiments show that our new model achieves better estimation accuracy than the classic Logistic Regression (LR) model by 5.57% and the Support Vector Regression (SVR) model by 5.80%.


Introduction
With the rapid development of mobile Internet and its wide applications, mobile advertising has become one of the most successful business models in the world. To achieve the best advertisement delivery, how to best estimate the Click-Through Rate (CTR) is the key and has thus become a hot research direction in the field of computational advertising.
Generally, the display of an online advertisement can be seen as a three-side game between media, advertisers, and audiences. The advertiser purchases advertising opportunities from the media on a cost-per-click (CPC) basis, so the expected revenue of the media is determined by the Click-Through Rate (CTR) and the CPC. The formula [1] is as follows:
$$E[\text{revenue}] = \text{CTR} \times \text{CPC} \tag{1}$$
How to advertise to a specific audience group so as to benefit all three sides of the game is the key problem in the field of online advertising. Inappropriate advertising degrades the user experience, especially in mobile advertising, where it increases annoyance and wastes network traffic; the advertisement then fails to achieve the expected effect, and the media suffer too. As a result, CTR estimation is the critical factor in this three-side game.
Many models have been proposed by academia and industry, usually based on machine learning methods. We can divide them into three categories: linear, nonlinear, and fusion models. Chapelle [2] proposed a machine learning framework based on Logistic Regression (LR), aiming to predict the CTR for the Yahoo website. Dembczyński et al. [3] adopted decision trees to discover the implicit relationships between different features and found nonlinear relationships between the predicted target and the features. He [4] developed a fusion model of LR and decision trees for the advertisement system of Facebook. The factorization-based prediction method Field-Aware Factorization Machines [5] was developed by Juan et al. Tagami et al. [6] proposed a learning-to-rank approach for contextual advertising. Most of the existing prediction models are well developed and can give satisfactory results after they are trained with massive training data. However, the features of these data are chosen, built, and handled manually, so the models depend heavily on the experience and techniques of the data analysts.
Meanwhile, deep learning has become an important approach, especially after Hinton et al. [7] proposed a fast layer-by-layer training procedure in 2006 that eased the difficulty of training Deep Neural Networks (DNNs). With the help of this training procedure, neural networks can now be trained more effectively, and they are widely used in multiple fields such as handwritten numeral recognition [7], 3D object recognition [8], and speech recognition [9].
The advertisement data to be predicted is a kind of time-series data [10], which consists of sampled data points taken from a continuous, real-valued process over time. For the application of deep learning methods to time-series data, Chaturvedi et al. [11] presented a deep transfer learning method. The key idea is to learn high-dimensional network motifs from low-dimensional forms through a process of transfer learning. This method may greatly decrease the computational cost in the range of 25% to 600%. The approaches above provide useful experience for applying neural networks to Click-Through Rate prediction.
Since the key features of advertisement data are not detected completely and the nonlinear relationships between different features are not fully reflected in current models, this paper proposes a mobile advertising CTR estimation fusion model based on Deep Belief Nets (DBNs). Our model combines the powerful data representation and feature extraction capability of the DBN with the simplicity of traditional Logistic Regression models. To solve the key problem of obtaining the best feature expression in machine learning, our model detects deep-level features in place of the simple original features and then feeds them into a Logistic Regression model to predict the CTR. Experimental results show that our model clearly improves the estimation accuracy. It performs 5.57% and 5.80% better than the original LR model and the Support Vector Regression (SVR) model, respectively, in terms of the Area Under the Curve (AUC) evaluation criterion.

Deep Belief Nets
Deep Belief Nets (DBNs) are probabilistic generative models that are composed of multiple layers of stochastic, latent variables. In this part, we train our DBN by using Restricted Boltzmann Machines (RBMs) to initialize the parameters and then apply the Backward Error Propagation (BP) algorithm to adjust them.

Restricted Boltzmann Machines.
A Restricted Boltzmann Machine (RBM) is a two-layer undirected graph model [12], whose structure is shown in Figure 1. It consists of one visible layer and one hidden layer. There are connections between the layers but no connections between units in the same layer. The units in the hidden and visible layers can be binary units or exponential-family units, for example, Gaussian, Poisson, and softmax. The tied weights and bias values define the energy function and the probability distribution of the whole model. Let $v$ be the binary state vector of the visible units and $h$ that of the hidden units; the energy function is then as follows:
$$E(v, h \mid \theta) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i w_{ij} h_j \tag{2}$$
Here, $\theta = (w_{ij}, a_i, b_j)$, where $w_{ij}$ stands for the undirected tied weight between unit $v_i$ in the visible layer and unit $h_j$ in the hidden layer, $a_i$ and $b_j$ stand for the bias values of $v_i$ and $h_j$, and $n$ and $m$ are the numbers of units in the visible layer and the hidden layer, respectively. For a given $\theta$, we can get the marginal distribution of the state vector $v$ of the visible layer from the energy function:
$$P(v \mid \theta) = \frac{1}{Z(\theta)} \sum_{h} e^{-E(v, h \mid \theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)} \tag{3}$$
The units inside one layer of an RBM are not connected to each other, so given the states of the units in one layer, the activation states of the units in the other layer are conditionally independent. As a result, we can calculate $P(v \mid h)$ and $P(h \mid v)$ unit by unit, with $\sigma(x) = 1 / (1 + e^{-x})$:
$$P(v_i = 1 \mid h, \theta) = \sigma\Big(a_i + \sum_{j=1}^{m} w_{ij} h_j\Big) \tag{4}$$
$$P(h_j = 1 \mid v, \theta) = \sigma\Big(b_j + \sum_{i=1}^{n} v_i w_{ij}\Big) \tag{5}$$
The purpose of training the RBM is to find the value of the parameter $\theta$ that fits the samples. In this paper, we use an unsupervised training method that maximizes the log likelihood function $\log L(\theta)$ of the input samples, and we use the stochastic gradient method with the partial derivatives of the parameters to update the tied weights between the visible layer and the hidden layer:
$$\frac{\partial \log L(\theta)}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \tag{6}$$
In formula (6), $\langle \cdot \rangle_{\text{data}}$ stands for the expectation under the distribution of the hidden layer states given the states of the visible layer, while $\langle \cdot \rangle_{\text{model}}$ is the expectation under the joint probability distribution of the visible layer and the hidden layer.
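The calculation under the joint probability distribution takes exponential time, so we use the Contrastive Divergence algorithm to approximate the gradient by running $k$ steps of Gibbs sampling, as shown in (7) and Figure 2:
$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{k} \right) \tag{7}$$
where $\varepsilon$ is the learning rate and $\langle \cdot \rangle_{k}$ denotes the expectation after $k$ steps of Gibbs sampling.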
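The Contrastive Divergence update can be illustrated with a minimal pure-Python sketch for a binary RBM with a single Gibbs step (CD-1). This is only an illustration under stated assumptions, not the implementation used in the experiments: the function names, the learning rate of 0.1, the layer sizes, and the toy data are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_hidden(v, W, b, rng):
    """Return P(h_j = 1 | v) for every hidden unit and a binary sample of h."""
    probs = [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(b))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]


def sample_visible(h, W, a, rng):
    """Return P(v_i = 1 | h) for every visible unit and a binary sample of v."""
    probs = [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(a))]
    return probs, [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, W, a, b, lr, rng):
    """Apply one CD-1 update in place for a single training vector v0."""
    ph0, h0 = sample_hidden(v0, W, b, rng)   # positive phase (data statistics)
    _, v1 = sample_visible(h0, W, a, rng)    # one Gibbs step: reconstruct v
    ph1, _ = sample_hidden(v1, W, b, rng)    # negative phase (reconstruction)
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        a[i] += lr * (v0[i] - v1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])


# Toy usage with made-up data: 4 visible units, 3 hidden units.
rng = random.Random(0)
n_visible, n_hidden = 4, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
for _ in range(200):
    for v in data:
        cd1_update(v, W, a, b, lr=0.1, rng=rng)
```

The sketch uses hidden probabilities rather than sampled states in the weight update, which is a common variance-reduction choice; running more than one Gibbs step before the negative phase would give CD-k.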

The Structure of Deep Belief Nets and Their Training Structure.
A DBN is a probabilistic generative model that contains one visible layer and multiple hidden layers, as shown in Figure 3. The top two layers of the DBN are connected by undirected edges and are called the Associative Memory Layer. The remaining layers are connected by directed edges. The downward edges carry the Generative Weights, while the upward edges carry the Detection Weights. The training of a DBN can be divided into two parts.
(1) Treat every two adjacent layers of the DBN as a single Restricted Boltzmann Machine (RBM), where the hidden layer of the lower RBM serves as the visible layer of the upper RBM. Then, train these RBMs one by one to get the initial weights of the DBN.
(2) Train the DBN as a standard feedforward network and use the label data of the samples and BP [13,14] to adjust the parameters of the model.
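Part (1) of this procedure, the greedy layer-by-layer pretraining, can be sketched as follows. This is a hedged, minimal illustration in pure Python: the helper names (`train_rbm`, `pretrain_dbn`, `dbn_features`), the mean-field CD-1 trainer, and all hyperparameters are assumptions made for the example, and the supervised BP fine-tuning of part (2) is deliberately left out.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def train_rbm(data, n_hidden, epochs, lr, rng):
    """Train one RBM with mean-field CD-1 and return (W, a, b)."""
    n_visible = len(data[0])
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    a, b = [0.0] * n_visible, [0.0] * n_hidden
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)
            h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
            v1 = visible_probs(h0, W, a)      # reconstruction probabilities
            ph1 = hidden_probs(v1, W, b)
            for i in range(n_visible):
                for j in range(n_hidden):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                a[i] += lr * (v0[i] - v1[i])
            for j in range(n_hidden):
                b[j] += lr * (ph0[j] - ph1[j])
    return W, a, b


def pretrain_dbn(data, hidden_sizes, epochs=20, lr=0.1, seed=0):
    """Greedy pretraining: each RBM is trained on the output of the one below."""
    rng = random.Random(seed)
    layers, layer_input = [], data
    for n_hidden in hidden_sizes:
        W, _, b = train_rbm(layer_input, n_hidden, epochs, lr, rng)
        layers.append((W, b))  # keep the upward weights and hidden biases
        # The hidden activations become the "visible" data of the next RBM.
        layer_input = [hidden_probs(v, W, b) for v in layer_input]
    return layers


def dbn_features(v, layers):
    """Propagate one input vector upward through the pretrained stack."""
    for W, b in layers:
        v = hidden_probs(v, W, b)
    return v


# Toy usage with made-up data and two small hidden layers.
data = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
layers = pretrain_dbn(data, hidden_sizes=[3, 2])
print(dbn_features(data[0], layers))
```

The returned weights would serve as the initial weights of the feedforward network that part (2) then adjusts with labels and BP.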

The Mobile Advertisement CTR Predicting Model Based on DBN
Our mobile advertisement CTR predicting model mainly contains three steps:
(1) The pretreatment of the original sample feature data.
(2) Using the DBN to build valid deep-level features from the original ones.
(3) Taking the newly built feature vectors as the input of the Logistic Regression (LR) model and predicting the CTR.
The framework of the whole model is shown in Figure 4.

Computational Intelligence and Neuroscience
Compared to shallow models, DBNs are known to suffer from overfitting due to their large number of parameters. Training these parameters also requires a large amount of layer-by-layer computation, which greatly slows down the training process when the dimension of the samples is too high. To mitigate this problem, we apply the normalization method described in Section 4.3.1 to each feature in the whole dataset, which decreases the influence of feature values whose frequencies are extremely low.
In the DBN, the activation function is the sigmoid function, which exploits the nonlinear characteristics of the deeper model. Pretreated samples enter the model through the visible layer; after passing through the hidden layers, they become deeper detected features at the top of the model. Finally, we feed these deeper detected features into the LR model and, using the label information, obtain the prediction of the CTR. The output layer of the model contains one sigmoid unit which, together with the units in the top hidden layer of the DBN, constitutes the LR model. The activation probability of the unit in the output layer is as follows:
$$P(y = 1 \mid h) = \sigma\Big(\sum_{j} w_j h_j + b\Big) = \frac{1}{1 + \exp\big(-\sum_{j} w_j h_j - b\big)} \tag{8}$$
In formula (8), $y$ is the state of the output unit, $h_j$ is the state of unit $j$ in the top layer of the DBN, $w_j$ is the tied weight between the output unit and unit $j$ in the top layer of the DBN, and $b$ stands for the offset value of the output layer.
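The output unit described here is a logistic regression over the top-layer features, which the following small sketch makes concrete. The function name `predict_ctr` and all numeric values are made up for illustration and are not taken from the trained model.

```python
import math


def predict_ctr(top_features, weights, bias):
    """Sigmoid output unit: probability of a click given top-layer features."""
    z = bias + sum(w * h for w, h in zip(weights, top_features))
    return 1.0 / (1.0 + math.exp(-z))


h_top = [0.9, 0.1, 0.4]    # made-up activations of the top hidden layer
w_out = [1.2, -0.7, 0.3]   # made-up output weights
print(predict_ctr(h_top, w_out, bias=-2.0))
```

With all-zero weights and a zero offset, the unit outputs 0.5, so the learned weights and offset are what move the estimate toward the observed click rate.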

Experiments and Results
In this section, we design experiments to test our model and analyze the results.

4.2. Dataset.
The dataset [15] we use is the Click-Through Rate Prediction Competition Dataset from the Kaggle platform. The original data comes from Avazu. The training dataset contains the information of over 40 million mobile advertisements over 10 days. We use the 4,122,995 samples from the first day and split them into a training set and a testing set at a ratio of 9 to 1. That is to say, every single sample has a 90% chance of being assigned to the training set and a 10% chance of being assigned to the testing set. In the end, we get 3,710,696 samples in the training set and 412,299 in the testing set.
The analysis of the click results of the whole dataset is shown in Figure 5.
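The per-sample random assignment described above can be sketched as follows. The function name, the seed, and the synthetic sample list are assumptions for illustration; the actual splitting code is not given in the text.

```python
import random


def split_train_test(samples, train_fraction=0.9, seed=0):
    """Assign each sample independently to the training or testing set."""
    rng = random.Random(seed)
    train, test = [], []
    for sample in samples:
        (train if rng.random() < train_fraction else test).append(sample)
    return train, test


# Toy usage on synthetic sample indices, not on the real dataset.
train, test = split_train_test(range(100000))
print(len(train), len(test))
```

Because every sample is assigned independently, the resulting sizes are only approximately in a 9 to 1 ratio; an exact ratio would instead require shuffling the samples and cutting at a fixed index.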

The Pretreatment of the Dataset.
Before we begin our experiments, we first apply some pretreatment to the original dataset. Table 1 gives a detailed description of the dataset: the ID is the unique identifier of a sample and is not used as an input of the model, and the click is the label of the sample.
As for the other features, first, the continuous features are divided into intervals and then converted to category features by numeric coding. For example, we divide the feature hour into 24 different hours and code them from 0 to 23. Second, for the category features, we count the frequency of each feature value; Table 2 shows the frequencies of the different values of the feature banner_pos.
At last, we normalize all features and use them as the input vector of the DBN. There are 22 features, so the length of the input vector is 22. Take the feature banner_pos as an example; using the frequencies in Table 2, we can normalize this feature value with the following formula:
$$x_i^{*} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \tag{9}$$
where $x_i^{*}$ is the normalized value of the $i$th feature value $x_i$, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum feature values, respectively.
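A minimal sketch of this pretreatment for one categorical column is given below, assuming that each value is first replaced by its frequency and the frequencies are then min-max normalized. The helper names and the toy column are hypothetical and do not reproduce Table 2.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each categorical value by how often it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]


def min_max_normalize(values):
    """Scale values to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


banner_pos = [0, 0, 1, 0, 1, 7, 0, 1]   # made-up column, not real data
freqs = frequency_encode(banner_pos)     # [4, 4, 3, 4, 3, 1, 4, 3]
print(min_max_normalize(freqs))
```

In this toy column the rare value 7 maps to 0.0 and the most frequent value maps to 1.0, which is how the normalization reduces the influence of values with extremely low frequencies.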

Experimental Setup.
To avoid overfitting, we adopt the Momentum Method and Weight-Decay Strategy to train the model; in the momentum update, $\Delta w(t-1)$ is the value of the previous update. To accelerate the convergence of training, we adopt the minibatch method to train the model. Each batch contains 500 samples. The training parameters of the model are shown in Table 3.
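Since the update formula itself is not reproduced here, the following sketch shows one common textbook form of a momentum update with weight decay, together with minibatching at 500 samples per batch. The exact form used in the experiments may differ, and the learning rate, momentum, and weight-decay values are placeholders rather than the values in Table 3.

```python
def momentum_step(w, grad, prev_delta, lr=0.1, momentum=0.9, weight_decay=1e-4):
    """One update: delta(t) = momentum * delta(t-1) - lr * (grad + decay * w)."""
    delta = momentum * prev_delta - lr * (grad + weight_decay * w)
    return w + delta, delta


def minibatches(samples, batch_size=500):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Toy problem (assumed for illustration): minimize (w - 3)^2,
# whose gradient with respect to w is 2 * (w - 3).
w, delta = 0.0, 0.0
for _ in range(200):
    w, delta = momentum_step(w, 2.0 * (w - 3.0), delta)
print(round(w, 3))
```

Here `grad` is the gradient of a loss to be minimized; for the log likelihood in RBM training, which is maximized, the sign of the gradient term would be flipped.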

Experiment.
To further validate our research results and analyze the performance of our fusion model based on DBN, our experiments have four goals.
(1) To analyze the effect of different numbers of hidden layers on the prediction of CTR.
(2) To analyze the effect of different numbers of units in the hidden layers on the prediction of CTR.
(3) To analyze the influence of different numbers of epochs on the prediction of CTR.
(4) To compare the estimation accuracy of our fusion model with that of the LR and SVR models.

Evaluating Indicator.
The curve in AUC usually refers to the Receiver Operating Characteristic (ROC) curve, which is commonly used to evaluate the performance of a two-class classifier. In a binary classification problem, a prediction falls into one of the four situations shown in Table 4.
The x-axis of the ROC curve represents the false positive rate, while the y-axis represents the true positive rate. For a classifier that is no worse than random guessing, AUC values range from 0.5 to 1, with a higher value indicating better prediction performance.
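As an illustration of this indicator, the AUC can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that definition on made-up labels and scores; it is quadratic in the number of samples and is meant only to show the idea, not to be used on millions of records.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 0]             # made-up click labels
scores = [0.9, 0.3, 0.4, 0.5, 0.1]   # made-up predicted CTRs
print(auc(labels, scores))
```

In this toy example five of the six positive-negative pairs are ordered correctly, so the result is 5/6.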

The Influence of Different Numbers of Hidden Layers.
In this section, we carry out an experiment on the influence of the number of hidden layers in the DBN. We set the number of units in every hidden layer to 256, and the result is shown in Figure 6.
Figure 6 shows that the AUC of the prediction result rises noticeably as the number of hidden layers increases, but drops rapidly when the number exceeds 4, which is due to overfitting. This suggests that, up to a point, the more hidden layers the DBN has, the more complete the feature learning is. To avoid overfitting, the number of hidden layers therefore has to be determined experimentally.

The Influence of Different Numbers of Units on Hidden Layers.
The number of units in the hidden layers of a DBN is extremely important, because it strongly influences the performance of the network and is a direct cause of overfitting. However, there is no universal method in theory for choosing this number. So, in this part, we carry out experiments on its influence. The parameter settings are shown in Table 3.
First, we investigate the influence of the number of units in the first layer of the DBN by fixing the number of units in the other layers to 256, based on experience, and setting the number in the first layer to the geometric progression $2^{3+n}$ ($n \in \mathbb{N}^{+}$, $n \le 7$), that is, 16, 32, ..., 1024. The result is shown in Figure 7.
With the increase of units in the first hidden layer, the AUC increases, but overfitting occurs again when the number of units exceeds 512. This indicates that 512 is the best value: it is 0.43% better than the default value of 256 and 30.61% better than the worst value of 16. Then, we carry out the same experiment on the other hidden layers using the parameters in Table 3, and the results are shown in Figures 8-10. Figures 8-10 show that the AUC for each of the four layers reaches its highest value when the number of units is set to 512. So, we conclude that the best number of units in every hidden layer of our model is 512. Too many units in the hidden layers may cause overfitting and make the model overly sensitive and unstable. The overall accuracy of the model increased by 1.99%, from 0.6988 to 0.7127, after we changed the default parameters (two hidden layers with 256 units in each layer) to the best ones (four hidden layers with 512 units in each layer), which clearly indicates the importance of parameter adjustment.

The Influence of Different Numbers of Epochs.
Figure 11 shows the AUC of DBNs and LR for different numbers of epochs. We can see that, as the number of epochs increases, the DBNs perform better and better until the number of epochs reaches 150, after which the AUC plateaus. This indicates that the DBNs have been trained sufficiently.
We also perform an experiment on LR for comparison, the parameters of which are shown in Table 5. From Figure 11, we can see that the prediction of DBNs is better than that of LR when the number of epochs exceeds 30. Moreover, unlike LR, the performance of DBNs does not go down as the number of epochs increases after sufficient training. This is because the LR model is a shallow model, which overfits when the number of epochs becomes too large.
From Figure 12, we can see that the estimation accuracy of our fusion model is 5.57% better than that of Logistic Regression (LR) and 5.80% better than that of Support Vector Regression (SVR). This is because the DBN has advantages in detecting latent features, which represent the essence of a sample better than shallow models do.
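As a check on how the percentage gains in this section can be read, the following worked calculation assumes that they are relative changes in the evaluation score (an interpretation, not something the text states explicitly). Under that assumption, the 1.99% gain quoted for moving from the default to the best parameters follows from the two reported values:

```latex
\[
\frac{0.7127 - 0.6988}{0.6988} = \frac{0.0139}{0.6988} \approx 0.0199 = 1.99\%
\]
```

The same reading would apply to the 5.57% and 5.80% gains over LR and SVR, although the underlying scores for those comparisons are only given in Figure 12.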

Conclusions
To solve the problem that the key features of advertisement data are not detected completely and the nonlinear relationships between different features cannot be fully reflected in traditional Click-Through Rate (CTR) estimation models, we propose a new fusion model based on Deep Belief Nets to predict the CTR of mobile advertising. Our model takes advantage of the latent feature detecting ability of the DBN and the simplicity of traditional Logistic Regression models. We also describe the derivation and training method of the model and design experiments to examine the effect of the number of hidden layers, as well as the number of units in them, on the prediction result. Results show that our fusion model achieves better estimation accuracy, with a 5.57% increase compared to the classic Logistic Regression (LR) model and a 5.80% increase compared to the Support Vector Regression (SVR) model.

Conflicts of Interest
The authors declare that they have no conflicts of interest.