Advertisement Click-Through Rate Prediction Based on the Weighted-ELM and Adaboost Algorithm

Accurate click-through rate (CTR) prediction can not only improve an advertising company's reputation and revenue, but also help advertisers optimize their advertising performance. Two main problems of CTR prediction remain unsolved: low prediction accuracy due to the imbalanced distribution of the advertising data, and the lack of a real-time advertisement bidding implementation. In this paper, we develop a novel online CTR prediction approach for real-time bidding (RTB) advertising with the following strategies: a user profile system is constructed from the historical RTB advertising data to describe the user features, the historical CTR features, the ID features, and the other numerical features; and a novel CTR prediction approach is presented to address the imbalanced learning sample distribution by integrating the Weighted-ELM (WELM) and the Adaboost algorithm. Compared to commonly used algorithms, the proposed approach improves the CTR prediction performance significantly.


Introduction
With the development of network and communication technology, the Internet and the mobile Internet have developed rapidly. Due to the popularity of smartphones, a variety of mobile phone applications have been invented. In this market, advertisers and advertising companies pay close attention to the click-through rate (CTR) of online advertising products. Online advertising is usually done in two different ways. One is website search based advertising, in which the search engine uses the user's keywords to target the advertising content and the advertising spot. The other is real-time bidding (RTB) advertising, in which the advertising supply platform no longer provides merely the advertising spot, but the specific users who visit the advertisement spot. RTB advertisements improve the targeting precision and accuracy of online advertising [1].
Currently, there exist many research works on CTR prediction for Internet advertising. Menon et al. [2] proposed a maximum likelihood algorithm to estimate the parameters of a probabilistic CTR model, but this model can only be applied to existing advertisements rather than new advertisements. Richardson et al. [3] proposed a logistic regression model to learn the CTR prediction model for search advertising, with model features including the number of keywords, the position of the figures on the page, and other characteristics of the advertisements. Chapelle [4] proposed a stochastic regression approach based on a rate estimation machine learning framework at Yahoo! to solve the CTR prediction problem, using four features as the model inputs; a norm-2 regularization term is added to the logistic regression model, which produces a sparser model with fewer nonzero parameters and avoids the overfitting problem. Shao [5] proposed a high-level feature representation and a click-through prediction method based on a deep neural network model that combines the high-level features and the basic features.
Most existing work on CTR prediction is focused on search advertising, which depends heavily on the keywords of the user input. With the development of intelligent terminals and the mobile Internet, RTB advertising is increasing rapidly. More and more advertisers are in favor of RTB advertising, which will become the main trend of Internet advertising in the future. At the same time, the research work on RTB CTR prediction is still at the beginning stage.
In this paper, we study the novel big data based online CTR prediction problem by incorporating RTB advertising with a user profile system. A novel CTR prediction approach is presented by integrating the Weighted-ELM (WELM) and the Adaboost algorithm to address the imbalanced learning sample distribution. We perform experiments using real advertising datasets to verify the effectiveness of the proposed approach.

The Experimental Dataset and the Evaluation Criteria
In this section, the experimental dataset and the evaluation criterion used in this study, the Area Under Curve (AUC), are briefly described.
The experimental dataset used in this paper for CTR prediction is the original data log provided by a domestic advertising company in China. There are 16 attributes in the original data log, with the details shown in Table 1.

User Profile.
Since the advertising log contains a large amount of data, we divide the above 16 attributes into 4 categories: the user's characteristics, the temporal characteristics, the ID characteristics, and the numerical characteristics.

The User's Characteristics.
In early practice, when the demand side platform received a bidding request from the advertising agent, the user's information was normally not analyzed and all users were targeted for advertising. It has been shown that this way of delivery cannot achieve the desired results, as the u id and media id attributes used in this approach cannot cater to the users' interests. Thus the primary task is to establish a user profile system to obtain the user's age, gender, and interest preference for CTR prediction. The overall structure of the system is shown in Figure 1. The user profile system mainly includes the following functions:
(i) Data pretreatment subsystem: responsible for cleaning and preprocessing the advertising log data.
(ii) Keyword split service: responsible for segmenting the irregular text.
(iii) Knowledge base: responsible for providing the related mapping tables.
(iv) User graph subsystem: the most important part of the system, responsible for integrating the various parts of the data to build a user graph.
(v) Data storage subsystem: responsible for storing the results of the user graph.
The output of the user graph system includes the user's age, gender, and interest preference. The users' characteristics are obtained by using the i id attribute to match the output of the user graph system.

The Time Characteristics.
The time characteristics include the push time field in the log, which represents the time of the ad request. According to the historical data, users have different interests in different time periods, so the probability of a click behavior also differs. Based on this observation, we split one day into six time periods: late-night, morning, lunch time, afternoon, dinner time, and evening. The entire time information is organized as a six-dimensional vector. The six time periods are shown in Table 2.
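As a minimal sketch of this encoding, the hour of the push time can be mapped to a six-dimensional one-hot vector; the afternoon (13:00-18:00), dinner time (18:00-20:00), and evening (20:00-23:00) boundaries appear in the paper, while the remaining cut points are assumptions for illustration:

```python
# Map an hour of the day (0-23) to a six-dimensional one-hot period vector.
# Order: late-night, morning, lunch time, afternoon, dinner time, evening.
# Boundaries of the first three periods are assumed, not taken from the paper.
PERIODS = [(23, 7), (7, 11), (11, 13), (13, 18), (18, 20), (20, 23)]

def period_one_hot(hour):
    vec = [0] * 6
    for i, (start, end) in enumerate(PERIODS):
        if start < end:
            hit = start <= hour < end
        else:  # the late-night period wraps past midnight
            hit = hour >= start or hour < end
        if hit:
            vec[i] = 1
            break
    return vec
```

For example, an ad request pushed at 14:00 falls into the afternoon period and is encoded with a 1 in the fourth position.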

The ID Characteristics.
The ID characteristics in the dataset include the u id, the advertiser id, the media id, the area id, the c id, the policy id, and the exchange id. There are many ID attributes in the RTB advertising logs. Without a filtering process for these characteristics, we would obtain a feature vector whose dimension may be up to several hundred thousand, which increases the computational complexity seriously. Therefore, it is necessary to reduce the dimensionality of the feature space. We apply the method in [3] to remove the needless ID attributes that have no or little impact on the click-through rate.

The Numerical Characteristics.
Attributes in the dataset such as the price base, the price win, the URL, and the u ip affect the advertising's CTR as well. Take the price win for example: if the value is 0, the bidding for the advertisement was not successful; if the value is nonzero, different values reflect different values of the advertising click. It is usually considered that the larger the value is, the better the advertising position is and the greater the probability of a click. Therefore the numerical attributes need to be added to the feature vector.
In this paper, we adopt the max-min normalization method to normalize each characteristic to a value between 0 and 1.
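A minimal sketch of this max-min scaling, assuming the samples are rows and the characteristics are columns of a matrix:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each characteristic (column) to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)  # guard constant columns against 0/0
    return (X - mn) / span
```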

Area Under Curve (AUC).
The prediction of the CTR is a binary classification problem in which the proportion of positive and negative samples is extremely uneven. In actual advertising, the proportion of positive to negative samples is about 3 : 1000 or even lower. Since the samples are distributed unevenly among the categories, classification accuracy is not a good criterion to judge the performance of the classifier.
In this paper, the AUC is adopted to measure the effect of the CTR prediction. The curve underlying the AUC is the ROC (receiver operating characteristics) curve [6]. Traditionally the ROC curve was used in the medical field; currently it is often used in data mining, machine learning, and pattern recognition.
When the ROC curve is drawn, the horizontal coordinate is the FPR (False Positive Rate) and the vertical coordinate is the TPR (True Positive Rate). The values of FPR and TPR can be calculated according to formula (1):

TPR = TP / (TP + FN),   FPR = FP / (FP + TN).   (1)

In (1), TP is the number of positive samples that the algorithm recognizes as positive; FP is the number of negative samples that the algorithm recognizes as positive; FN is the number of positive samples that the algorithm recognizes as negative; TN is the number of negative samples that the algorithm recognizes as negative [7].
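The two rates above, and the AUC itself, can be sketched directly from the confusion counts and the predicted scores; the rank-based AUC shown here is the standard Mann-Whitney form, not necessarily the exact procedure used in the paper:

```python
import numpy as np

def tpr_fpr(tp, fp, fn, tn):
    """True and false positive rates from the confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn)

def auc(scores, labels):
    """Probability that a random positive sample outscores a random negative one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```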
It is obvious that if more users click an advertisement, the advertisement will be ranked closer to the front and the area under the ROC curve will be larger, indicating better advertising performance.
As an example, we draw the receiver operating characteristics (ROC) curves for the exchange id, the area id, the media id, and the advertiser id using the Weighted-ELM. The AUC value of each curve is shown in Table 3.
From Table 3, we can see that the AUC values of the exchange id and the advertiser id are almost 0.5, which is no different from a random result. This phenomenon is related to the characteristics of RTB advertising: RTB advertisers do not want their own click conversion data to be used to optimize other advertisers' effectiveness.
Compared to the AUC value of the advertiser id, the AUC value of the media id is slightly higher, up to 0.60. This is related to the user's interest, which the media id can reflect: if a user visits a few apps frequently, the probability of clicking the ads in them increases.

The CTR Assessment
In this section, the ELM algorithm, which is used in the prediction of the CTR, is discussed. Compared with traditional classification algorithms such as SVM and BP, the ELM has the advantages of fast learning speed and accurate estimation results, with the weights being easy to set. Based on these advantages, the ELM algorithm has developed rapidly since it was proposed several years ago. Because the proportion of positive and negative samples is extremely uneven, we adopt the Weighted-ELM algorithm to address this problem in the next subsection. Because the ELM is the basis of the Weighted-ELM algorithm, we first describe the original ELM in the following.

The ELM Algorithm.
In recent years, Huang et al. [8-10] and other scholars proposed a fast algorithm for single-hidden layer feedforward neural networks named the extreme learning machine (ELM) [11, 12]. The specific structure of the ELM algorithm is shown in Figure 2.
The input weights and the biases of the hidden nodes in the ELM are chosen randomly. They do not need a series of iterative updates, which greatly saves training time. For N training samples (x_j, t_j), the output of an ELM with L hidden nodes is

f(x_j) = Σ_{i=1}^{L} β_i g(w_i · x_j + b_i) = o_j,  j = 1, 2, ..., N.

In this equation, g(w_i · x_j + b_i) is the hidden layer node activation function, usually the sig, sin, hardlim, or tribas function; w_i is the connection weight vector between the i-th hidden node and the input nodes; b_i is the bias of the i-th hidden node; β_i is the connection weight between the i-th hidden node and the output node.
In the practical application of the algorithm, the output value of the network should be equal or close to the actual output value. If the network can approximate the N training samples with zero error, we get Σ_{j=1}^{N} ‖o_j − t_j‖ = 0, and the ELM equations can be abbreviated as

Hβ = T,

where H is the output matrix of the neural network hidden nodes, β is the output weight matrix between the hidden layer nodes and the output layer node, and T is the target output matrix.
The main idea of the algorithm is to find the output weight matrix β that makes the training error ‖Hβ − T‖² and the norm of the output weights ‖β‖ minimum. The smallest-norm least-squares solution is

β̂ = H†T,

where H† is the generalized inverse matrix of H.
If H is not of full column rank, H† can be obtained by the singular value decomposition (SVD) [5, 13].

The Weighted-ELM Algorithm.
The basic ELM algorithm is very useful for many problems. However, there exist many classification problems whose samples are imbalanced, such as the advertising click rate problem. In order to solve the problem of sample imbalance in classification, Xu et al. proposed the Weighted-ELM algorithm [14].
The objective function of the ELM algorithm can be written as

minimize (1/2)‖β‖² + C (1/2) Σ_{i=1}^{N} ξ_i²,
subject to h(x_i)β = t_i − ξ_i,  i = 1, 2, ..., N.  (5)

The first half of formula (5) is called the structural risk, and the latter part is called the empirical risk. The objective function of the Weighted-ELM algorithm adds a weight matrix to the empirical risk:

minimize (1/2)‖β‖² + C (1/2) Σ_{i=1}^{N} W_ii ξ_i²,
subject to h(x_i)β = t_i − ξ_i,  i = 1, 2, ..., N,

where W is an N × N diagonal matrix whose entries are related to the training samples. Generally, if x_i belongs to a minority class, the corresponding W_ii should be given a relatively large weight. There are two methods for choosing the values of W. The first method is

W_ii = 1 / #(t_i),

where #(t_i) is the number of samples in the class of x_i. The second method is

W_ii = 0.618 / #(t_i)  if #(t_i) is larger than the average class size,  and  W_ii = 1 / #(t_i)  otherwise.
The process of training the Weighted-ELM is equivalent to solving the above optimization problem. Similar to the original ELM, β can be solved in two ways. When N is small,

β = Hᵀ (I/C + W H Hᵀ)⁻¹ W T;

when N is large,

β = (I/C + Hᵀ W H)⁻¹ Hᵀ W T.

The output of the Weighted-ELM classifier is given by

f(x) = sign(h(x)β).
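A sketch of this training procedure under the first weighting scheme (per-class inverse-frequency weights) and the small-N form of the solution; the sine activation, C, and hidden layer size are illustrative assumptions:

```python
import numpy as np

def welm_train(X, t, C=1.0, L=100, seed=0):
    """Weighted-ELM for labels t in {-1, +1} with W_ii = 1 / #(class of x_i)."""
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], L))
    b = rng.standard_normal(L)
    H = np.sin(X @ W_in + b)
    counts = {c: np.sum(t == c) for c in np.unique(t)}
    w = np.array([1.0 / counts[c] for c in t])  # diagonal of the weight matrix W
    Wd = np.diag(w)
    N = len(t)
    # beta = H^T (I/C + W H H^T)^(-1) W T   (form used when N is small)
    beta = H.T @ np.linalg.solve(np.eye(N) / C + Wd @ H @ H.T, Wd @ t)
    return W_in, b, beta

def welm_predict(X, W_in, b, beta):
    return np.sign(np.sin(X @ W_in + b) @ beta)
```

The inverse-frequency weights make each minority-class sample contribute as much to the empirical risk as many majority-class samples, which is what counteracts the 3 : 1000 click imbalance.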

WELM-Adaboost Algorithm
This paper constructs the advertisement click rate prediction model using the proposed WELM-Adaboost algorithm, which can adjust the weights of the data distribution.

The Advertisement Click Rate Prediction Model Based on the WELM-Adaboost.
In this paper, the Weighted-ELM is used as a weak predictor, and the weight distribution of each sample is adjusted by using the Adaboost algorithm to obtain multiple Weighted-ELM classifiers. These classifiers are combined into a strong classifier [14].
The advertisement click rate prediction process based on the WELM-Adaboost algorithm is shown in Figure 3.
The detailed steps of the algorithm are as follows: (1) From the sample data, randomly select N samples as the training data. According to the distribution ratio of the positive and negative samples, initialize the weight of each training sample.
(2) For each iteration t = 1 : T, where T is the total number of weak classifiers, the algorithm repeats the following steps (a) to (e): (a) apply the training samples to a classifier ELM_t(x) with the current sample weights w_i; (b) calculate the weighted prediction error of ELM_t(x) from the weights of the misclassified samples:

err_t = Σ_{i=1}^{N} w_i I(y_i ≠ ELM_t(x_i)) / Σ_{i=1}^{N} w_i;

(c) calculate the weight α_t of ELM_t(x) according to its classification performance:

α_t = ln((1 − err_t) / err_t) + ln(K − 1);

(d) update the sample weights, increasing the weights of the misclassified samples; (e) renormalize the sample weights.
(3) After T iterations, T weak predictors are obtained. These weak predictors are merged into the final strong predictor H(x):

H(x) = arg max_{k=1,...,K} Σ_{t=1}^{T} α_t I(ELM_t(x) = k),

where K is the number of categories of the samples.
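The loop above can be sketched for the binary case with a pluggable weak learner (a Weighted-ELM in the paper); note the classifier weight here uses the standard binary form α_t = ½ ln((1 − err_t)/err_t) rather than the multiclass form:

```python
import numpy as np

def adaboost_fit(X, y, weak_fit, weak_predict, T=10):
    """Binary Adaboost: y in {-1, +1}; weak_fit(X, y, w) -> model."""
    N = len(y)
    w = np.full(N, 1.0 / N)                        # (1) initialize sample weights
    models, alphas = [], []
    for _ in range(T):
        m = weak_fit(X, y, w)                      # (a) train on weighted samples
        pred = weak_predict(m, X)
        err = np.sum(w * (pred != y)) / np.sum(w)  # (b) weighted prediction error
        err = min(max(err, 1e-10), 1 - 1e-10)      # keep the log finite
        alpha = 0.5 * np.log((1 - err) / err)      # (c) classifier weight
        w = w * np.exp(-alpha * y * pred)          # (d) up-weight mistakes
        w = w / w.sum()                            # (e) renormalize
        models.append(m)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(X, models, alphas, weak_predict):
    """(3) Weighted vote of the weak predictors."""
    score = sum(a * weak_predict(m, X) for m, a in zip(models, alphas))
    return np.sign(score)
```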

The Experimental Results
The experimental dataset used in this paper is the RTB advertisement raw log data provided by a domestic advertisement company in Beijing, China. Since the data is very large and the positive and negative samples are seriously imbalanced, we randomly extract 1‰ of the data from the log as the experimental data. Click samples are recorded as positive; the other (nonclick) samples are negative. The proportion of positive to negative samples in the experimental data is almost 3 : 1000, which makes it a typical imbalanced dataset. The statistics of the experimental data are shown in Table 4. In the table, Impression num means the number of nonclick samples and Click num means the number of click samples.

The CTR Prediction Model.
From the above feature extraction process, we can conclude that the CTR of RTB advertisements has a strong relationship with the users' interests and basic attributes, and little relationship with most of the ID characteristics. Finally, we select the temporal characteristics and the user characteristics, together with the media id, the area id, the price base, and the price win, as the inputs of the prediction model based on the proposed method.
It is necessary to explore the influence of the number of the hidden nodes and the activation function on the speed and the accuracy of the ELM algorithm.
The ELM algorithm provides four kinds of activation functions. From Figure 4, we can see that, with the same number of hidden nodes, the sine activation function achieves an AUC value about 5% higher than the other three activation functions. In addition, the training speed with the sine function is slower than with the sigmoid and tribas functions, but faster than with the hardlim function. Considering the training time and the equipment cost, the number of hidden nodes is set to 500 and the activation function is set to the sine function.
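For reference, the four activation functions named above can be written as follows; these are their common definitions (tribas is the triangular basis function), which may differ slightly from the exact implementation the experiments used:

```python
import numpy as np

def sig(x):
    return 1.0 / (1.0 + np.exp(-x))            # sigmoid

def hardlim(x):
    return (np.asarray(x) >= 0).astype(float)  # hard limit (step)

def tribas(x):
    return np.clip(1.0 - np.abs(x), 0.0, 1.0)  # triangular basis

ACTIVATIONS = {"sig": sig, "sin": np.sin, "hardlim": hardlim, "tribas": tribas}
```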

The Comparison of the Algorithms' Performance.
We select the logistic regression (LR) model and the support vector machine (SVM) model, which are commonly used in other papers, as the comparison methods; the AUC values of the three algorithms are shown in Table 5.
Table 5 shows that the performance of the ELM is better than that of LR and SVM on all the tested datasets, which indicates that we have chosen reasonable characteristics and that the ELM algorithm is effective as well.
Finally, we select the traditional ELM algorithm and the Weighted-ELM algorithm as contrast methods; with the ratio of positive to negative samples set to different proportions, the trend of the AUC results of the three algorithms is shown in Figure 5.
It can be seen from Figure 5 that when the ratio of positive to negative samples is 1 : 5, the AUC values of all three algorithms can reach 0.9 or more. When the positive and the negative

Figure 2: The structure of the ELM neural network.

The Adaboost Algorithm.
The Adaboost algorithm is one of the typical applications of the Boosting framework. The Adaboost algorithm chooses the most important features to construct a series of weak classifiers and cascades these weak classifiers into a stronger classifier. The advantage of this algorithm is that it uses weighted training data instead of randomly selected training samples, and that it combines the weak classifiers with a weighted voting mechanism instead of an average voting mechanism.


Figure 4: The trend of the AUC value with different activation functions.

Table 1: Description of the experimental dataset.

Table 2: Information of a whole day.

Table 3: The AUC value of each ID attribute.

Table 4: The statistics of the experimental dataset.