Predicting Click-Through Rates of New Advertisements Based on the Bayesian Network

Most classical search engines choose and rank advertisements (ads) based on their click-through rates (CTRs). To predict an ad's CTR, historical click information is frequently used. Accurately predicting the CTR of a new ad is challenging and critical for real-world applications, since plentiful historical data about such ads is not available. Adopting the Bayesian network (BN) as an effective framework for representing and inferring dependencies and uncertainties among variables, in this paper we establish a BN-based model to predict the CTRs of new ads. First, we build a Bayesian network over the keywords that are used to describe the ads in a certain domain, called the keyword BN and abbreviated as KBN. Second, we propose an algorithm for approximate inference on the KBN to find keywords similar to those that describe a new ad. Finally, based on the similar keywords, we obtain the similar ads and calculate the CTR of the new ad from the known CTRs of those similar ads. Experimental results show the efficiency and accuracy of our method.


Introduction
Search engines have become an important means for finding information on the Internet today. Most classical search engines are funded through textual advertising placed next to their search results. Search engine advertising has become a significant element of the Web browsing experience [1]. The advertising revenue of the three classical search engines, Google, Yahoo, and Bing, reaches at least 25 billion dollars per year and is still rising gradually [2]. Search engine advertising usually uses the keyword auction business model, in which advertisers pay for keywords. The primary means is pay-per-click (PPC) with cost-per-click (CPC) billing, which means that the search engine is paid every time the ad is clicked by a user. An ad's popularity can be measured by its CTR, by which the ads on a search engine are generally ranked. The expected revenue from an ad is a function of CTR and CPC: CTR × CPC. The probability that a user clicks on an ad declines rapidly, by as much as 90%, with the display position.
Choosing the right ads for a query, and the order in which they are displayed, greatly affects the probability that a user will see and click on each ad [1]. So, this ranking has a strong impact on the revenue that a search engine receives from ads. Further, showing users ads that they prefer to click on can improve user satisfaction. Thus, accurately predicting the CTR of an ad is critical for maximizing revenue and improving user satisfaction.
In recent years, CTR prediction has received wide attention in the academic community of computational advertising. For example, Agarwal et al. [3] divided the CTR prediction process into two stages: keywords are first organized into predefined conceptual levels, and then the CTR for each different region is calculated at each level. Chakrabarti et al. [4] predicted the CTR based on logistic regression and employed a multiplicative factorization to model the interaction effects among several regions. Regelson and Fain [5] predicted the CTR at the term level and adopted hierarchical clusters for low-frequency or completely novel terms. Dembczynski et al. [6] predicted the CTR by means of decision rules. Xiong et al. [7] designed a model based on continuous conditional random fields that considers both the features of an ad and its similarity to the surrounding ones. Actually, considering the accurate recommendation of ads in line with their characteristics, CTR prediction depends on both users and advertisements. Meanwhile, from the behavioral targeting point of view, CTR can be increased by learning users' preferences, which has also been intensively studied recently [8-11]. Wang and Chen [12] developed several machine learning algorithms, including conditional random fields, support vector machines, decision trees, and back propagation neural networks, to learn users' behaviors from search and click logs in order to predict the CTR.
In general, the above methods are only suitable for ads that have plentiful historical click logs, which excludes new ads (those without plentiful historical click logs). It is known that advertising impressions and clicks follow a mathematical relationship known as the power-law distribution [3], and the search keyword frequency follows the power-law distribution as well [5]. This means that, in a given period of time, a large number of keywords will be involved in only a small number of search behaviors, and a small number of search behaviors implies a small number of potential impressions. Thus, the CTRs of a large number of new ads are unknown, and their prediction is not trivial, since we do not have plentiful historical data about these ads. This is exactly the focus of this paper.
It is natural to consider the uncertainties in ads and CTR prediction, and uncertainty-related mechanisms have indeed been incorporated into CTR prediction in recent years [13-15]. For example, Graepel et al. [2] presented a dichotomy method to predict the sponsored search advertising CTR and an online Bayesian probability regression algorithm. Chapelle and Zhang [16] proposed a model of users' browsing behavior to make CTR predictions. However, the above methods are not well suited to CTR prediction for new ads. Fortunately, it is well known that the Bayesian network (BN) is an effective framework for representing and inferring uncertainties among random variables [17]. A BN is a directed acyclic graph (DAG), where nodes represent random variables and edges represent dependencies among these random variables. Each variable in a BN is associated with a conditional probability table (CPT) that gives the probability of each of its values conditioned on the states of its parents. By means of BN-based probabilistic inferences, unknown dependencies can be obtained under the current situation. Thus, in this paper, we adopt the BN as the underlying framework for predicting the CTRs of new ads. For this purpose, we have to address the following two problems: (1) constructing a BN from the keywords that are used to describe the ads; (2) providing an efficient inference mechanism to predict the probability distributions of all possible values for the given keywords that describe the new ad.
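To make the BN machinery concrete, the following minimal sketch (independent of the paper's implementation; the keyword names and CPT values are invented for illustration) represents a small discrete BN as a dictionary of parents and CPTs, and evaluates the joint probability of one complete assignment by multiplying CPT entries:

```python
# A tiny binary BN: each node stores its parent list and a CPT mapping
# parent-value tuples to P(node = 1 | parents). All values are illustrative.
bn = {
    "k1": {"parents": [], "cpt": {(): 0.6}},                   # P(k1 = 1)
    "k2": {"parents": ["k1"], "cpt": {(0,): 0.2, (1,): 0.7}},  # P(k2 = 1 | k1)
    "k3": {"parents": ["k1"], "cpt": {(0,): 0.1, (1,): 0.5}},
    "k4": {"parents": ["k2", "k3"], "cpt": {(0, 0): 0.05, (0, 1): 0.4,
                                            (1, 0): 0.5, (1, 1): 0.9}},
}

def prob(node, value, assignment):
    """P(node = value | parent values taken from a full assignment)."""
    parents = tuple(assignment[p] for p in bn[node]["parents"])
    p1 = bn[node]["cpt"][parents]
    return p1 if value == 1 else 1.0 - p1

# The joint probability of a complete assignment factorizes over the CPTs:
# P(k1, k2, k3, k4) = P(k1) * P(k2|k1) * P(k3|k1) * P(k4|k2,k3).
a = {"k1": 1, "k2": 1, "k3": 0, "k4": 1}
joint = 1.0
for n in bn:
    joint *= prob(n, a[n], a)
print(round(joint, 4))  # 0.6 * 0.7 * 0.5 * 0.5 = 0.105
```

The factorization over CPTs is what makes BN inference tractable in practice: only local conditional probabilities need to be stored.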
To construct the BN from the keywords used to describe the ads, we first find the keywords appearing in both the user queries and the ads. If the same keyword appears in the user queries and the ads simultaneously, then the ads associated with this keyword may be clicked by users. So, we use these keywords as the BN's nodes, where the edges between nodes describe the relationships between similar keywords. The constructed BN is called the keyword BN, abbreviated as KBN.
To predict the CTR of a new ad, we can use similar ads' CTRs. To obtain similar ads, we first use probabilistic inferences on the KBN to obtain similar keywords. This makes the BN reasonably serve as the underlying model of probabilistic inferences for predicting the CTRs of new ads. Many algorithms for exact inference of BNs have been proposed [18], but these methods run in exponential time, which is not efficient enough for BN-based inferences, especially over large-scale keyword BNs. Thus, based on Gibbs sampling [18], we propose an approximate inference algorithm for the KBN to obtain the probability distributions used to find similar keywords and ultimately predict the new ad's CTR.
Generally, the main contributions of this paper are as follows.
(1) We propose an efficient method to construct the KBN from the keywords that describe the given ads, as the basis of probabilistic inferences and CTR prediction.
(2) We propose an algorithm for approximate inference on the KBN to predict the probability distributions of possible values for the known keywords, and correspondingly give the idea for predicting the new ad's CTR.
(3) We implement the proposed algorithms and conduct preliminary experiments to test the feasibility of our method.
The remainder of this paper is organized as follows. In Section 2, we construct a KBN from the given keywords. In Section 3, we infer the probability distributions of the known keywords and then predict the new ad's CTR. In Section 4, we show experimental results and performance studies. In Section 5, we conclude and discuss future work.

Keyword Similarity Model
2.1. Basic Idea and Definitions. Search engine advertising usually uses keywords to describe ads, and advertisers pay for the cost of these keywords. Thus, in this paper, we use sets of keywords to describe ads and user queries, defined as follows.
Definition 1. Let Q = {q_1, q_2, ..., q_m} denote the set of keywords appearing in user queries, and let A = {A_1, A_2, ..., A_n} denote the sets of keywords of the ads, where A_i is the keyword set of the i-th ad.

If the same keyword k appears in Q and A simultaneously, then the ads associated with k may be clicked by users. To depict the associations among k, Q, and A, we focus on all such keywords, denoted as

V_K = {k | k ∈ Q and k ∈ A_i for some A_i ∈ A}.   (1)

As mentioned in Section 1, we adopt the graphical model as the preliminary framework, where nodes correspond to the keywords in V_K. Actually, (1) describes the cooccurrence of the keywords in the user queries and the ads whose CTRs are unknown (i.e., are to be predicted). This means that V_K reflects the inherent functionality or sense that may be requested by users. We look upon this as the similarity of keywords. To establish a model for globally representing the qualitative and quantitative similarity among keywords, as well as the uncertainty of that similarity, we first recall the basic concept of a BN.
A BN is a DAG G = (V, E), in which the following hold [19].

(1) V is the set of random variables and makes up the nodes of the network.

(2) E is the set of directed edges connecting pairs of nodes. An arrow from node X to node Y means that X has a direct influence on Y (X, Y ∈ V and X ≠ Y).

(3) Each node has a CPT that quantifies the effects that the parents have on the node. The parents of node X are all those nodes that have arrows pointing to it.
Based on the definition of a general BN, we now define the keyword Bayesian network.

Definition 2. A keyword Bayesian network (abbreviated as KBN) is a pair B = (G_K, P), defined as follows.

(1) G_K = (V_K, E) is the structure of the KBN, where V_K = {k_1, k_2, k_3, ..., k_n} is the set of nodes. Each node in the DAG corresponds to a keyword, and each node takes two values (1 and 0). The value 1 indicates that the keyword is in the keyword set of ad A_i, and the value 0 indicates that it is not. If k_i → k_j holds, then we call k_i a parent node of k_j, and we use Pa(k_j) to represent the set of parent nodes of k_j.

(2) P is a set of conditional probabilities, consisting of each node's CPT.

2.2. Constructing KBN from Ads' Keywords.
To construct the KBN from the given keywords is to construct the DAG and calculate each node's CPT. Without loss of generality, the critical and difficult step in KBN construction is constructing the DAG, which is consistent with general BN construction [18]. Each node's CPT can then be calculated easily based on the constructed DAG. The set of directed edges connecting pairs of nodes in the KBN describes the relationships between similar keywords. This means that, to describe these relationships, we have to address the following two problems: (1) whether there is a similarity relationship between two keywords, namely, whether there is an edge between the corresponding two nodes; (2) how to determine the edge's direction.
For problem (1), we consider the ratio of the number of ads that contain both of two specific keywords to the number of ads that contain at least one of them. The direct similarity of the two keywords is established upon this ratio: as the ratio increases, the similarity between the two keywords increases correspondingly. If the ratio is higher than a certain threshold, we regard there to be an undirected edge between these two keywords. Therefore, we use sim(k_i, k_j) to define the strength of the relationship between keywords k_i and k_j as follows:

sim(k_i, k_j) = N(k_i = 1, k_j = 1) / (N(k_i = 1) + N(k_j = 1) − N(k_i = 1, k_j = 1)),   (2)

where N(k_i = 1, k_j = 1) denotes the number of ads that contain both keywords k_i and k_j, and N(k_i = 1) + N(k_j = 1) − N(k_i = 1, k_j = 1) denotes the number of ads that contain k_i or k_j.

Let δ be the given similarity threshold, which can be determined based on empirical tests or specific experience. If sim(k_i, k_j) > δ, we believe that k_i and k_j are similar. That is, there should be an undirected edge between the nodes corresponding to these two keywords.
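The similarity ratio just described can be computed directly from the ads' keyword sets. The following is a minimal sketch; the ad contents and threshold are invented for illustration:

```python
# Each ad is a set of keywords (illustrative data). The similarity of two
# keywords is |ads containing both| / |ads containing at least one|,
# i.e. the Jaccard coefficient of their ad sets.
ads = {
    "a1": {"k1", "k2"},
    "a2": {"k1", "k3"},
    "a3": {"k1", "k2", "k3"},
    "a4": {"k2", "k4"},
}

def sim(ki, kj):
    both = sum(1 for kws in ads.values() if ki in kws and kj in kws)
    either = sum(1 for kws in ads.values() if ki in kws or kj in kws)
    return both / either if either else 0.0

# k1 and k2 co-occur in a1 and a3; one of them occurs in all four ads.
print(sim("k1", "k2"))  # 2 / 4 = 0.5

delta = 0.4  # illustrative threshold
if sim("k1", "k2") > delta:
    print("undirected edge between k1 and k2")
```

With the threshold delta = 0.4, the pair (k1, k2) would receive an undirected edge, while a pair that never co-occurs (e.g. k3 and k4 here) has similarity 0 and receives none.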
For problem (2), we consider any two nodes connected by an edge. We can compare the probability that k_j also appears when k_i appears, denoted as P(k_j | k_i), with that of the opposite case, denoted as P(k_i | k_j). This makes it possible to obtain the direction of the edge between k_i and k_j, which reflects the dominant keyword in the mutual dependence between these two keywords. Specifically, P(k_j | k_i) and P(k_i | k_j) can be calculated as follows:

P(k_j | k_i) = N(k_i = 1, k_j = 1) / N(k_i = 1),   P(k_i | k_j) = N(k_i = 1, k_j = 1) / N(k_j = 1).   (3)

Likelihood estimation [18] is commonly used to estimate the parameters of a statistical model by counting frequencies from tuples. We adopt this method to compute the CPT for each variable upon the KBN structure, where the probability is

P(k_i | Pa(k_i)) = (the number of ads that contain k_i and Pa(k_i)) / (the number of ads that contain Pa(k_i)).   (4)

It is worth noting that the cost of computing the CPTs in the KBN by (4) depends on the execution time of querying the sample data. This means that (4) can be evaluated in different amounts of time under different mechanisms for storage and query processing of the sample data. In particular, we adopted MongoDB in our implementation, and the efficiency of CPT computation is tested by the experiments given in Section 4.
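The edge-direction comparison above can be sketched as follows. The ad sets are invented for illustration, and ties are broken arbitrarily here (the direction of a tied edge is not specified by the comparison itself):

```python
# Orient an edge between two keywords by comparing P(kj | ki) and P(ki | kj),
# both estimated by counting over the ads (illustrative data).
ads = [{"k1", "k2"}, {"k1", "k3"}, {"k1", "k2", "k3"}, {"k2", "k4"}]

def count(*kws):
    """Number of ads containing all the given keywords."""
    return sum(1 for a in ads if all(k in a for k in kws))

def orient(ki, kj):
    p_j_given_i = count(ki, kj) / count(ki)  # P(kj = 1 | ki = 1)
    p_i_given_j = count(ki, kj) / count(kj)  # P(ki = 1 | kj = 1)
    # Edge goes from the keyword whose presence better predicts the other.
    return (ki, kj) if p_j_given_i >= p_i_given_j else (kj, ki)

# k1 appears in 3 ads, k3 in 2 ads, both together in 2 ads:
# P(k3 | k1) = 2/3 < P(k1 | k3) = 1, so the edge is directed k3 -> k1.
print(orient("k1", "k3"))
```

Because exactly one direction is chosen per undirected edge by a fixed comparison, orienting all edges this way cannot create a directed cycle between any single pair of nodes.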
Example 3. Let Q be a set of user queries and let A = {A_1, ..., A_7}, where A_7 = {k_2, k_4}, be the keyword sets of the ads with known CTRs.

By Step 1 in Algorithm 1, we compare the set Q and the set A to get the nodes in the DAG: k_1, k_2, k_3, and k_4.

3.1. Approximate Inferences of KBN.
In order to predict the CTR of a new ad, we need to find the keywords of the ads with known CTRs that are similar to the keywords of the new ad. In this way, the similarity of ads' functionality or sense can be inferred and discovered. Consequently, we can use the similarity between the keywords of the new ad and those of the ads with known CTRs to find the ads that are related to the new ad.
Although we can obtain the keywords with direct similarity relationships by the ideas presented in Section 2.2, there are still many desirable indirect similarity relationships between different keywords, which is also the motivation for deriving the KBN. Thus, we make use of KBN inferences to find the indirect similarity relationships between the keywords of the new ad and those of the ads with known CTRs. For this purpose, we calculate the conditional probability of a keyword (KBN node) of the new ad given the known keywords as evidence in the KBN inference.
It is known that Gibbs sampling is a Markov chain Monte Carlo algorithm that generates a Markov chain of samples [20-22]. Gibbs sampling is particularly well adapted to sampling the posterior distributions of a BN [23]. To infer the probability distributions of all possible values, we extend the Gibbs sampling algorithm to KBN inference and give the approximate inference method in Algorithm 2. The basic idea is as follows. (1) Initialization: the new ad's keywords are located in the KBN, and the query is expressed as P(k = 1 | e), where k is the target (i.e., a keyword of the new ad) and e is the evidence. (2) The nonevidence nodes, that is, all the keywords in the KBN except the new ad's keywords, are sampled randomly. From the CPTs of the KBN, the conditional probabilities of the nonevidence nodes can be obtained given their Markov blankets, respectively. Thus, a new state is reached. This process is iterated until the given sampling-time threshold is reached. (3) A set of samples is generated, and the corresponding probability distributions can be computed. The desired probability P(k = 1 | e) is thereby obtained as the answer to the given query.
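The sampling procedure above can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 2: the four-keyword network and its CPT values are invented, and the Markov-blanket weights are computed by brute force over each node's parents and children:

```python
import random

# Illustrative binary network: parents and CPTs mapping parent-value tuples
# to P(node = 1 | parents). All numbers are invented for the example.
bn = {
    "k1": {"parents": [], "cpt": {(): 0.6}},
    "k2": {"parents": ["k1"], "cpt": {(0,): 0.2, (1,): 0.7}},
    "k3": {"parents": ["k1"], "cpt": {(0,): 0.1, (1,): 0.5}},
    "k4": {"parents": ["k2", "k3"], "cpt": {(0, 0): 0.05, (0, 1): 0.4,
                                            (1, 0): 0.5, (1, 1): 0.9}},
}
children = {n: [c for c in bn if n in bn[c]["parents"]] for n in bn}

def prob(node, value, state):
    p1 = bn[node]["cpt"][tuple(state[p] for p in bn[node]["parents"])]
    return p1 if value == 1 else 1.0 - p1

def gibbs(target, evidence, n_samples=20000, seed=0):
    """Estimate P(target = 1 | evidence) by Gibbs sampling."""
    rng = random.Random(seed)
    state = dict(evidence)
    nonevidence = [n for n in bn if n not in evidence]
    for n in nonevidence:                 # random initialization
        state[n] = rng.randint(0, 1)
    hits = 0
    for _ in range(n_samples):
        n = rng.choice(nonevidence)
        # P(n = v | Markov blanket) is proportional to
        # P(n = v | parents) * product over children of P(child | its parents).
        w = []
        for v in (0, 1):
            state[n] = v
            p = prob(n, v, state)
            for c in children[n]:
                p *= prob(c, state[c], state)
            w.append(p)
        state[n] = 1 if rng.random() * (w[0] + w[1]) < w[1] else 0
        hits += state[target]             # count states with target = 1
    return hits / n_samples

# Exact answer for this network is 0.5575; the estimate should be close.
print(round(gibbs("k4", {"k1": 1}), 2))
```

As the convergence result below states, the fraction of visited states with the target value approaches the true posterior as the number of samples grows.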
By Algorithm 2, we can obtain the keywords with direct or indirect similarity relationships. If P(k_i = 1 | k_j = 1) > δ, then we regard keyword k_i as similar to k_j. For a KBN with n nodes, the number of invocations of KBN inference in Algorithm 2 is O(n^2). It has been concluded [18, 23, 24] that, as long as enough samples are drawn, Gibbs sampling always converges to a stationary distribution. In particular, Russell et al. [18] have given the following conclusion, which guarantees that the estimate of posterior probabilities returned by a Gibbs sampling algorithm converges to the true answer if the total number of samples is large enough. It is stated as Theorem 4 and also holds for Algorithm 2 when inferring the KBN.
Theorem 4. Let q(x → x′) be the probability that the process makes a transition from state x to state x′. Let π_t(x) be the probability of being in state x at time t, and let π_{t+1}(x′) be the probability of being in state x′ at time t + 1. Given π_t(x), we can compute π_{t+1}(x′) by summing over states:

π_{t+1}(x′) = Σ_x π_t(x) q(x → x′).   (5)

We say the process has reached its stationary distribution if π_t = π_{t+1}, which means that the sampling process has converged and

π(x′) = Σ_x π(x) q(x → x′)   for all x′.   (6)

This guarantees the convergence and effectiveness of Algorithm 2 theoretically. In the following, we illustrate the execution of Algorithm 2 by an example.
Example 5. By Step 1 in Algorithm 2, the nonevidence nodes k_2 and k_3 are initialized randomly, which gives the current state. The following steps are executed repeatedly.

By Step 2 in Algorithm 2, k_2 is sampled given the current state of its Markov blanket variables, and we obtain the probabilities of k_2 being sampled as 1 and as 0, respectively. If we suppose the random number r_2 = 0.50, then the result will be 1 and the new state will be [k_1 = 1, k_2 = 1, k_3 = 0, k_4 = 1]. Then k_3 is sampled given the current state of its Markov blanket variables, and we obtain the probabilities of k_3 being sampled as 1 and as 0, respectively. If we suppose r_3 = 0.10, then the sampling result will be 0 and the new state will remain [k_1 = 1, k_2 = 1, k_3 = 0, k_4 = 1]. By Step 3 in Algorithm 2, if the process visits 60 states with k_4 = 0 and 140 states with k_4 = 1, then we obtain P(k_4 = 1 | k_1 = 1) = 140/(60 + 140) = 0.7, which is the answer to the given query.

3.2. KBN-Based CTR Prediction.
Based on the results of the KBN inferences given in Section 3.1, we can obtain the set of similar ads described by the similar keywords. Consequently, we can predict the CTR of the new ad by using the similar ads' CTRs, which are already known. Let S = {s_1, s_2, ..., s_n} denote the set of ads similar to the target new ad, and let C = {c_1, c_2, ..., c_n} denote the corresponding CTRs of the ads in S. The CTR of the new ad can be calculated as follows:

CTR_new = (1/n) Σ_{i=1}^{n} c_i.   (7)

This means that we simply use the average CTR of the similar ads (with known CTRs) as the prediction of the new ad's CTR. The idea is illustrated by the following example.

Example 6. Suppose k_2, k_3, and k_4 are similar keywords of k_1 by Algorithm 2. If the new ad has only the keyword k_1, then we can find the similar ads A_3 and A_4 that contain k_1, k_2, k_3, and k_4. So, we can use the CTRs of A_3 and A_4, denoted c_3 and c_4, to predict the CTR of the new ad by (7), and we have CTR_new = (c_3 + c_4)/2.
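The averaging step in (7) is straightforward to implement. A minimal sketch follows, with invented ad identifiers and CTR values:

```python
# Predict a new ad's CTR as the mean of the known CTRs of its similar ads,
# as in (7). The ad IDs and CTR values below are illustrative.
similar_ctrs = {"a3": 0.042, "a4": 0.018}

def predict_ctr(ctrs):
    """Average the known CTRs of the similar ads (equation (7))."""
    return sum(ctrs.values()) / len(ctrs)

print(predict_ctr(similar_ctrs))  # (0.042 + 0.018) / 2
```

A natural refinement, not explored in the paper, would be to weight each similar ad's CTR by its similarity score instead of taking the plain mean.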

Experimental Results
To test the feasibility of the ideas proposed in this paper, we implemented our methods for constructing and inferring the KBN, as well as that for predicting the new ad's CTR. We mainly tested the efficiency of KBN construction; the efficiency, correctness, and convergence of KBN inference; and the accuracy of KBN-based CTR prediction.
In the experiments, we adopted the test data from KDD Cup 2012, Track 2 [25]. All the data sets were stored in MongoDB, and all the code was written in Python 2.7. The machine configuration was as follows: Pentium(R) Dual-Core CPU E5700 @ 3.00 GHz with 2 GB main memory, running the Ubuntu 12.04 64-bit operating system. According to (1), we merged all the ads' keywords and obtained 27512 keywords in total as the original test dataset.

4.1. Efficiency of KBN Construction.
We chose 10, 20, ..., 100 keywords from the original dataset and tested the execution time of KBN construction. For each KBN, we recorded the average time over 10 test runs. The execution times of DAG construction and CPT calculation, as well as the total time, are shown in Figure 2 on a logarithmic scale. It can be seen that the execution time of DAG construction is much smaller than that of CPT calculation, and the total time mainly depends on that of CPT calculation. This is consistent with the actual situation, since the CPT calculation of each node involves many statistical computations (i.e., aggregation queries on MongoDB), whose number is exponential in the number of the node's parents. The above observation makes our method for KBN construction feasible for small scales of keywords under hardware configurations like our experimental environment. Basically, the execution time of KBN construction increases linearly with the number of nodes when the number of keywords is not larger than 100. Thus, from the efficiency point of view, the above observations suggest that KBN construction can be further improved by incorporating data-intensive computing techniques for the aggregation query processing. This is exactly our future work.

4.2. Precision, Efficiency, and Convergence of KBN Inference.
To test the precision of KBN inference, we ran tests on the KBN in Example 3. For all possible specific values of the same evidence and target variables in the KBN in Figure 1, we compared the inference results obtained by Algorithm 2 with those obtained by Netica [26], which were adopted as the criteria for the precision of KBN inference. The error of KBN inference was defined as the difference between the above two inference results for each pair of evidence and target. The inference results based on Netica, those based on Algorithm 2, and the corresponding errors are shown in Table 1. It can be seen that the maximal, minimal, and average errors are 10.9%, 7.8%, and 9.15%, respectively, which is generally acceptable and can be regarded as basically precise. This means that the KBN inference results are precise to a certain extent, which verifies the applicability and feasibility of the KBN and its approximate inference algorithm for finding similar keywords.

To test the efficiency and convergence of Algorithm 2, we adopted the KBN constructed from the test dataset. With the increase of the number of sampling iterations, we recorded the execution time of Algorithm 2 and the inference results under different numbers of keywords, shown in Figures 3 and 4, respectively. It can be seen from Figure 3 that the execution time of Algorithm 2 increases linearly and, under the same sample size, is not sensitive to the increase of the number of keywords. It can be seen from Figure 4 that the inference results converge to a stable value rapidly, after fewer than 4000 sampling iterations. The observations from Figures 3 and 4 show that Algorithm 2 for KBN inference is scalable and efficient to a certain extent.

4.3. Accuracy of CTR Prediction.
We randomly selected 500 ads from the test dataset and compared the ads' CTRs predicted by (7) in Section 3.2 with the real CTRs of these ads, where each ad's real CTR was evaluated by the well-adopted metric clicks/impressions (i.e., the average number of clicks per impression). To test the accuracy of the CTR prediction, we selected 10 ads, for each of which we recorded the ad ID, real CTR (denoted as rCTR), predicted CTR (denoted as pCTR), and the difference between rCTR and pCTR (denoted as Error), as shown in Table 2. The maximum, minimum, and average errors of the predicted CTRs are 54%, 0.7%, and 19.6%, respectively. Thus, we conclude that the method for CTR prediction proposed in this paper can be used to predict a new ad's CTR accurately on average. Improving the general accuracy of the CTR prediction is also part of our future work.

Conclusions and Future Work
Predicting CTRs for new ads is extremely important and very challenging in the field of computational advertising. In this paper, we proposed an approach for predicting the CTRs of new ads by using the other ads with known CTRs and the inherent similarity of their keywords. The similarity of the ad keywords establishes the similarity of the semantics or functionality of the ads. Adopting the BN as the framework for representing and inferring associations and uncertainty, we proposed methods for constructing and inferring the keyword Bayesian network. Theoretical and experimental results verify the feasibility of our methods.
To make our methods applicable in realistic situations, we will incorporate data-intensive computing techniques to improve the efficiency of the aggregation query processing when constructing the KBN from large-scale test datasets. Meanwhile, we will also improve the performance of KBN inference and the corresponding CTR prediction. In this paper, we assume that all of the new ad's keywords are included in the KBN; this is the basis for exploring methods for the situation where not all of the keywords are included (i.e., some keywords of the new ads are missing). Moreover, we can further explore accurate user targeting and CTR prediction based on the ideas given in this paper. These are exactly our future work.

Algorithm 2 (sampling step). For t ← 1 to T do: (i) randomly select one nonevidence variable k_i from V_K and compute the probabilities of its values based on the state s^(t−1): r ← P(k_i = 0 | MB(k_i)) + P(k_i = 1 | MB(k_i)), where MB(k_i) is the set of the values of the Markov blanket of k_i in the current state; (ii) generate a random number r′ ∈ [0, r] and determine the value of k_i accordingly.

In (3), N(k_i = 1, k_j = 1) denotes the number of ads that contain both k_i and k_j, and N(k_i = 1) denotes the number of ads containing k_i. So, we can compare P(k_j | k_i) and P(k_i | k_j): if P(k_j | k_i) > P(k_i | k_j), then the edge should be directed from k_i to k_j, since the dependence of k_j on k_i is larger than that of k_i on k_j. According to the above discussion, we summarize the method for KBN construction in Algorithm 1. It should be noted that no cycles will be generated when executing Algorithm 1, since the direction of the edge between k_i and k_j depends on P(k_j | k_i) and P(k_i | k_j), where either P(k_j | k_i) > P(k_i | k_j) or, exclusively, P(k_i | k_j) > P(k_j | k_i) holds. For a DAG with n nodes, the similarity computations and direction decisions (Steps 5-9) will be done O(n^2) times. Example 3 illustrates the execution of Algorithm 1.

Algorithm 1 (KBN construction, interface). Input: Q = {q_1, q_2, ..., q_m}, a set of query keywords; A = {A_1, A_2, ..., A_n}, the sets of the known ads' keywords; δ, the threshold of the similarity of two keywords (0 ≤ δ ≤ 1). Output: G = (V, E), the DAG of the KBN. Variable: n, the number of nodes in the KBN. If P(k_j | k_i) ≥ P(k_i | k_j), then E ← E ∪ {k_i → k_j} (by (3)).

Table 2: Correctness of CTR prediction.