Feature Dual Supervision Model for the Searches of Online Advertising Audiences

Online advertising has become one of the most important strategies used by companies, which obtain valuable results from Internet marketing and communication strategies. Therefore, it is necessary to study click-through rate (CTR) models to search for potential audiences in online advertising. In programmatic advertising, advertisers wish to find potential candidates among a large number of audience queries. Facing such a large corpus, the most common method is to use a two-tower model to learn representations of user queries and ads, and then apply a similarity function to match the feature representations and obtain the potential audiences related to the ad. However, there is a lack of information interaction between the two towers during feature extraction, resulting in the loss of details in the representations. To alleviate this problem, we propose a novel model named Feature Dual Supervision Model (FDSM), which integrates a Feature Expression Unit (FEU) and a Feature Supervision Unit (FSU). The FEU extracts ad or user features, and the FSU generates a weight vector to supervise the working process of the FEU. In addition, we propose a feature cross-layer with bridge connections in FDSM to achieve effective feature interaction between ad and user representations. Finally, we conduct experiments on the Tencent Lookalike and MovieLens datasets. The experimental results indicate that FDSM outperforms other state-of-the-art CTR prediction models in audience expansion.


Introduction
With the rapid development of the Internet, online advertising provides a common marketing experience when people access services using intelligent devices. Online advertising refers to advertisements displayed in media [1]. Different from traditional advertising, it has formed a technology-driven delivery model that is audience-targeted and product-oriented. Given an ad and its historical audience (seed users), audience expansion aims to find potential audiences for the ad that are similar to the seed users. For example, using the user's searched keywords, topics, visit history, interests, and so on, the programmatic advertising system can accurately find the potential audiences for an ad through audience targeting technology.
Programmatic advertising (PA) refers to a form of advertising that applies technology to serve advertising trading and management in the big data field. Advertisers can programmatically purchase media resources, applying algorithms and technologies to automatically reach precise target audiences [2]. PA technology analyzes millions of ad records in real time, which enables PA ads to accurately reflect the interests of users at the exact moment when they are most likely to click on an ad. Therefore, programmatic advertising is a new marketing technology built on the Internet and emerging technologies.
In a programmatic advertising system or recommendation system, click-through rate (CTR) is an important metric, defined as the predicted probability that a user will click a displayed ad or recommended item on a web page [3]. The system then determines whether the ad or item will be displayed on the user's page based on this metric. In online advertising, the CTR prediction result has a great influence on the effect of advertising. Therefore, the accuracy of CTR prediction is a key factor affecting advertising effect, user experience, and platform revenue.
To improve the performance of CTR models, effective feature crossing is the most commonly applied optimization method. Early studies focused on designing and utilizing effective combination features, such as FM [4] and FFM [6]. These models rely on expert experience to explore explicit interactions between features, which inevitably costs a lot of labor in industry. However, current large-scale recommendation involves a large number of original features and potential high-order interaction features, which makes it difficult for expert-driven feature engineering to comprehensively cover all interaction patterns in the feature space, thus limiting the application of shallow models in industry.
Over the past decade, a large number of CTR prediction models based on deep learning have been applied to explore higher-order implicit information in the feature space. In this paper, we focus on optimizing the performance of the two-tower model. This model was first applied in the NLP domain; the typical architecture is DSSM [5]. The input of this model is a high-dimensional term vector of a query or a document. DSSM passes the two inputs through two separate neural networks and maps them into semantic vectors in a shared semantic space. For Web document ranking, DSSM computes the relevance score between a query and a document with the cosine similarity function and ranks documents by their similarity scores to the query.
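The ranking step described above can be illustrated with a minimal sketch. It assumes the two towers have already produced semantic vectors; the toy vectors and the helper names `cosine` and `rank_documents` are hypothetical and are not taken from DSSM itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by decreasing cosine similarity to the query."""
    scores = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy semantic vectors standing in for the outputs of the two towers (assumed values).
query = [1.0, 0.0]
docs = {"d1": [0.0, 1.0], "d2": [1.0, 1.0], "d3": [2.0, 0.0]}
ranking = rank_documents(query, docs)
```

Because the score depends only on the two final vectors, the towers never exchange information before this step, which is the limitation the rest of the paper addresses.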
Despite its great promise, the two-tower model still has some problems. In the online retrieval service, the feature vectors of the user's query and the ad are fed separately into two different neural networks, each generating a highly condensed vector representation, which leads to the loss of some detailed information and a lack of information interaction between the two towers. To overcome this shortcoming, we propose a Feature Dual Supervision Model (FDSM) based on the two-tower model to enhance feature extraction capability and provide more fine-grained information at the feature cross-layer. Its network structure is summarized as follows:
Users/Ads Feature Expression: During feature extraction in the user/ad tower, a feature expression unit is applied, which is made of multiple neural networks. Their structures can be the same or different, but the dimensions of their output vectors must be the same. At the same time, a feature supervision unit is proposed to monitor the process of feature extraction. Specifically, the feature vector of the user/ad is fed into the feature expression unit to obtain multiple representations, and the supervision unit then gives a score for every representation. Finally, a single expression feature is obtained based on all the representations and their scores. In this paper, a fully connected network is used as the feature expression network.
Feature Cross-Layer: As in the ordinary two-tower model, feature interaction between the extracted user and ad representations is required in this layer. The difference in this paper is that the cosine function is no longer applied to measure the degree of match. Instead, a bridge connection module is proposed to combine the ad and user expression vectors, which are then fed to the network for feature interaction to perform CTR prediction.
The main contributions of this work are summarized as follows: (i) We propose a novel Feature Dual Supervision Model (FDSM), which enhances the ability to extract features from user and ad information and obtains highly expressive features. (ii) In the feature cross-layer, a bridge connection module is proposed to connect the extracted features, which achieves effective feature interaction and thus improves the CTR prediction performance of FDSM. (iii) We conduct experiments on two real datasets. The experimental results demonstrate that, with the feature supervision unit and the cross-layer with the bridge connection module, FDSM outperforms other state-of-the-art CTR prediction models in an audience expansion system.
This paper is organized as follows: Section 2 introduces the mainstream CTR models and their development. Section 3 presents the design details of the proposed FDSM model. Section 4 shows the details and results of the experiments. Section 5 summarizes the paper and discusses future work.

Related Works
In this section, we summarize general CTR models, introduce semantic matching models from the NLP domain, and then describe applications of the two-tower model in programmatic advertising and recommendation systems.
Early research focused on the design and utilization of effective combinatorial features, such as FM [4] and FFM [6]. These models mainly rely on expert experience to explore explicit interactions between features. In recent years, CTR prediction models based on deep learning have emerged to explore higher-order implicit information in the feature space. Deep learning-based CTR prediction models follow the pattern of "feature embedding & feature interaction." Representative models include Wide&Deep [7], DeepFM [8], DCN [9], PIN [10], DIN [11], PNN [12], and the two-tower model [5], which jointly learn explicit and implicit feature interactions and finally output matching information.
With the application of deep learning to natural language, many neural network models have been proposed to address semantic matching problems. These approaches are divided into two categories: representation-based learning and interaction-based learning. Models with a two-tower structure are typical of representation-based approaches, such as DSSM [5], CLSM [13], LSTM-RNN [14], and ACR-I [15]. Each tower uses a different neural network to generate a semantic representation of the query or document. A matching function, such as the inner product, is then applied to measure the similarity between the query and the document. The interaction-based approaches learn the complicated relevance patterns between queries and documents. The mainstream models are MatchPyramid [16], Match-SRNN [17], DRMM [18], and K-NRM [19].
In the field of advertising and recommendation, MV-DNN [20] extends the two-tower model to jointly learn from features of items from different domains and user features by introducing a multiview deep learning model, which can learn users' behavior patterns from rich user behavior features and improve the user experience of the web service. For advertising display systems, Baidu proposed MOBIUS [21], which is based on the two-tower model, to maximize CPM and reduce the difference between ranking and matching in the retrieval stage. However, the two-tower model suffers from a lack of information interaction between the towers, and the imbalance of category data also affects model performance. Therefore, the DAT model not only customizes an augmented vector for each query and item to mitigate the lack of information interaction, but also proposes a category alignment loss to align the item representations of uneven categories.

Methodology
In this section, we first define the problem of audience targeting and CTR prediction, and then illustrate our proposed model in detail.

Problem Formulation.
Given a seed set $S$ and a candidate set $C$, audience targeting aims to extend $S$ by selecting a set $T$ of $n$ users from $C$ (usually $|S| \ll |C|$), such that the potential users in $T$ are similar to those in $S$. In this problem, each user $u$ is usually represented by a low-dimensional dense vector that encodes the user's demographic profile and online behaviors [22]. To search for similar users based on a seed set, we apply CTR prediction methods. As a binary classification task, CTR prediction estimates the probability that a user will click an ad campaign or an item displayed in an online system. Specifically, given a training set containing $N$ samples $(X, y)$, we denote the input of a model as $X = \{x_1, x_2, \ldots, x_f\}$, which contains $f$ features. $X$ includes user features as well as ad features. The features can be categorical, such as gender or occupation, or continuous, such as the price of an item. $y \in \{0, 1\}$ is the label of a sample, where $y = 1$ indicates that the user gave positive feedback on an ad campaign, such as clicking on the advertisement, purchasing the product, or downloading the app; otherwise $y = 0$. Therefore, the CTR prediction model calculates the probability $P(y = 1 \mid X)$ for each instance $X$. Table 1 shows the notations in this paper.
Figure 1 shows the overall framework of our proposed Feature Dual Supervision Model (FDSM), which includes three modules: the feature embedding layer, the feature expression layer, and the feature cross-layer. The embedding layer transforms the instance $X$ into a low-dimensional dense vector. The feature expression layer extracts efficient feature representations of user and ad features, respectively. The purpose of the cross-layer is to discover relationships between features and predict the probability that the user will click the ad.

Embedding Layer.
The CTR prediction model based on deep learning follows the "feature embedding & feature interaction" paradigm [23]. The embedding module maps each feature of an instance to a $d$-dimensional embedding vector. For the $i$th field, the feature embedding vector $e_i$ is obtained from the embedding lookup table $E_i$ at the position given by $x_i$, where $x_i$ denotes the ordinal encoding of the $i$th field of instance $X$, $E_i \in \mathbb{R}^{S_i \times d}$ is the embedding matrix, and $S_i$ and $d$ are the size of the lookup table for the $i$th field and the embedding size, respectively. If the field is multivalent, the mean pooling of its feature embeddings is used as the field embedding representation, where $n$ is the number of feature values in the $i$th field. Therefore, the output of the embedding layer for an instance $X$ containing $f$ feature fields is the embedding matrix formed by the field embeddings $e_1, e_2, \ldots, e_f$. In this work, we divide an instance $X$ into two parts according to the characteristics of user and ad, denoted as $X_u = \{x_1, x_2, \ldots, x_\mu\}$ and $X_a = \{x_{\mu+1}, x_{\mu+2}, \ldots, x_{\mu+\nu}\}$, respectively, where $\mu + \nu = f$, as shown in the left part of Figure 1. The corresponding embedding representations of user and ad are obtained through the embedding layer, where "⊕" is the vector concatenation operation, $e_u$ and $\bar{e}_u$ are the embedding vector and mean pooling vector of user $u$, and $e_a$ and $\bar{e}_a$ are the embedding vector and mean pooling vector of ad $a$. The vectors $e_u$ and $\bar{e}_a$ are concatenated as the ad supervision vector $e_{ua}$, and the vectors $e_a$ and $\bar{e}_u$ are concatenated as the user supervision vector $e_{au}$: $e_{ua} = e_u \oplus \bar{e}_a$; $e_{au} = e_a \oplus \bar{e}_u$, where $e_{ua} \in \mathbb{R}^{(\mu+1)d}$ and $e_{au} \in \mathbb{R}^{(\nu+1)d}$.
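The embedding step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the field sizes, the random tables, and the helpers `mean_pool`, `concat`, and `embed` are hypothetical, and the mean pooling vector is assumed to average over the field embeddings, which is the reading consistent with the stated dimensions $(\mu+1)d$ and $(\nu+1)d$.

```python
import random

def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def concat(vectors):
    """Concatenate a list of vectors into one flat vector."""
    return [x for vec in vectors for x in vec]

d = 4                        # embedding size (toy value)
user_field_sizes = [3, 5]    # mu = 2 user fields (hypothetical table sizes)
ad_field_sizes = [2, 6, 4]   # nu = 3 ad fields (hypothetical table sizes)

rng = random.Random(0)

def make_table(size):
    """Random lookup table with `size` rows of dimension d."""
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(size)]

user_tables = [make_table(s) for s in user_field_sizes]
ad_tables = [make_table(s) for s in ad_field_sizes]

def embed(tables, ids):
    """Look up one row per field; a multivalent field (list of ids) is mean-pooled."""
    out = []
    for table, x in zip(tables, ids):
        if isinstance(x, list):
            out.append(mean_pool([table[i] for i in x]))
        else:
            out.append(table[x])
    return out

user_fields = embed(user_tables, [1, [0, 4]])  # second user field is multivalent
ad_fields = embed(ad_tables, [0, 3, 2])

e_u, e_a = concat(user_fields), concat(ad_fields)
e_u_bar, e_a_bar = mean_pool(user_fields), mean_pool(ad_fields)

e_ua = e_u + e_a_bar  # ad supervision vector, length (mu + 1) * d
e_au = e_a + e_u_bar  # user supervision vector, length (nu + 1) * d
```

Each supervision vector pairs the full embedding of one side with a compact summary of the other, which keeps the supervision input small while still exposing cross-side information.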

Features Expression Layer.
Users feature expression: The user feature expression part, shown in the middle of Figure 1, consists of the Feature Expression Unit (FEU) and the Feature Supervision Unit (FSU). The FEU is responsible for extracting user information from the user embedding vector $e_u$, while the FSU supervises this extraction process based on the user supervision vector $e_{au}$. Finally, a highly condensed representation vector of user information is obtained through both units. The details are as follows.
The FEU is made of multiple fully connected networks, each of which can extract a user representation from the embedding vector independently. Generally, a single network focuses only on partial information during extraction and cannot completely cover the characteristics of a user. To address this challenge, multiple fully connected networks are applied to jointly discover users' implicit features. In a single deep fully connected network in the FEU, each layer has the following formula: $h_l = \mathrm{ReLU}(W_l^{\top} h_{l-1} + b_l)$, where $h_{l-1} \in \mathbb{R}^{d_{l-1}}$ and $h_l \in \mathbb{R}^{d_l}$ are the $(l-1)$th and $l$th hidden layers, respectively; $W_l \in \mathbb{R}^{d_{l-1} \times d_l}$ is the weight matrix from the $(l-1)$th layer to the $l$th layer; and $b_l \in \mathbb{R}^{d_l}$ is the bias vector of the $l$th layer. In particular, for the first layer, $h_0 = e_u$. The user feature representation vector is the output of the last hidden layer of the $i$th fully connected network, and the matrix derived from the FEU with $m$ fully connected networks can be summarized as $h_L^i = f_i(e_u)$ and $A_u = [h_L^1, h_L^2, \ldots, h_L^m]$, where $f(\cdot)$ denotes a fully connected network, the output vector $h_L^i$ of the $i$th network is a representation of the user's features, the subscript "L" denotes the last hidden layer of the fully connected network, and the matrix $A_u \in \mathbb{R}^{d_L \times m}$ denotes the output of the FEU with $m$ networks.
The FSU is composed of a single fully connected network whose input is the user supervision vector $e_{au}$. Different from the FEU, ReLU is not applied in the last layer of this network; instead, the softmax activation function is applied to produce $w_u$, where $w_u \in \mathbb{R}^m$ is the user supervised weight vector, whose dimensionality equals the number of fully connected networks in the FEU. The softmax activation function in the last layer normalizes the output into a probabilistic representation. From the above, the FEU yields the matrix $A_u$ with $m$ representation vectors from the user embedding representation, and the FSU yields the $m$-dimensional user supervised weight vector $w_u$ from the user supervision vector. Finally, taking the outputs of both units, the supervision operation is $I_u = \frac{1}{m} \sum_{i=1}^{m} w_i h_L^i$, where $w_i$ is the $i$th element of $w_u$ and $I_u \in \mathbb{R}^{d_L}$ is the final representation of the user, which in this paper is called the user feature expression vector after supervision.
Ads feature expression: The method of extracting the ad representation is completely consistent with that of user feature expression. Here, the input of the FEU is the ad embedding vector $e_a$, while the input of the FSU is the ad supervision vector $e_{ua}$. The output matrix, supervised weight vector, and ad feature expression vector are obtained in the same way. In this module, we set $n$ as the number of fully connected networks in the FEU; the output matrix $A_a \in \mathbb{R}^{d_L \times n}$ then holds $n$ ad representation vectors, the ad supervised weight vector $w_a \in \mathbb{R}^n$ has dimension $n$, and $I_a \in \mathbb{R}^{d_L}$ is the ad feature expression vector.
As described above, the FEU is composed of multiple fully connected networks. The FEU extracts the feature representation matrix $A$ from the same user or ad based on equation (9). Different networks can focus on features of specific domains, but the number of feature tasks processed by multinetwork learning methods is limited [24,25]. Therefore, the FSU is used to generate the supervised weight vector $w$ to make a comprehensive judgment over the multiple feature representations. The specific calculation is shown in equation (11). The FSU generates a decision weight $w_i$ for every vector $h_L^i$ in $A$, and $h_L^i \times w_i$ is computed to obtain the evaluation vector. All vectors in $A$ are processed in the same way with the corresponding supervisory factors from the FSU, and the mean of all evaluation vectors is calculated. In this way, the FEU and FSU work together to enhance feature extraction and representation in online advertising systems.
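The supervision step lends itself to a short sketch. It assumes the FEU representations and the pre-softmax outputs of the FSU are already available; the toy values and the function names `softmax` and `supervise` are hypothetical. Following the description above, each representation is scaled by its softmax weight and the scaled vectors are then averaged.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervise(representations, fsu_logits):
    """Scale each FEU representation by its FSU weight, then average the results."""
    weights = softmax(fsu_logits)
    m = len(representations)
    dim = len(representations[0])
    return [
        sum(w * h[j] for w, h in zip(weights, representations)) / m
        for j in range(dim)
    ]

# Toy FEU output: m = 3 representations of dimension d_L = 2 (assumed values).
A_u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy FSU outputs before softmax, one score per FEU network (assumed values).
fsu_logits = [0.0, 0.0, 0.0]
I_u = supervise(A_u, fsu_logits)
```

With equal scores every network receives weight 1/3, so the result reduces to a plain average scaled by 1/3; unequal scores let the FSU emphasize the networks it judges more informative.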

Feature Cross-Layer.
Through the previous description, we obtained the feature expression vectors $I_u$ and $I_a$ for the user and the ad. Both carry important feature information and represent the user and the ad more effectively. The feature cross-layer is then designed to explore the relationship between a user and an ad, which plays an important part in obtaining high CTR prediction performance in an ad service system. Its structure is shown in the right part of Figure 1. The expression vectors of the user and the ad are passed through the bridge connection module, and the output vector is then fed to multiple fully connected networks to predict the user's click-through rate for a given ad. In this paper, the bridge connection concatenates the two expression vectors with their Hadamard product $I_u \odot I_a$, where the operation "⊙" represents the Hadamard product and the vector $I_b \in \mathbb{R}^{3d_L}$ is the output of the bridge connection unit. Finally, we use $k$ fully connected networks to form a feature cross-module. Similar to the feature expression, the last layer of each fully connected network in the feature cross-module has a single neuron, whose output value $h_L \in [0, 1]$ represents the probability of the user clicking the ad, and $b_L$ is a bias. The outputs of the $k$ networks in the cross-layer are then combined by the average operation into $\hat{y}$, the CTR prediction of the whole model.
In the two-tower model, the cosine function is applied to the representations of the ad features and the user features to calculate the CTR and obtain the potential audience related to the ad. The feature cross-layer with the bridge connection module proposed in this paper helps improve the accuracy of the model in three ways. First, the Hadamard product is used to calculate the matching degree between feature representations, which realizes lower-order feature crossing in advance. Second, the extracted user and ad feature vectors are taken as part of the input of the deep network, enabling the model to explore the higher-order implicit information among the features [7]. Finally, the result of the Hadamard product is spliced with the feature vectors of users and ads as the input of the network to form the bridge connection module, as shown in equation (13). In this way, the feature cross-layer can discover the potential relationships among low- and high-order features at the same time.
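A minimal sketch of the bridge connection and the averaged prediction follows. Several details are assumptions made for illustration: the concatenation order inside `bridge`, the toy vectors, and the use of a single linear layer with a sigmoid as a stand-in for each of the $k$ fully connected networks.

```python
import math

def bridge(i_u, i_a):
    """Concatenate the user vector, the Hadamard product, and the ad vector."""
    hadamard = [a * b for a, b in zip(i_u, i_a)]
    return i_u + hadamard + i_a  # ordering is an assumption; length is 3 * d_L

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_ctr(i_b, heads):
    """Average the outputs of k single-output heads, each given as (weights, bias)."""
    outputs = [
        sigmoid(sum(w * x for w, x in zip(weights, i_b)) + bias)
        for weights, bias in heads
    ]
    return sum(outputs) / len(outputs)

I_u = [1.0, 2.0]  # toy user expression vector, d_L = 2
I_a = [3.0, 4.0]  # toy ad expression vector
I_b = bridge(I_u, I_a)

# k = 2 hypothetical heads with zero weights, so each outputs sigmoid(0) = 0.5.
heads = [([0.0] * len(I_b), 0.0), ([0.0] * len(I_b), 0.0)]
y_hat = predict_ctr(I_b, heads)
```

The element-wise product supplies an explicit low-order match signal, while the raw vectors leave room for the downstream networks to learn higher-order interactions.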
The binary cross-entropy loss is widely used in the CTR prediction task and is defined as follows: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$ (16), where $N$ is the number of samples in the training set, and $y_i$ and $\hat{y}_i$ denote the ground truth and the predicted click probability, respectively. We define $\hat{y} = \sigma(\varphi(x))$, where $\varphi(x)$ represents the model function given input features $x$, which contain user and ad information, and $\sigma(\cdot)$ is the sigmoid function that maps the output to $[0, 1]$. The core of CTR prediction modeling lies in how to construct the model $\varphi(x)$ and learn its parameters from training data. In this work, the prediction $\hat{y}$ is computed by averaging the multiple predictions in the cross-layer.
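For reference, the binary cross-entropy over a batch can be computed as in the sketch below. The clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero and is not part of the paper's definition.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# An uninformative prediction of 0.5 for every sample gives a loss of ln(2).
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```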

The Discussion of Feature Dual Supervision.
Multiple-network models can jointly learn from different features, which can improve accuracy on the CTR prediction task [24,26,27]. Each fully connected network in the feature expression unit (FEU) has a different ability to extract information from different features, so multiple networks are used to extract the same user or ad features and obtain multiple representations, strengthening the expression ability of the FEU. Multiple-network learning is a promising method for learning relationships among different features. However, these approaches deal with a limited number of feature tasks [24,25]. Therefore, to alleviate this limitation and combine multiple representations, we propose a feature supervision unit (FSU). This unit consists of a single fully connected network that performs supervision with a supervision vector as input. In the feature expression layer of FDSM, the user and ad feature representations are extracted in the same way. When the user feature vector is extracted in the FEU, the inputs of the user's FSU are the full-volume information of the ad and the mean pooling features of the user; similarly, when the ad feature vector is extracted in the FEU, the inputs of the ad's FSU are the full-volume information of the user and the mean pooling features of the ad. Due to these input characteristics, the supervision operation has a two-level meaning. Firstly, while the feature vector of the user is extracted, the user supervision vector contains the full-volume information of the ad, and while the ad features are expressed, the input of the supervision unit contains the full-volume features of the user. In both cases the supervision draws on the features of the opposite side, which is what "dual" means.
This shows that the effective representation of users is influenced by the advertising information, and the effective representation of ads is influenced by the features that users care about, which is the first meaning of supervision.
Secondly, by adding the same-side mean pooling feature vector to the supervision vector, the FSU can discover the cross-information between users and ads in advance, making the supervision more sufficient, which is the second meaning of supervision. The underlying meaning as a whole is that users' behavioral decisions are made under the information of the ad, while the effective information of the ad is extracted with respect to the features that concern the user. Therefore, the fusion of the two levels constitutes dual supervision for feature expression.

Audience Expanded by FDSM.
There are many methods of audience targeting in online advertising, such as geotargeting. In this paper, we focus on user profiles and abundant behaviors to expand audiences. We train the FDSM model from an advertiser's point of view on a large collection of ad campaigns that involves a large number of seed and nonseed users. Specifically, given an ad $a$ and a candidate pool $C$, the potential user set $T \subset C$ is obtained according to the click rate, and the ad $a$ is then displayed to these users through the ad delivery system. To search for the potential audiences of an advertising campaign more efficiently, we first train the FDSM model on all advertising campaigns to obtain the prior model. After that, rich behavioral information of users, such as queried keywords and behavioral interests, is collected from the online advertising system. The prior model then undergoes microtraining on the new feedback log data of the ad $a$ to obtain the customized model. Finally, the potential audiences of the ad are found by using the customized model. The overall process is shown in Figure 2. The online advertising system collects a large amount of campaign data and caches it in the offline database. The offline data platform periodically processes the data generated in the past period and then uses the data to train FDSM to obtain the prior model. The online platform is responsible for processing the data of the recent period to get the feedback data of a certain ad campaign, and the customized model of the campaign is then obtained according to the feedback information. Finally, the customized model is used to retrieve potential audiences closely related to the campaign from the pool of candidate users in the data management platform (DMP), and the Top-N audiences ranked by CTR are selected to implement the campaign.
The system transmits the users' feedback logs about the ads through the data highway to both the offline database and the online feedback database, so that the whole system forms a closed-loop decision process.
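The final retrieval step, ranking the candidate pool by predicted CTR and keeping the Top-N users, can be sketched as follows. The function name `top_n_audience` and the toy scores are hypothetical; in the described system the scores would come from the customized model for the campaign.

```python
def top_n_audience(candidates, predict_ctr, n):
    """Rank candidate users by predicted CTR and keep the n highest."""
    return sorted(candidates, key=predict_ctr, reverse=True)[:n]

# Toy predicted CTRs standing in for the customized model's outputs (assumed values).
toy_scores = {"u1": 0.10, "u2": 0.85, "u3": 0.40, "u4": 0.72}
audience = top_n_audience(list(toy_scores), toy_scores.get, 2)
```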

Experiment
This section describes the experimental scheme in detail, including experimental datasets, comparison models, evaluation metrics, experimental details, comparison results, ablation study, and discussion.

Datasets.
Tencent Lookalike dataset (https://algo.qq.com/archive.html): This public dataset from the 2018 Tencent Ads competition is based on more than one hundred seed sets provided by advertisers, which contain a large number of user characteristics; the aim is to expand potential audiences for these campaigns. To ensure the security of service data, all data is desensitized. The whole dataset is divided into a training set and a testing set. Each advertisement has eight categorical features: ad ID, advertiser ID, campaign ID, creative ID, creative size ID, ad category, product ID, and product type. Each user has 19 features: age, gender, education, carrier, consumption ability, geographical location, house, type of Internet access, five groups of interest categories, three groups of topics, and three groups of keywords.
MovieLens dataset (https://grouplens.org/datasets/movielens/): This public dataset contains 6,040 users, each described by user ID, gender, age, occupation, and zip code, and 3,883 movies, each described by movie ID, title, and genres. Users rate movies on a 5-point scale, and a timestamp is recorded for each rating. In this study, to fit the audience targeting setting, we first cluster all the movies into 50 groups with the k-means method according to the movie genres and the normalized years, where the years are extracted from the movie titles. Each group is regarded as an ad group, and target audiences are found for each ad group. Meanwhile, to make the sample data suitable for the CTR prediction task, we convert the rating data into binary classification data [11]. Specifically, users who gave a rating of 4 or 5 are treated as seed users (labeled as 1), and the rest are recorded as nonseed users (labeled as 0). Finally, based on the timestamp order in which each movie was rated by users, the first 80% of the ratings in time are used as the training set, and the rest are used as the testing set. This results in 80% training data and 20% testing data for each ad group. The training data of all ad groups are taken as the training set, and all test data constitute the testing set.
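The labeling and time-based split can be illustrated with a small sketch. It assumes that the training portion is the earliest 80% of a movie's ratings, which is one reading of the description above; the record layout and the helpers `to_binary_label` and `time_split` are hypothetical.

```python
def to_binary_label(rating):
    """Ratings of 4 or 5 mark seed users (1); all other ratings are nonseed (0)."""
    return 1 if rating >= 4 else 0

def time_split(records, train_fraction=0.8):
    """Sort one movie's ratings by timestamp; the earliest part becomes training data."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Toy ratings for a single movie (assumed values).
ratings = [
    {"user": "u1", "rating": 5, "timestamp": 50},
    {"user": "u2", "rating": 3, "timestamp": 10},
    {"user": "u3", "rating": 4, "timestamp": 30},
    {"user": "u4", "rating": 1, "timestamp": 20},
    {"user": "u5", "rating": 2, "timestamp": 40},
]
labeled = [dict(r, label=to_binary_label(r["rating"])) for r in ratings]
train, test = time_split(labeled)
```

Splitting by time rather than at random keeps later interactions out of the training data, which better matches how an online system would be evaluated.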
Statistics about the datasets are shown in Table 2. The ratio of positive to negative samples in the Tencent Lookalike dataset is 1 : 20, while the ratio in the MovieLens dataset after processing is close to 1 : 1. The training set and testing set corresponding to each ad/ad group contain both seed and nonseed users. The users in the testing set are regarded as candidates for testing. In the audience targeting system proposed in this paper, to obtain a prior model from the offline data, as shown in Figure 2, we take 50% of the positive and negative samples of each seed set in the Tencent Lookalike training set as the offline data and the rest as online data. In this way, offline and online data are obtained to simulate the whole audience targeting process. For the training set of MovieLens, we use it as both offline and online data.
The two-tower model was initially applied in natural language processing. The Tencent Lookalike dataset contains not only user profile information but also text information such as users' search keywords, favorite topics, and interests. Since FDSM is an improvement on the two-tower model, this dataset is suitable for the experiments. FDSM is also a CTR model, so additional data is needed to show its advantages. MovieLens is a dataset frequently used for CTR models, which makes experiments on it a convincing complement.

Baselines.
In online advertising audience targeting, we compare our proposed model FDSM with the following baseline methods.
FM [4] combines the advantages of support vector machines and factorization models, and has demonstrated its effectiveness in many CTR prediction tasks.
MLP is a popular structure that embeds each feature of a sample into an embedding vector, concatenates the embeddings into a dense representation, and feeds it into a fully connected network to learn the CTR prediction automatically.
DeepFM [8] adds a deep neural network as the deep part on top of the FM model, so that the model can learn higher-order feature interactions. The interaction terms of FM and the output of the deep network are combined for CTR prediction.
PNN [12] applies a product layer after the embedding layer, followed by multiple fully connected layers, to explore high-order feature interactions.
Two-tower [5] is a popular model in retrieval tasks. In this paper, the user features are input into the user tower and the ad features into the ad tower, so that user and ad features are mapped into a shared semantic space. The cosine function is then applied to the extracted user and ad representation vectors to calculate the matching scores.
DCN [9] proposes a deep cross-network that performs high-order feature interactions explicitly and integrates it with a deep neural network. The outputs of both networks are combined to perform the CTR prediction task.
Wide&Deep [7] differs from DCN in that it adds a "wide" part on top of the DNN. It is a general learning framework that combines a wide network and a deep neural network to gain the advantages of both. The output of the last DNN layer and the "wide" part are fed into a linear model to complete the CTR prediction task.
AFN+ [28] builds on AFN, which applies a logarithmic transformation layer to learn adaptive-order feature interactions. AFN+ further integrates AFN with a deep network.
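The cosine matching step of the two-tower baseline can be illustrated with a minimal sketch. The function name `cosine_similarity` and the toy vectors are hypothetical, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero; the tower networks that would produce the representation vectors are not shown.

```python
import math


def cosine_similarity(user_vec, ad_vec):
    """Cosine of the angle between a user vector and an ad vector."""
    dot = sum(u * a for u, a in zip(user_vec, ad_vec))
    user_norm = math.sqrt(sum(u * u for u in user_vec))
    ad_norm = math.sqrt(sum(a * a for a in ad_vec))
    if user_norm == 0.0 or ad_norm == 0.0:
        return 0.0  # assumption: treat a zero vector as "no match"
    return dot / (user_norm * ad_norm)


# Toy representation vectors standing in for the tower outputs.
score = cosine_similarity([3.0, 4.0], [3.0, 4.0])
```

Because the score depends only on the angle between the two vectors, this matching function has no trainable parameters, in contrast to a cross-layer built from fully connected networks.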

Evaluation Metrics.
In this study, we use four metrics to compare model performance. In the field of CTR prediction, AUC (area under the ROC curve) is a widely used metric [29] that reflects the ranking quality of the predicted samples; a higher AUC indicates better CTR prediction performance. In line with the goal of audience targeting, we calculate the AUC score for each ad in the testing set and then average the AUC over all ads, denoted as GAUC. In addition, we use equation (16) to calculate the audience prediction loss for each ad in the testing set; the average over all ads is the final test result, denoted as Logloss, and a smaller loss means better model performance. We also apply two further metrics, Precision@K% and Recall@K% [22]: after retrieval from the candidate users, the top-ranked candidates are selected as target users, and the precision and recall of the model are calculated on them. They are defined over two sets: S, the seed set of a given ad campaign, and T, the set of top-ranked target audiences of size K% × |S| predicted by the audience expansion model for that ad. According to the approximate ratio of positive to negative samples in each dataset, we set K to 5 for the Tencent Lookalike dataset and to 50 for the MovieLens dataset.
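The metrics above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, the precision and recall formulas are assumed to be the standard |S ∩ T| / |T| and |S ∩ T| / |S|, and skipping ads whose test samples contain only one class is also an assumption.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when an ad has only one class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(ad_ids, labels, scores):
    """Unweighted mean of per-ad AUC, skipping ads whose AUC is undefined (assumption)."""
    groups = defaultdict(lambda: ([], []))
    for ad, y, s in zip(ad_ids, labels, scores):
        groups[ad][0].append(y)
        groups[ad][1].append(s)
    per_ad = [auc(ys, ss) for ys, ss in groups.values()]
    per_ad = [a for a in per_ad if a is not None]
    return sum(per_ad) / len(per_ad)


def precision_recall_at_k(seed_users, ranked_candidates, k_percent):
    """T is the top K% x |S| ranked candidates; assumed standard set-overlap definitions."""
    seeds = set(seed_users)
    size = max(1, (k_percent * len(seeds)) // 100)  # |T| = K% x |S|, at least one user
    top = set(ranked_candidates[:size])
    hits = len(seeds & top)
    return hits / len(top), hits / len(seeds)


# Toy example: two ads with four test samples each.
example_gauc = gauc(
    ["a", "a", "a", "a", "b", "b", "b", "b"],
    [1, 0, 1, 0, 1, 0, 0, 1],
    [0.9, 0.1, 0.8, 0.3, 0.2, 0.6, 0.1, 0.7],
)
```

Note that with |T| = K% × |S|, recall is bounded above by K%, so Recall@K% values are only comparable between models evaluated at the same K.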

Implementation Details.
In this section, we introduce the experimental environment and parameter details. For fairness, all models use an offline learning rate $\lambda_{off}$ to train the prior model and an online learning rate $\lambda_{on}$ to train the customized model in the audience targeting system. For Tencent Lookalike, both the offline and online learning rates are tuned from 0.00002 to 0.0002 in steps of 0.00002. For MovieLens, the offline learning rate range is the same as for the Tencent dataset, while the online learning rate ranges from 0.0001 to 0.001 in steps of 0.0001. In the proposed FDSM, the parameters [m, n, k], where m and n are the numbers of networks in the Feature Expression Units (FEUs) and k is the number of fully connected networks in the cross-layer, are set to [6, 6, 5] on Tencent Lookalike and [8, 4, 8] on MovieLens, and the hidden layer structure of all fully connected networks in the FSU and the cross-layer is [128, 64]. For all other models, the hidden layer structure of all networks is [256, 128, 64]. The dimension of the embedding vector is 64. In the offline experiment, the full offline data is used for training; 8 ad campaigns are sampled with a minimum sample size of 512 for every ad in each generation, and the prior model is trained for one epoch. In the online process, following the order of the online training data of an ad campaign, 512 samples are used in each iteration to train the prior model for three epochs to obtain the customized model. Adam [30] is used as the optimizer for the network weights in both the online and offline stages. All models were implemented in Python with PyTorch 1.6.0. The experiments were run on a machine with an Intel Xeon Silver 4210 CPU at 2.2 GHz, 32 GB of memory, and a discrete GTX2080Ti GPU.
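The learning-rate search space described above can be enumerated with a short sketch. The helper `lr_grid` and the variable names are hypothetical; only the ranges and step sizes come from the text.

```python
from itertools import product


def lr_grid(start, stop, step):
    """Inclusive, evenly spaced grid; rounding removes floating-point accumulation noise."""
    count = round((stop - start) / step) + 1
    return [round(start + i * step, 10) for i in range(count)]


# Offline rates are shared by both datasets; online rates differ per dataset.
offline_lrs = lr_grid(0.00002, 0.0002, 0.00002)
online_lrs = {
    "lookalike": lr_grid(0.00002, 0.0002, 0.00002),
    "movielens": lr_grid(0.0001, 0.001, 0.0001),
}

# Every (offline, online) pair corresponds to one customized model per method.
lookalike_runs = list(product(offline_lrs, online_lrs["lookalike"]))
movielens_runs = list(product(offline_lrs, online_lrs["movielens"]))
```

Each grid contains ten values, so every method is trained under 10 × 10 = 100 learning-rate combinations per dataset.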

Model Comparisons.
In this part, we analyze the experimental results in Table 3 and Figures 3 and 4. For Table 3, each model is trained under all ten offline learning rates $\lambda_{off}$ to obtain ten prior models, and each prior model is then trained under the ten online learning rates $\lambda_{on}$ to obtain ten customized models. The reported result for each model is the best among its one hundred customized models. Each point in Figures 3 and 4 is the average, for one value of $\lambda_{on}$, of the test metrics of the ten customized models obtained by training all prior models of a method; this gives ten groups of test metrics over all values of $\lambda_{on}$. From the results, we summarize as follows:
(1) In Table 3, the best performance among all CTR prediction models is highlighted in bold, and the best baseline result is underlined. As the table shows, the proposed FDSM model improves to different degrees on the Lookalike and MovieLens datasets. For example, on the Lookalike dataset, FDSM surpasses the second-best model, MLP, by over 1.03% on the GAUC metric and improves precision and recall by 2.83% and 2.86%, respectively, with a lower Logloss than the other baseline models. On the MovieLens dataset, FDSM outperforms the PNN model by 0.80% on the GAUC metric; compared with the AFN+ model, precision and recall are improved by 0.62% and 0.67%, and the Logloss is also lower than that of the other baseline models. This demonstrates that, after the supervision operation, the FDSM model can extract more fine-grained feature representations of users and ads and thus find accurate matching patterns between feature information in the feature cross-layer. It also shows the effectiveness of the FDSM model in CTR prediction and audience prediction tasks.
(2) Figures 3 and 4 present the average GAUC and Logloss of all models in the online stage on both datasets. Figure 3 shows the test results on the Lookalike dataset. In this figure, we can clearly observe that, on average, the FDSM model outperforms the other baseline models on both metrics under all online learning rates. Figure 4 shows the online test results of all models on the MovieLens dataset. The average GAUC and Logloss of the FDSM model are not the best when the online learning rate is below 0.0002; beyond that point, its average performance exceeds that of the other baseline models and achieves the globally best average GAUC. For the Logloss metric, its performance becomes the best after 0.0005. This further illustrates that the FDSM model outperforms the other baseline models in average performance.
(3) On the Lookalike dataset, the results in Table 3 show that the optimal performance of the FM model is lower than that of the other models, and its average performance in Figure 3 is also the worst, because FM is a shallow structure limited to second-order interactions. All other models involve deep networks, so they can capture second-order and higher-order feature interactions at the same time and therefore outperform FM. The optimal and average performance of the PNN model is also unsatisfactory. One possible reason is that the model uses the inner product of features as part of the deep network input, as in the FM model, so that the second-order feature interaction interferes with the deep network's discovery of higher-order information.
The AFN+ model's logarithmic transformation layer likewise performs feature interaction in advance, only with an additional exponential factor, so its performance on this dataset is close to that of PNN.
DeepFM differs from PNN in that the results of the second-order interaction do not enter the deep network; instead, they are fed into the linear model together with the output of the deep network to produce the final prediction. This model therefore performs better than PNN, although its results are lower than those of the Wide&Deep, MLP, and DCN models.
Wide&Deep combines the output of a shallow "wide" structure with that of a deep network and obtains the prediction through a linear model, while DCN combines a deep cross-network with a deep network. MLP achieves the best optimal performance among the baselines using only a deep network, which shows the powerful capability of this model. Two-tower's performance is still lower, probably because the cosine function used in its cross-layer matches users and ads with low accuracy. This analysis shows that models applying explicit second-order feature interaction, such as FM, PNN, and DeepFM, perform poorly. The reason may be that enumerating all cross-features, including irrelevant ones, introduces noisy feature combinations and reduces model performance [28]. In contrast, models that do not involve explicit second-order feature interaction, such as Wide&Deep, MLP, and DCN, perform better on the Lookalike data.
(4) On the MovieLens dataset, judging by the optimal performance in Table 3 and the average performance presented in Figure 4, PNN, AFN+, MLP, and two-tower have higher optimal and average performance. As in the analysis above, FM is limited to second-order interactions and therefore has lower performance. Wide&Deep, DCN, and DeepFM all combine the outputs of a non-deep network and a deep network; compared with PNN, MLP, and two-tower, which use only deep networks, their shallow structures are weaker at matching information. The AFN+ model uses adaptive-order feature interaction as its shallow structure; its coefficient factors can be adjusted automatically to weaken the influence of the shallow structure, and it cooperates with the deep network to perform CTR prediction, which again indicates the strength of deep networks.
(5) Why do all the models perform so differently on the two datasets?
As noted in the dataset introduction, the Lookalike dataset contains not only user profile information but also the keyword and topic features of users' queries, which form a text corpus. If these features are not extracted first, it is difficult to perform feature interaction directly to discover the relationships between features, and doing so can even impair other structures, as in FM, PNN, and DeepFM. The MovieLens dataset, by contrast, contains only user profile information and ratings, with no text information. Feature interaction can therefore be performed in advance without affecting the performance of the deep networks, as in the PNN and AFN+ models. From the above, it can be seen that models built on deep networks can achieve feature extraction and higher-order feature interaction at the same time. In the FDSM model, we propose the FEU, the FSU, and a cross-layer with bridge connection on top of the two-tower model, all of which use deep networks; this not only strengthens the feature extraction ability but also enables more efficient feature interaction.

Ablation Study.
In the ablation study, taking the two-tower model as the basis, we conduct experiments to analyze the influence of the number of fully connected networks in the Feature Expression Units (FEUs) and the feature cross-layer of the FDSM model, of the Feature Supervision Unit (FSU), and of the cross-layer with bridge connection.
4.6.1. Ablation Experiment Setup. As described in the implementation details, we conduct the experiments under the same learning rate settings, and each reported result is the best among the 100 groups of results. The setup is divided into two parts, arranged as follows. Firstly, to explore the influence of the number of feature networks in the two-tower model, we set the structure parameters [m, n] to [6, 6] on the Lookalike dataset and [8, 4] on the MovieLens dataset. The cosine function is still used in the feature cross-layer; this variant is denoted as multinetwork two-tower (MNTT). To study the influence of the Feature Supervision Unit (FSU) on top of MNTT, two levels of supervision need to be set, as introduced in the discussion of feature dual supervision, so the input of the supervision unit is varied. In the first method, the user supervision vector is set to $e_a$ and the ad supervision vector is set to $e_u$; this method is denoted as FDSM$_a$. In the second method, the user supervision vector is set to $e_{au}$ and the ad supervision vector to $e_{ua}$; this method is denoted as FDSM$_b$.
Secondly, to explore the effect of the proposed bridge connection unit in the feature cross-layer, we replace the cosine function in the two-tower model with k fully connected networks, where k is set to 5 on the Lookalike dataset and 8 on the MovieLens dataset. For the bridge connection in the feature cross-layer, we rebuild the bridge vector $I_b$ proposed in Section 3. In the first method, the concatenation operation combines the feature representation vectors of users and ads from their respective towers, that is, $I_b = I_u \oplus I_a$, which is fed into the k fully connected networks; this method is denoted as TT$_a$. In the second method, the input vector of the fully connected networks is built through the bridge unit proposed in this paper, that is, $I_b = (I_u \odot I_a) \oplus I_u \oplus I_a$; this method is denoted as TT$_b$. These two methods are used to verify the effectiveness of the proposed feature cross-layer with bridge structure.
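The two bridge-vector constructions compared in this setup can be sketched on plain Python lists. The function names `bridge_tt_a` and `bridge_tt_b` are hypothetical, and the sketch assumes that both towers output vectors of the same dimension so that the Hadamard product is defined; the k fully connected networks that consume the bridge vector are not shown.

```python
def hadamard(i_u, i_a):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    if len(i_u) != len(i_a):
        raise ValueError("user and ad vectors must have the same dimension")
    return [u * a for u, a in zip(i_u, i_a)]


def bridge_tt_a(i_u, i_a):
    """TT_a: plain concatenation of the user and ad vectors."""
    return list(i_u) + list(i_a)


def bridge_tt_b(i_u, i_a):
    """TT_b: Hadamard product concatenated with the user and ad vectors."""
    return hadamard(i_u, i_a) + list(i_u) + list(i_a)


# Toy tower outputs of dimension 3.
i_u = [1.0, 2.0, 3.0]
i_a = [0.5, -1.0, 2.0]
tt_a_input = bridge_tt_a(i_u, i_a)  # dimension 2d
tt_b_input = bridge_tt_b(i_u, i_a)  # dimension 3d
```

For tower outputs of dimension d, the TT$_a$ input has dimension 2d and the TT$_b$ input has dimension 3d, so the first layer of each fully connected network in TT$_b$ is correspondingly wider.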

The Impact of the Supervision Unit.
To explore the impact of the FSU, we build the FDSM$_a$ and FDSM$_b$ models on top of MNTT. One reason for using MNTT instead of the plain two-tower model is to avoid the influence of differing numbers of networks in the feature extraction process. Another reason is that the softmax function is applied to the output of the FSU to produce the supervised weight factors, so the FSU has no effect when [m, n] is set to [1, 1], which makes a fair comparison impossible. As shown in Table 4, the FDSM$_a$ and FDSM$_b$ models outperform MNTT on both datasets, which illustrates that the supervision unit can enhance the feature extraction capability of the FEU. The supervision in FDSM$_b$ also performs significantly better than that in FDSM$_a$, which indicates that the same-side information helps discover the correlation between user and ad information in advance, so that more fine-grained user and ad information is obtained in the feature extraction layer. Overall, this analysis shows that the multinetwork feature extraction unit and the feature supervision unit proposed in this paper not only have strong feature extraction ability but may also be able to adapt to different datasets.
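The degenerate case of a single network can be seen in a minimal sketch. It assumes, for illustration only, that supervision is a softmax-weighted sum over the m candidate representations produced by the FEU; the function names and the toy logits are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supervise(representations, logits):
    """Weighted sum of the candidate representations (assumed form of supervision)."""
    weights = softmax(logits)
    dim = len(representations[0])
    return [
        sum(w * rep[j] for w, rep in zip(weights, representations))
        for j in range(dim)
    ]


# With a single network, the softmax weight is 1.0 whatever the FSU outputs.
single_weight_low = softmax([-100.0])
single_weight_high = softmax([3.7])
```

With one candidate, the weight is always 1.0, so the supervised representation equals the FEU output and the supervision unit cannot change the result.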

The Impact of Bridge Connection.
To explore the impact of the multi-network cross-layer with bridge connection, we analyze the experimental results of two-tower, TT$_a$, and TT$_b$. According to the results in Table 4, both TT$_a$ and TT$_b$ outperform two-tower, which indicates that multiple fully connected networks calculate the matching degree between user and ad more accurately than the cosine function. TT$_a$ performs worse than TT$_b$, indicating that the bridge connection with the Hadamard product better captures the matching degree between the two representation vectors. Therefore, the feature cross-layer with bridge connection proposed in this paper has excellent information matching ability.
The results in this part show that the proposed FSU and the feature cross-network with bridge connection both perform well in CTR prediction. Summarizing all the results in Table 4, the FDSM model performs better than the other CTR models, which indicates that combining the FSU with the feature cross-network with bridge connection yields higher performance than either part alone.

Complexity Analysis Experiment.
In this study, experiments were conducted on the Lookalike dataset to measure the number of parameters and the online running time of each model, as shown in Figure 5. In this figure, the x-axis represents the model and the y-axis represents the parameter scale. The number of model parameters is given in units of $\times 10^4$, and the online running time of the model is given in seconds (s). W&D stands for the Wide&Deep model and T-T stands for the two-tower model.
First, the figure shows that the proposed FDSM model has the largest number of parameters, because it uses multiple networks whereas the other models each use a single network. The FDSM model therefore consumes more memory. Second, the online running time of FDSM is shorter than that of DeepFM and PNN but longer than that of the other baseline models. Overall, in terms of the number of parameters and online running time, the model has no absolute advantage but is close to the average. However, according to the results in Table 3, FDSM achieves the best performance at the expense of some memory and time. Hence, with the continuous improvement of hardware performance, the FDSM model can be better applied to online advertising systems.

Conclusion
In this paper, we propose a novel CTR model called the Feature Dual Supervision Model (FDSM) for advertising audience targeting systems. The model is based on the two-tower model and addresses its main shortcoming in feature extraction: the lack of information interaction between the towers, which leads to the loss of detail in the features. In the FDSM model, a Feature Expression Unit (FEU) and a Feature Supervision Unit (FSU) are designed. The FEU extracts features from user or ad information to obtain a representation matrix with multiple feature representations, and the FSU generates a supervised weight vector. The supervised weight vector is then applied to the representation matrix to obtain a unique representation. In addition, we propose feature interaction with bridge connection to find more efficient matching patterns between user and ad representations. Finally, we conducted a large number of experiments. The comparative experiments show that the proposed FDSM model surpasses many classical CTR models and suggest that it may adapt to a wider range of contexts. The effectiveness of the proposed FEU, FSU, and bridge-connected cross-network is illustrated by the ablation experiments.
In future work, we will further study the effects of using different neural networks in the FEU and FSU, such as convolutional neural networks (CNNs) or networks with an added attention mechanism. In addition, we will explore different designs for the feature cross-layer. By studying different bridge connection modules and the influence of different network structures, such as residual networks (ResNet), we aim to find more efficient feature inputs that enable efficient information matching in the feature cross-layer.

Data Availability
The data can be obtained from the following link: https://github.com/haipengni/DataForAds. The data are also available from the corresponding author upon request.