Retrieval-Based Factorization Machines for Human Click Behavior Prediction

Human click behavior prediction is crucial for recommendation scenarios such as online commodity or advertisement recommendation, as it is helpful to improve the quality and user satisfaction of services. In recommender systems, the concept of click-through rate (CTR) is used to estimate the probability that a user will click on a recommended candidate. Many methods have been proposed to predict CTR and achieved good results. However, they usually optimize the parameters through a global objective function such as minimizing logloss or root mean square error (RMSE) for all training samples. Obviously, they intend to capture global knowledge of user click behavior but ignore local information. In this work, we propose a novel approach of retrieval-based factorization machines (RFM) for CTR prediction, which can effectively predict CTR by combining global knowledge which is learned from the FM method with the neighbor-based local information. We also leverage the clustering technique to partition the large training set into multiple small regions for efficient retrieval of neighbors. We evaluate our RFM model on three public datasets. The experimental results show that RFM performs better than other models in metrics of RMSE, area under ROC (AUC), and accuracy. Moreover, it is efficient because of the small number of model parameters.


Introduction
Human click behavior prediction is important for recommendation scenarios in many online commercial services. In those scenarios, recommended items such as online commodities, advertisements, and videos are displayed to end users for clicking. Service providers expect more clicks from end users on the recommended items because that means more revenue from advertisers, and it helps improve user satisfaction [1]. Therefore, it is crucial to accurately predict the click-through rate (CTR) for these recommendation scenarios [1,2], as CTR indicates the probability that a user will click on a recommended candidate.
CTR prediction relies on the analysis of historical click behavior data. The click behavior data include mostly discrete and categorical features, such as user gender, user ID, commodity category, commodity ID, and other location or demographic information. Thus, the data can be highly sparse and have complicated feature interactions. Generalized linear models, including logistic regression [3] and support vector machines [4], have been applied to predict CTR, but they have difficulty capturing high-order feature interactions. Then, factorization machines (FM) [5] were proposed to model order-2 feature interactions by the inner product of latent vectors between different features and achieved very promising performance. Based on FM, field-aware factorization machines (FFM) [6] divide features into different fields and extend FM with additional field-aware feature interactions. In addition, high-order factorization machines (HOFM) [7] were presented to model high-order (more than 2) feature interactions but are limited by high training complexity.
Recently, deep learning has made massive strides in many research areas, obtaining state-of-the-art performance in computer vision [8], natural language processing [9], and many other domains [10][11][12]. In order to learn sophisticated feature interactions, deep neural networks were recently proposed to predict CTR [13][14][15][16][17]. These models build on feature embedding: the features are first represented by one-hot vectors and then embedded into low-dimensional dense vectors. Wide&Deep [13] feeds these feature embeddings to a combination of a linear model and a DNN to capture both low-order and high-order feature interactions. DeepCross [14] uses a multilayered residual network to prevent gradient explosion and vanishing problems when the depth of the network increases. DeepFM [15] combines FM and DNN through sharing feature embedding vectors. NFM [16] obtains second-order feature interaction vectors by FM and feeds them into fully connected layers. AFM [17] applies an attention mechanism to second-order interaction vectors, which models the importance of different interactions. These deep learning methods can represent low-order and high-order feature interactions well and thus obtain good performance for CTR prediction.
However, the methods above [5][13][14][15][16][17], including FM and FM-based neural models, usually train the models and optimize the parameters through a global objective function, such as minimizing logloss or mean square error over all training samples. As a result, they capture global knowledge of user click behavior but ignore local information such as the most similar samples. Local information has been considered in collaborative filtering based on the memory network [18,19], but these methods only utilize user-item interaction information. In the CTR prediction task, they miss content and contextual information such as user demographics, time, and locations. Also, the training set is usually very large, and the local information from the training set changes for different testing samples; thus, efficiency is critical for retrieving such local information.
In this work, we propose a novel approach of retrieval-based factorization machines (namely RFM) for CTR prediction, which enhances FM by retrieving similar samples from the training set as the neighbor-based local information. Specifically, we first train a standard FM model with the feature embedding and the second-order feature interaction embedding. Based on the sample representation given by the second-order feature interaction embedding, we can retrieve similar samples from the training set for a given testing sample. In order to improve the efficiency of retrieving similar samples from the large training set, we partition the training set into multiple small regions in a preprocessing step using the K-means clustering algorithm. During the testing phase, we find the most similar region by computing the similarity between the testing sample and the center vectors of all regions. Then, we retrieve the most similar samples as neighbors from that region and finally enhance the FM model by fusing the neighbor-based local information and the original FM output via a weighted sum. We conduct extensive experiments on three public datasets to evaluate our RFM method. The experimental results show that RFM outperforms FM and existing methods such as HOFM [7] and deep learning models including DeepCross [14], Wide&Deep [13], and DeepFM [15]. In addition, RFM has the same number of trainable parameters as FM, which is much smaller than those of the other models. Therefore, RFM is an efficient and effective approach for CTR prediction. Compared with black-box deep neural models, RFM is also more explainable due to its simple and easily understood architecture.
In summary, this paper makes the following contributions:
(i) We propose a novel approach of retrieval-based factorization machines (RFM) for CTR prediction, which enhances FM with neighbor-based local information.
(ii) We use the clustering technique to partition the large training set into multiple small regions for efficient retrieval of similar samples.
(iii) We conduct extensive experiments to evaluate RFM on three public datasets, and the experimental results show that RFM performs better than existing models and is efficient due to the smallest number of model parameters.
The remainder of this paper is organized as follows. Firstly, we discuss related works in Section 2. Section 3 describes the embedding methods for FM. Afterwards, we describe the details of our approach in Section 4. Section 5 describes datasets, evaluation procedures, and evaluation results. Finally, we conclude our work in Section 6.

Related Work
CTR prediction is an important task of the recommendation domain [1,2]. In this section, we discuss the related work about traditional machine learning methods, deep learning models, and memory-based models in the recommender systems.
For CTR prediction, some traditional machine learning methods were proposed in the early stage, such as the support vector machine [4], Bayesian model [20], tensor-based model [21], linear regression [22], and decision tree [3]. After that, the factorization machine (FM) [5] was proposed. It projects each feature into a latent vector and captures the second-order feature interaction information. The field-aware factorization machine (FFM) [6] and the high-order factorization machine (HOFM) [7] are enhanced variants of FM. FFM introduces the notion of fields into FM, and HOFM models high-order (more than 2) feature interactions. For the general recommendation scenario, collaborative filtering (CF) [23] is a traditional and fundamental method. Matrix factorization (MF) [24], which projects each user and item into a common low-dimensional space to capture latent relations, is a well-known CF-based method.

Computational Intelligence and Neuroscience
With the wide use of deep learning in various fields, many deep learning models have been proposed for CTR prediction [25][26][27][28][29]. Their bottom-level structure is an embedding layer that maps categorical variables to lower-dimensional dense vectors. DeepFM [15] combines FM and DNN through sharing embedding parameters to represent low-order and high-order feature interactions. AFM [17] and NFM [16] are based on second-order feature interaction vectors. NFM feeds these vectors into fully connected layers, and AFM applies an attention mechanism to these vectors to model the importance of different interactions. HoAFM [30] encodes high-order feature interactions into feature representations in an explicit manner. Besides, the convolutional click prediction model [25] uses a convolutional neural network to process a matrix consisting of embedding vectors, and the deep&cross network [26] combines a cross network and a deep network, where the cross network makes the degree of cross features grow with layer depth. The product-based neural network [27,28] introduces a product layer to capture interaction information. The recurrent neural network for sequential click prediction [29], the deep interest network [31], and the deep interest evolution network [32] take advantage of the sequence of users' historical click behaviors to predict CTR. The convolutional neural network for CTR prediction in display advertisement [33] combines a convolutional neural network that processes raw display images with a general deep network. Several deep learning methods have also been proposed for recommendation tasks. They are used for recommending videos [34], music [35], and movies [36], improving collaborative filtering via deep learning. Generalized matrix factorization [37] and neural network matrix factorization [38] improve MF via deep learning.
Compared with the traditional machine learning methods and deep learning models above for CTR prediction, we consider the neighbor-based local information to enhance the FM method.
In the field of collaborative filtering, there are also some studies [18,19,39] that introduce neighbor-based local information [40] to improve their methods based on the memory network [9,41], including the collaborative memory network [18], the multirelational memory network [19], and the collaborative session-based recommendation machine [39]. Their main idea is to fuse a memory component and a neural attention mechanism as the neighborhood component. Also, knowledge-enhanced sequential recommendation [42] integrates RNN-based networks with a key-value memory network [43]. However, these methods usually only consider specific feature interactions such as the user and item feature interaction, whereas our RFM method considers the content and contextual information and leverages the region partition to further improve the efficiency and performance.
Our earlier work entitled "Retrieval-based Factorization Machines for CTR Prediction" in WISE 2021 presents the main idea of RFM. In this extended paper, we give more details on the design of RFM, including dropout, batch normalization, and the selection strategy for the top-k neighbors. Besides, we analyze our RFM model by comparing it with existing FM-based neural models and collaborative filtering methods based on the memory network in terms of complexity and the cold-start problem. Moreover, we evaluate the impact of more hyperparameters, including the embedding size, the number of similar samples, and the similarity threshold, through extensive experiments. Finally, we present a more detailed analysis of related work.

Background
In this section, we provide the background of FM and FM-based neural models, including the feature embedding and second-order feature interaction embedding.
3.1. Feature Embedding. For the task of CTR prediction, the features of historical click behavior data typically have categorical fields (e.g., gender, commodity categories) and continuous fields after discretization (e.g., cost, age). These fields are usually converted to a set of binary features via one-hot encoding, making the original feature vectors highly sparse.
One common practice is to encode the sparse feature vectors into low-dimensional dense vectors by feature embedding. Consider one sample $x$ with $n$ fields, where $x_i$ ($1 \le i \le n$) is the $i$-th field vector obtained via one-hot encoding. We map each field vector $x_i$ to an embedding vector
$$v_i = W_e x_i, \quad v_i \in \mathbb{R}^{d}, \tag{1}$$
where $W_e$ is the latent factor matrix that can be learned in an end-to-end manner, and $d$ is the embedding size. Then, we denote the output of the feature embedding as follows:
$$V(x) = [v_1, v_2, \ldots, v_n]. \tag{2}$$
The feature embedding technique has been adopted in Wide&Deep [13], DeepFM [15], and DeepCross [14] to reduce the data sparsity. Such embeddings are treated as the input of their models.
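As a minimal sketch of the embedding step described above, the NumPy snippet below looks up one embedding per field and checks that multiplying a one-hot field vector by a latent factor matrix selects the same row. The field sizes, the sample, the use of one matrix per field, and the names `tables` and `embed` are illustrative assumptions, and the random matrices stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
field_sizes = [2, 3, 4]   # hypothetical number of categories in each field
d = 5                     # embedding size
# One latent factor matrix per field, shape (categories, d); random stand-ins
# for parameters that would normally be learned end to end.
tables = [rng.normal(size=(m, d)) for m in field_sizes]

def embed(sample):
    """Map a sample (one category index per field) to an (n, d) array of embeddings."""
    return np.stack([tables[i][idx] for i, idx in enumerate(sample)])

sample = [1, 0, 3]        # hypothetical sample with n = 3 fields
V = embed(sample)

# Multiplying the one-hot field vector by the matrix selects the same row.
one_hot = np.zeros(field_sizes[2])
one_hot[sample[2]] = 1.0
assert np.allclose(one_hot @ tables[2], V[2])
```

In practice the lookup form is preferred because it avoids materializing the sparse one-hot vectors.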

Second-Order Feature Interaction Embedding.
Besides the (first-order) feature embedding described above, the second-order feature interaction embedding is also widely used in FM-based neural models including NFM [16] and AFM [17]. These methods feed the feature embedding $V(x)$ into the bi-interaction layer [16] and obtain the second-order feature interaction embedding as follows:
$$S(x) = \sum_{i=1}^{n} \sum_{j=i+1}^{n} v_i \odot v_j, \tag{3}$$
where $\odot$ denotes the element-wise product of two vectors, that is, $(v_i \odot v_j)_k = v_{ik} v_{jk}$. Compared with the feature embedding, the second-order feature interaction embedding can capture more knowledge of user click behaviors and has been proven to be more effective in CTR prediction [16,17]. Thus, we adopt the second-order feature interaction embedding in this work.
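The bi-interaction pooling described above can be sketched as a sum of element-wise products over all pairs of field embeddings. The following NumPy snippet is an illustrative sketch only: the function name `bi_interaction`, the random embeddings, and the sizes are assumptions.

```python
import numpy as np

def bi_interaction(V):
    """Sum of element-wise products v_i * v_j over all pairs i < j.

    V has shape (n, d): one d-dimensional embedding per field.
    Returns a d-dimensional second-order interaction embedding.
    """
    n, d = V.shape
    out = np.zeros(d)
    for i in range(n):
        for j in range(i + 1, n):
            out += V[i] * V[j]
    return out

rng = np.random.default_rng(0)
V = rng.normal(size=(4, 3))   # hypothetical sample: n = 4 fields, d = 3
S = bi_interaction(V)

# The same vector can be computed in O(n d) from sums and squared sums.
S_fast = 0.5 * (V.sum(axis=0) ** 2 - (V ** 2).sum(axis=0))
assert np.allclose(S, S_fast)
```

The double loop costs O(n² d), which is why the sum-of-squares form is normally used in practice.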

Retrieval-Based Factorization Machines
In this section, we describe the approach of retrieval-based factorization machines (RFM) for CTR prediction, which enhances FM with retrieved neighbor-based local information. As shown in Figure 1, firstly, we train a standard FM to obtain the global knowledge of user click behaviors (Section 4.1) and obtain the second-order feature interaction embeddings; secondly, we partition the training set into different regions by a clustering algorithm based on the second-order feature interaction embeddings (Section 4.2). Such regions can be used to efficiently retrieve similar samples for testing samples and obtain the neighbor-based local information (Section 4.3). Finally, we enhance FM for predicting CTR by fusing the global and local information (Section 4.4).

Factorization Machines.
Similar to existing FM-based neural models [15][16][17] for CTR prediction, we first train the feature embedding layer and the second-order feature interaction embedding layer. Instead of feeding the embeddings to upper neural layers, we use them to build a standard FM model, as described in Section 3. Given the sample $x$ as input, the predicted CTR is
$$\hat{y}_g(x) = b + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle, \tag{4}$$
where $w_i$ represents the weights of field vectors, and $b$ is the bias. The first and second terms are the linear part, which reflects the importance of first-order features. The third term represents the impact of the second-order feature interactions. Similar to FM [5], the third term can be reformulated as
$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle = \frac{1}{2} \sum_{k=1}^{d} \left[ \left( \sum_{i=1}^{n} v_{ik} \right)^{2} - \sum_{i=1}^{n} v_{ik}^{2} \right], \tag{5}$$
which not only reduces the computation complexity to $O(nd)$ but also can be expressed as matrix operations, which can be accelerated by GPU. Based on (4), existing works [15][16][17] usually train FM and optimize model parameters through a global objective function such as minimizing the global mean square error. Thus, FM captures the global knowledge of user click behaviors in the training set but ignores the local information such as the most similar samples in the training set.
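To make the complexity argument concrete, the sketch below computes an FM prediction in two ways: a naive pairwise sum and the O(n d) sum-of-squares form. It assumes one-hot inputs, so each active feature contributes its linear weight directly; the function names and random parameters are hypothetical.

```python
import numpy as np

def fm_predict(V, w, b):
    """FM output using the O(n d) reformulation of the pairwise term.

    V: (n, d) embeddings of the active features, w: (n,) their linear
    weights, b: bias. One-hot inputs are assumed, so every active feature
    has value 1.
    """
    sum_sq = V.sum(axis=0) ** 2        # (sum_i v_ik)^2 for each dimension k
    sq_sum = (V ** 2).sum(axis=0)      # sum_i v_ik^2 for each dimension k
    pairwise = 0.5 * (sum_sq - sq_sum).sum()
    return b + w.sum() + pairwise

def fm_predict_naive(V, w, b):
    """Reference version that sums the inner products of all pairs i < j."""
    n = V.shape[0]
    pairwise = sum(V[i] @ V[j] for i in range(n) for j in range(i + 1, n))
    return b + w.sum() + pairwise

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 8))   # hypothetical: n = 6 fields, d = 8
w = rng.normal(size=6)
b = 0.1
y_fast = fm_predict(V, w, b)
y_naive = fm_predict_naive(V, w, b)
```

Both functions return the same value up to floating-point error; only the first one scales linearly in the number of fields.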

Region Partition.
In order to obtain the local information of one given sample in the testing set, we retrieve similar samples from the training set as its neighbors based on the second-order feature interaction embedding (Section 3.2). However, the training set is often very large. Thus, directly computing the similarities against the whole training set would incur much overhead. To solve this problem, as preprocessing, we adopt K-means [44], a classical clustering algorithm, to partition the training set into multiple regions. Then, we leverage these regions to accelerate the retrieval of similar samples. The clustering algorithm runs only once, and its result can be used for all testing samples. Specifically, given all samples $X$ in the training set and the $i$-th sample $X_i$, we get the representation of sample $X_i$ based on the second-order feature interaction embedding by
$$emb_i = \mathrm{BN}(S(X_i)), \tag{6}$$
where we adopt batch normalization (BN) [45] to normalize the embedding $S(X_i)$ and keep the distribution of $emb_i$ consistent. Similar to (5), we reformulate $S(X_i)$ to improve the efficiency as follows:
$$S(X_i) = \frac{1}{2} \left[ \left( \sum_{j=1}^{n} v_j \right)^{2} - \sum_{j=1}^{n} v_j^{2} \right], \tag{7}$$
where $v_1, \ldots, v_n$ are the field embeddings of $X_i$ and the squares are taken element-wise. Based on the representation $emb_i$, we adopt the popular clustering algorithm K-means [44] to partition all the samples in the training set into multiple regions. In the K-means algorithm, we compute the Euclidean distance between sample representations and obtain $k$ regions as follows:
$$\min_{C, U} \sum_{i=1}^{k} \sum_{(x, y) \in c_i} \lVert x - u_i \rVert^{2}, \quad C = \{c_1, \ldots, c_k\}, \quad U = \{u_1, \ldots, u_k\}, \tag{8}$$
where $C$ is the set of the sample regions, and $U$ is the set of center vectors for different regions. Each sample in the region $c_i$ is a tuple $(x, y)$, where $x$ is the $emb_i$ vector representation, and $y$ is the corresponding label. All the regions are disjoint, and their union is the whole training set. $k$ is the number of regions and can be manually tuned. After clustering, all samples in the training set are partitioned into $k$ regions, and the center vector $u_i$ represents the characteristics of all samples in the same region.
We find the most similar region based on the center vectors and then retrieve the similar samples from that region. In this way, we reduce the computation complexity of retrieving similar samples from the whole training set.
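The region partition can be illustrated with a plain Lloyd-style K-means on Euclidean distance. The sketch below is an assumption-laden toy (synthetic two-dimensional representations, a hypothetical function name `kmeans_regions`, a fixed number of iterations), not the configuration used in the experiments.

```python
import numpy as np

def kmeans_regions(emb, k, iters=20, seed=0):
    """Plain Lloyd's K-means on Euclidean distance.

    Returns (centers, regions): centers has shape (k, d) and regions is a
    list of k index arrays that are disjoint and together cover all samples.
    """
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for c in range(k):
            members = emb[assign == c]
            if len(members) > 0:       # keep the old center if a region is empty
                centers[c] = members.mean(axis=0)
    # Final assignment against the final centers.
    dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    regions = [np.flatnonzero(assign == c) for c in range(k)]
    return centers, regions

# Hypothetical sample representations: two well-separated groups in 2-D.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                 rng.normal(5.0, 0.1, size=(20, 2))])
centers, regions = kmeans_regions(emb, k=2)
```

Because every sample is assigned to exactly one nearest center, the regions are disjoint and their union is the whole set, matching the property stated in the text.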
Intuitively, the retrieved similar samples may not be the most similar ones in the whole training set, which could decrease the performance, since we use the center vectors to represent all the samples of the same region. However, the clustering technique reveals the intrinsic nature and regularity [46,47] of the training set, and the most similar samples retrieved from the same region may contain more effective and generalizable information than those from the whole training set. That is why partitioning into more than one region may lead to better performance than not partitioning, which is observed in our experiments (Section 5.7.1). Therefore, such partitioning not only increases the retrieval efficiency but also improves the performance.

Neighbor-Based Local Information.
Based on the disjoint regions of the training set, we introduce an efficient approach to retrieve similar samples as the neighbors of one given testing sample. Instead of computing the most similar samples directly from the large training set, we find the most similar sample region by calculating the similarity between the center vectors of the regions and the representation of the testing sample. Then, we retrieve the most similar samples as neighbors from that region. Finally, we choose the top-$t$ ($t \ge 1$) neighbors with the highest similarities to capture more local information and adopt a similarity threshold to filter out possible noisy neighbors.
Specifically, we first measure the similarity $sim(X_i, X_j)$ between samples $X_i$ and $X_j$ based on their representations, using $dist_{ed}(emb_i, emb_j)$, the Euclidean distance between $emb_i$ and $emb_j$. The smaller the distance between two samples is, the higher the similarity between them is. Then, we can get the most similar sample region as follows:
$$g = \arg\max_{1 \le i \le k} sim(emb_0, u_i), \tag{10}$$
where $emb_0$ is the representation of the given sample, and $g$ is the index of the most similar sample region in the region set $C$.
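As a hedged illustration of the region lookup, the sketch below assumes one particular similarity, 1 / (1 + Euclidean distance), which decreases with distance and lies in (0, 1]. This exact form is an assumption made for the example and may differ from the similarity used in the paper; the function names and center vectors are also hypothetical.

```python
import numpy as np

def similarity(a, b):
    """ASSUMED form: 1 / (1 + Euclidean distance), which lies in (0, 1]."""
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def most_similar_region(emb0, centers):
    """Index of the region whose center vector is most similar to emb0."""
    sims = [similarity(emb0, u) for u in centers]
    return int(np.argmax(sims))

centers = np.array([[0.0, 0.0], [5.0, 5.0]])   # hypothetical center vectors
emb0 = np.array([4.0, 4.5])                    # hypothetical testing sample
g = most_similar_region(emb0, centers)
```

Only k similarities are computed at this stage, one per center vector, regardless of the size of the training set.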
We finally show how to retrieve the top-$t$ neighbors with the similarity threshold $r$ from the region $c_g$ in Algorithm 1.
As shown in Algorithm 1, the threshold $r$ is used to filter out possible noisy neighbors (line 2). The output $N$ is a list of tuples $(sim, y)$ containing the similarities and labels of the neighbors. The function selectTop in line 8 selects the top-$t$ similar samples from the neighbors, and it is discussed in Section 4.5.4. The more similar the neighbors are, the more informative they are for the prediction. A high similarity threshold may filter out useful neighbor-based information. Conversely, a low similarity threshold introduces noise that harms the prediction. Additionally, the number of selected neighbors $t$ also influences the performance. The impact of the similarity threshold $r$ and the neighbor number $t$ is discussed in Section 5.
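A possible reading of the retrieval procedure is sketched below: compute the similarity to every sample in the chosen region, drop candidates below the threshold, and keep the top-t. This is not a transcription of Algorithm 1; the similarity form 1 / (1 + distance), the inclusive comparison `sim >= r`, and the function name `retrieve_neighbors` are assumptions.

```python
import heapq
import numpy as np

def retrieve_neighbors(emb0, region, r, t):
    """Return up to t (sim, label) tuples from `region` with sim >= r.

    region: list of (embedding, label) tuples from the most similar region.
    The similarity 1 / (1 + Euclidean distance) is an ASSUMED form.
    """
    candidates = []
    for emb, label in region:
        sim = 1.0 / (1.0 + np.linalg.norm(emb0 - emb))
        if sim >= r:                       # filter out possibly noisy neighbors
            candidates.append((sim, label))
    # Keep the t most similar candidates, most similar first.
    return heapq.nlargest(t, candidates, key=lambda pair: pair[0])

# Hypothetical region with three labeled samples.
region = [(np.array([0.0, 0.0]), 1),
          (np.array([1.0, 0.0]), -1),
          (np.array([10.0, 0.0]), 1)]
N = retrieve_neighbors(np.array([0.0, 0.0]), region, r=0.4, t=2)
```

In this toy region the third sample has similarity 1/11 and is removed by the threshold, so only the first two samples are returned.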
Compared with the global knowledge from the whole training set, the retrieved neighbors are only a small subset. But they usually represent the common knowledge of these similar click behaviors, which can be treated as the local information of the given testing sample.

Enhancing FM with Local Information.
To improve the FM model, we fuse the retrieved neighbor-based local information $N$ and the global information $\hat{y}_g(x)$ provided in Section 4.1.
Specifically, we add the similarity-weighted sum of the neighbor labels to the original FM output and normalize the result. Here, $t$ is the number of retrieved neighbors, and $y_i$ and $sim_i$ ($1 \le i \le t$) are the label and similarity of the $i$-th neighbor, respectively. For balancing the global and the local effect, we add a factor $\beta$ to control the effect of the local information, which can be manually tuned. Since the similarities between the given sample and other samples in the training set lie between 0 and 1, we also vary $\beta$ from 0 to 1. If the neighbor similarities are relatively small, we can increase $\beta$; otherwise, we decrease $\beta$ to give less weight to the local information.
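One way to realize such a fusion is sketched below. The exact normalization is an assumption made for illustration: the similarity-weighted neighbor labels, scaled by β, are added to the FM output, and the sum is divided by one plus the scaled total similarity. The function name `fuse` is hypothetical.

```python
def fuse(y_global, neighbors, beta=1.0):
    """Fuse the FM output with neighbor labels (ASSUMED normalization).

    y_global: FM prediction for the testing sample.
    neighbors: list of (sim, label) tuples for the retrieved neighbors.
    beta: factor controlling the effect of the local information.
    """
    weighted_labels = sum(sim * label for sim, label in neighbors)
    total_sim = sum(sim for sim, _ in neighbors)
    return (y_global + beta * weighted_labels) / (1.0 + beta * total_sim)

no_neighbors = fuse(0.2, [], beta=1.0)          # falls back to the FM output
one_neighbor = fuse(0.0, [(1.0, 1.0)], beta=1.0)
```

Under this assumed form, the result is a weighted average of the global output and the neighbor labels, so with no retrieved neighbors it reduces to the plain FM prediction.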

Training and Testing.
Since jointly training on all the samples in the training set and their corresponding neighbors is very expensive, we only train a standard FM model and fuse the neighbor-based local information and the original FM output during testing. In the training phase, we use one global objective function to update the trainable parameters of the standard FM. Then, we obtain the second-order feature interaction embeddings for representing samples and for the region partition. Finally, during the testing phase, we retrieve the neighbors and enhance FM by fusing the original FM output and the neighbor-based local information.

Objective Function.
FM can be applied to various prediction tasks, including regression, classification, and ranking. In our task of CTR prediction, we adopt the widely used square loss as the objective function:
$$L = \sum_{x \in X} \left( \hat{y}_g(x) - y(x) \right)^{2}, \tag{12}$$
where $X$ represents the set of instances for training, and $y(x)$ represents the target of instance $x$.

Dropout.
We use $emb_i$ to represent any sample $X_i$, and its dimension is $d$. If we assign a large value to $d$, it may lead to overfitting. To alleviate this problem, we apply dropout [48], a regularization technique that randomly drops neurons during training. Only the part of the model parameters that contributes to the prediction of $\hat{y}_g(x)$ is updated in each iteration. In the testing phase, dropout is disabled, and all parameters are used for estimating $\hat{y}_g(x)$.

Batch Normalization.
As described in Section 4.2, we normalize the second-order feature interaction embedding vectors through batch normalization [45] to keep the distribution of $emb_i$ consistent. BN normalizes inputs to a zero-mean unit-variance Gaussian distribution. Formally, given an input vector $X_i \in \mathbb{R}^{d}$ and the set $B = \{X_i\}$ of all input vectors to the layer in the mini-batch, BN normalizes $X_i$ as follows:
$$\mathrm{BN}(X_i) = \gamma \odot \frac{X_i - \mu_B}{\sigma_B} + \beta, \tag{13}$$
where $\mu_B$ and $\sigma_B^{2}$ denote the mini-batch mean and variance, respectively, and $\gamma$ and $\beta$ are trainable parameters that scale and shift the normalized value to restore the representation power of the model. BN is applied in both the training and testing phases in our RFM model.
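For illustration, the normalization can be sketched in NumPy as below. The small constant `eps` added to the variance for numerical stability is a common implementation detail assumed here, and the scalar defaults for the scale and shift parameters are placeholders for learned values.

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each dimension over the mini-batch, then scale and shift.

    X: (batch, d) array. gamma and beta stand in for the trainable scale
    and shift; eps is an assumed stabilizer, not taken from the text.
    """
    mu = X.mean(axis=0)                  # mini-batch mean per dimension
    var = X.var(axis=0)                  # mini-batch variance per dimension
    return gamma * (X - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=(64, 4))   # hypothetical mini-batch of embeddings
out = batch_norm(X)
```

With the default scale and shift, every output dimension has approximately zero mean and unit variance over the mini-batch.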

Selection of Top-t Neighbors.
In the testing phase, we fuse the neighbor-based local information and the original FM output online. The bottleneck is how to efficiently select the top-$t$ neighbors from the most similar sample region. We briefly discuss three alternative methods. The first is sorting the neighbors by similarity in descending order and selecting the first $t$ neighbors. The second is quick selection, which adopts the idea of divide and conquer; that is, we swap samples by comparing them with pivots in subintervals recursively until the length of the subinterval is equal to $t$. The elements in the subinterval are the results. The third is using a priority queue implemented by a heap. We build a priority queue of size $t$ and push the neighbors into the queue, which dynamically pops the neighbors with small similarity. After going through all neighbors, the neighbors remaining in the priority queue are the results. In this work, we adopt the quicksort algorithm to select the top-$t$ neighbors for good efficiency.
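The three selection strategies can be compared on a toy list of similarities. In this sketch, `np.sort` stands in for full sorting, `np.argpartition` for a selection-based approach (NumPy uses introselect by default, a quickselect variant), and `heapq.nlargest` for the bounded priority queue. The similarity values are made up.

```python
import heapq
import numpy as np

sims = np.array([0.3, 0.9, 0.5, 0.8, 0.1, 0.7])   # hypothetical similarities
t = 3

# (1) Full sort in descending order, then take the first t.
by_sort = np.sort(sims)[::-1][:t]

# (2) Selection: partition so that the t largest values occupy the last t
#     slots (in arbitrary order), then sort only those t values.
idx = np.argpartition(sims, -t)[-t:]
by_select = np.sort(sims[idx])[::-1]

# (3) Heap-based priority queue that keeps only the t largest values.
by_heap = heapq.nlargest(t, sims.tolist())
```

All three return the same neighbors; they differ in cost, roughly O(N log N) for sorting, O(N) on average for selection, and O(N log t) for the heap.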

Comparison with Existing Models.
We compare our RFM model with existing FM-based neural models and collaborative filtering methods based on the memory network in the aspects of complexity and the cold-start problem [49].

Complexity Analysis.
The number of trainable parameters of our RFM model is much smaller than that of neural models, including NFM [16], DeepFM [15], and Wide&Deep [13]. The embedding layer has $n \times d$ parameters, and the linear weights in the global output contribute $n$ parameters. The bias in the global output and the two parameters in batch normalization are constant; thus, we omit them. In the testing phase, we need to store the neighbors retrieved from similar samples, which take $t \times d$ storage units. Thus, the space complexity of our model is $O((n + t)d + n)$. The deep learning models mentioned above have not only embedding layers but also plenty of fully connected layers. Thus, the number of trainable parameters in these models grows further with the number and width of these layers.
In terms of computation complexity, we reduce the complexity of computing the global output $\hat{y}_g(x)$ to $O(nd)$. In the testing phase, we compute the similarity between a given sample and the samples from the most similar region in $O(dN)$, where $N$ is the number of samples in the region. After that, we select the top-$t$ neighbors in $O(N \log N)$. The computation complexity of our model is therefore $O((n + N)d + N \log N)$. In the deep learning methods, the computation complexity also grows with the number and width of the layers.

Cold-Start.
It is difficult to make personalized recommendations without enough historical data of users (i.e., the cold-start problem [49]), which is common in recommender systems and has been studied for a long time [50][51][52]. Existing memory-based models, including the collaborative memory network [18] and the multirelational memory network [19], also leverage the idea of fusing global information and local information, but both models only use the user and item interaction information. When a new user arrives, they cannot map the user to an effective vector due to the lack of historical click data for that user. By contrast, our RFM model takes full advantage of user demographics, which can be easily obtained from, for example, registration information, and of contextual information such as time and location. Thus, it can capture effective feature interaction information and neighbor-based local information for CTR prediction. In this way, our RFM model can adapt to the cold-start scenario better.

Evaluation
In this section, we conduct extensive experiments to evaluate our RFM approach on three public datasets. We first show the superior performance of RFM and analyze the effectiveness of neighbor-based local information. We also investigate the impact of hyperparameters, in particular, the number of regions in partitioning (Section 4.2) and the neighbor-based local information (Section 4.3).
(i) Frappe Dataset: This dataset is often used in context-aware recommendation. It contains 96,203 app usage logs of users under different contexts. Besides user ID and app ID, it contains eight context variables, all categorical, including weather, city, and daytime. We convert each log (user ID, app ID, and context variables) to a feature vector via one-hot encoding.
(ii) MovieLens Dataset: This dataset has been used for personalized tag recommendation. It contains 668,953 tag applications of users on movies. We also convert each tag application (user ID, movie ID, and tag) to a feature vector.
(iii) Criteo Dataset: This dataset includes 45 million users' click records and has 13 continuous features and 26 categorical features. It has been widely used for the display advertising challenges. We discretize the continuous features and convert them by using a tool provided in the Kaggle challenge (https://github.com/ycjuan/kaggle-2014-criteo).
For Frappe and MovieLens, if one log is assigned a label of value 1, we treat it as "clicked" which means that the user has used the app under the context or applied the tag on the movie. We randomly select the logs representing that the user does not use the app or the tag is not applied on the movie as negative samples and assign −1 to their labels.
Finally, we get 288,609 and 2,006,859 samples for Frappe and MovieLens, respectively. We randomly split each dataset into three parts: 70% for training, 20% for validation, and 10% for testing. We use the validation set for tuning hyperparameters and the testing set for performance comparison. For Criteo, we randomly sample 458,406 samples and split them into training, validation, and testing parts using the same ratio.
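A random 70/20/10 split of this kind can be sketched as follows. The function name `split_indices`, the seed, and the sample count are illustrative assumptions; integer arithmetic is used for the split sizes to avoid floating-point rounding.

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Randomly split sample indices into 70% train, 20% validation, 10% test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = n_samples * 7 // 10
    n_val = n_samples * 2 // 10
    train_idx = perm[:n_train]
    val_idx = perm[n_train:n_train + n_val]
    test_idx = perm[n_train + n_val:]      # the remainder goes to testing
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = split_indices(1000)
```

Because the three parts are slices of one permutation, they are disjoint and cover every sample exactly once.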

Evaluation Metrics.
We adopt root mean square error (RMSE), area under ROC (AUC), and accuracy to evaluate the performance, which are popular evaluation metrics in the tasks of explicit rating recommendation [54] and click-through rate prediction [55]. RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{x \in X} \left( \hat{y}(x) - y(x) \right)^{2}}, \tag{14}$$
where $X$ represents the set of instances for testing, $N$ is the number of instances, and $\hat{y}(x)$ and $y(x)$ represent the predicted value and the ground-truth label of an instance $x$. A lower RMSE score indicates a better performance. AUC is insensitive to the classification threshold and the positive ratio. AUC's upper bound is 1, and a larger value indicates a better performance. It reflects the ranking quality of the model.
Accuracy is the proportion of the samples that are predicted correctly. A larger value indicates a better performance.
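The three metrics can be sketched in NumPy as below. Labels are assumed to be +1 and -1, the decision threshold of 0 for accuracy is an assumption, and AUC is computed as the fraction of correctly ordered positive-negative pairs (ties counted as half), which is equivalent to the area under the ROC curve. The toy labels and predictions are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between predictions and labels."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

def accuracy(y_true, y_pred, threshold=0.0):
    """Share of correct predictions; the threshold of 0 is an ASSUMPTION."""
    predicted = np.where(y_pred >= threshold, 1, -1)
    return float(np.mean(predicted == y_true))

def auc(y_true, y_pred):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = y_pred[y_true == 1]
    neg = y_pred[y_true == -1]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = np.array([1, -1, 1, -1])
y_pred = np.array([0.9, -0.8, 0.2, 0.4])
scores = (rmse(y_true, y_pred), accuracy(y_true, y_pred), auc(y_true, y_pred))
```

The pairwise AUC shown here is quadratic in the number of samples and is meant for clarity; a rank-based computation is preferable on large test sets.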
In addition, we use the number of trainable parameters (Param#) to measure the complexity of different models and the training efficiency. A model with fewer parameters generally takes less time to train.

Implementation Environment.
We develop the RFM model in the Python programming language, and Table 1 shows the specifications of the environment in which the model was trained.

Baselines.
We compare RFM with the following competitive methods that are designed for sparse data and CTR prediction in recommender systems.
(i) FM [5]: FM has shown a good performance for personalized recommendation and context-aware prediction, and it can effectively capture second-order feature interaction information. It is the foundation of many deep neural network models. We use the official C++ implementation (https://www.libfm.org/) for FM.
(ii) HOFM [7]: This is the enhanced version of FM, which can capture high-order feature interaction information. We use the TensorFlow implementation of the high-order factorization machines.
(iv) DeepCross [14]: This model concatenates embedding vectors, followed by a multilayered residual network. With the residual structure, the network can prevent gradient explosion and vanishing problems when the network deepens.
(v) DeepFM [15]: This model consists of one FM component and one deep component. It combines the power of factorization machines and deep learning to emphasize both low- and high-order feature interactions. The two components share the embedding parameters.
(vi) HoAFM [30]: It uses a cross interaction layer to update a feature's representation by aggregating other co-occurring features and performs a bit-wise attention mechanism on the granularity of dimensions.
(vii) PIN [28]: This method extends FM with kernel product methods to learn field-aware feature interactions and adopts a feature extractor to explore feature interactions to tackle the insensitive gradient issue.

Performance Comparison.
Based on our investigation of the hyperparameters on the validation set, we set the default hyperparameter values of our RFM method. We set the embedding size d to 256 and the factor β to 1 by default in the three datasets. The default value of the similarity threshold r is 0.8 in Frappe and MovieLens, and it is 0.2 in Criteo. The top-t values are set to 6, 1, and 11 by default in the datasets of Frappe, MovieLens, and Criteo, respectively. The default value of the region number is 2 in Frappe and MovieLens, and it is 2^6 in Criteo. We will demonstrate how to obtain those values in Section 5.7. We set the initial learning rate to 0.01 and use Adagrad [56] as the model optimizer for RFM, since Adagrad adapts the learning rate during the training phase and eases the work of choosing a proper learning rate. For the other methods or models, we use the default learning rate configuration specified in their source code or their articles.
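The way Adagrad adapts the learning rate can be illustrated with a minimal sketch of its standard update rule: each parameter accumulates its squared gradients, and the initial learning rate (0.01 here, as in our setting) is divided by the square root of that accumulator. The function name adagrad_step, the stabilizing constant eps, and the toy quadratic objective are assumptions for illustration; this is not the training code of RFM.

```python
import math


def adagrad_step(params, grads, accum, lr=0.01, eps=1e-8):
    """Apply one Adagrad update in place and return (params, accum).

    accum[i] holds the running sum of squared gradients of params[i];
    eps is an assumed small constant that avoids division by zero.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum


# Toy objective (hypothetical): f(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2.
w = [0.0, 0.0]
acc = [0.0, 0.0]
for step in range(5):
    grad = [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]
    adagrad_step(w, grad, acc)
    print(step, w)
```

In the first step, both coordinates move by roughly the same amount even though their gradients differ by a large factor, which is why a single initial learning rate can serve parameters with very different gradient scales.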
We compare the performance of our RFM method with that of the different baselines. Table 2 summarizes the performance and the scale of trainable parameters obtained with embedding size 256.
According to Table 2, we have the following observations:
(i) RFM has the same scale of trainable parameters as FM. However, RFM performs better than FM by average improvements of 7.8%, 7.0%, and 1.6% in RMSE, AUC, and accuracy, respectively. This demonstrates the effectiveness of the neighbor-based local information, which enhances the original FM.
(ii) HOFM uses a separate set of embeddings to model high-order feature interactions and achieves better performance than FM in the datasets of MovieLens and Criteo. However, the performance of HOFM is worse than that of FM in the dataset of Frappe. The reason is probably that although the high-order (more than 2) feature interactions can provide useful information, they also introduce noisy information at the same time. Also, HOFM doubles the scale of parameters and incurs more training overhead.
(iii) Wide&Deep and DeepCross take the feature embedding (Section 3.1) as the input of deep neural networks, which may miss the second-order feature interaction information if the embedding parameters are not initialized by a pretrained FM model [16]. Thus, both of them have nearly the worst performance. Furthermore, Wide&Deep and DeepCross have the most parameters.
(iv) DeepFM can combine the knowledge from the feature embedding (Section 3.1) and the second-order interaction embedding (Section 3.2) in the FM component, as well as high-order feature interactions from the deep component, with shared embedding parameters. Thus, DeepFM generally has a good performance in the three datasets. However, it sometimes performs poorly, especially for the metric of RMSE in the MovieLens dataset, because MovieLens has only three fields and DeepFM may not capture enough useful feature interaction information. Besides, the deep component of DeepFM leads to more trainable parameters and decreases the efficiency.
(v) HoAFM captures the high-order feature interactions in an explicit manner with the attentive FM. It is comparable to our RFM in the metrics of AUC and accuracy for the dataset of MovieLens but has a worse performance in the other datasets.
(vi) PIN obtains a good performance in the dataset of Criteo but performs poorly in the datasets of Frappe and MovieLens, which shows that the adaptive embeddings learned by the kernel product are sometimes not effective.

ALGORITHM 1: The algorithm of neighbor retrieval.
Input: c_g, emb_0, r, t
Output: the list of (y_i, sim_i): N
initialize a container neighbors;
for (x, y) in c_g do
    α = sim(x, emb_0)
    if α ≥ r then
        add (y, α) to neighbors;
    end
end
N = selectTop(neighbors, t);
return N;
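The neighbor retrieval procedure of Algorithm 1 can be sketched in Python as follows. The sketch assumes that sim is cosine similarity and that selectTop keeps the t neighbors with the highest similarity; both are assumptions for illustration, as are the function names and the toy embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_neighbors(region, query_emb, r, t, sim=cosine_similarity):
    """Return up to t (label, similarity) pairs with similarity >= r.

    `region` plays the role of c_g: an iterable of (embedding, label)
    pairs. `query_emb` plays the role of emb_0.
    """
    neighbors = []
    for x, y in region:
        alpha = sim(x, query_emb)
        if alpha >= r:
            neighbors.append((y, alpha))
    # Assumed behavior of selectTop: keep the t most similar neighbors.
    neighbors.sort(key=lambda pair: pair[1], reverse=True)
    return neighbors[:t]


# Hypothetical toy region with two-dimensional embeddings.
region = [
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
    ([1.0, 0.1], 1),
    ([1.0, 1.0], 0),
]
print(retrieve_neighbors(region, [1.0, 0.0], r=0.8, t=5))
```

In this toy region only two samples pass the threshold, so any t of 2 or more returns the same list. The threshold r therefore caps the number of neighbors that t can actually select.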
Overall, our proposed RFM model achieves the best performance among these models in RMSE and AUC due to the enhancement of neighbor-based local information. RFM also has the same number of parameters as FM, which yields the best training efficiency.

The Effectiveness of Local Information.
We take four examples from the testing sets to qualitatively analyze the effectiveness of neighbor-based local information in our approach. Table 3 shows the four examples. In addition, Table 4 shows the percentages of testing samples for which the neighbors correct or worsen the output of FM in the three datasets.
In Table 3, the first column is the ground-truth label of the given samples, and the second column is the original FM output y_g(x). The third column represents the similarities between the given samples from the testing sets and their corresponding neighbors retrieved from the training set. The fourth column contains the labels of the neighbors. The last column is the final output obtained by fusing the local knowledge and the original FM output. Clearly, the original FM outputs y_g(x) deviate from the true labels for the four examples, which means that the global knowledge learned by FM cannot model the click behavior correctly for these four examples. However, the neighbors from the training set, whose similarities are more than 0.8 (in the third column), can provide useful local knowledge with correct labels (as shown in the fourth column). Then, we use such local knowledge to correct the original FM output y_g(x) and obtain the final results (as shown in the last column). Intuitively, in a real scenario, the original FM may predict that a user would not click an item because most users in the training set did not click it. However, if several other users with characteristics similar to this user clicked the item, this user is also likely to click it. In this way, the neighbor-based local information can represent personal preference knowledge and is effective in enhancing the FM model.
Besides, we further record the percentages of three types of results from the testing sets, as shown in Table 4. The keywords "Better" and "Worse" mean that the fusion result is better or worse than y_g(x), and the keyword "Equal" means that the fusion result is the same as y_g(x). We can see that RFM corrects most of the mistakes of y_g(x) in the testing sets and has only a small negative impact at the same time.
In the datasets of Frappe and MovieLens, the percentages of the worse cases are much smaller. In Criteo, the percentage of the worse cases is higher than in the other two datasets. However, the share of positive cases (Better, 63.17%) is much larger than that of negative ones (Worse, 35.87%). Therefore, RFM enhances the overall performance of FM in Criteo. The percentages of the three types of results further illustrate that the neighbor-based local information captured by RFM is effective.
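One way to compute the three percentages reported in Table 4 is sketched below. The sketch assumes that "Better" means the fused output is strictly closer to the ground-truth label than y_g(x) in absolute error, that "Equal" means the two outputs coincide up to a small tolerance, and that every remaining case counts as "Worse". This criterion, the function name, and the toy values are assumptions for illustration; the exact criterion behind Table 4 may differ.

```python
def tally_outcomes(labels, fm_outputs, fused_outputs, tol=1e-12):
    """Return the percentages of Better / Worse / Equal cases.

    Equal:  the fused output equals the FM output (within tol).
    Better: the fused output is strictly closer to the label.
    Worse:  every other case (the fused output is not closer).
    """
    if not labels:
        raise ValueError("at least one sample is required")
    counts = {"Better": 0, "Worse": 0, "Equal": 0}
    for y, g, f in zip(labels, fm_outputs, fused_outputs):
        if abs(f - g) <= tol:
            counts["Equal"] += 1
        elif abs(f - y) < abs(g - y):
            counts["Better"] += 1
        else:
            counts["Worse"] += 1
    total = len(labels)
    return {name: 100.0 * c / total for name, c in counts.items()}


# Hypothetical labels, FM outputs, and fused outputs for four samples.
labels = [1, 0, 1, 1]
fm_out = [0.2, 0.3, 0.9, 0.5]
fused = [0.8, 0.6, 0.9, 0.5]
print(tally_outcomes(labels, fm_out, fused))
```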

The Number of Regions.
We partition the samples in the training set into multiple regions for the sake of efficiency and performance. This not only accelerates the process of retrieving similar samples but also introduces better region features. The former increases the efficiency, and the latter improves the effectiveness. We assign 2^n to the region number k, where n ranges from 0 to 7 with step size 1. Figure 2 shows the influence of the region number on the performance of RFM for different datasets. We can see that the region partition influences the performance and may improve it to some extent with a proper number of regions. Without region partition (i.e., n = 0 and the region number is 1), the model retrieves neighbors by traversing all samples in the training set. We observe that the performance in this case is never the best.
When partitioning samples into multiple regions, each region has its own center vector for representing the common characteristics of the samples in it. Figures 2(a) and 2(b) show that partitioning samples into two regions gives the best performance. Continuing to increase the region number reduces the effectiveness, since dividing samples into too many regions may weaken the ability of a region to represent the common characteristics of the samples belonging to it. In Figure 2(c), we have the best performance when n = 6, and the curve fluctuates frequently, which indicates that the characteristics of samples in Criteo are highly diverse. Intuitively, when the region number is more than 1, the retrieved similar samples may not be the most similar ones from the whole training set, which could decrease the performance. However, we can see that the region partition can improve the performance as well. That is because the clustering technique can reveal the intrinsic nature and regularity in the training set [46,47], so the most similar samples retrieved from the same region may contain more effective and generalizable information than those from the whole training set.
We also measure the efficiency of our RFM method for different region numbers. In Figure 3, we show the average prediction time (APT) for different region numbers, where APT is the average time to predict one sample in the testing set. For clarity, we take the natural logarithm of APT. As shown in Figure 3, increasing the region number reduces the log-scaled APT roughly linearly, since the region partition decreases the number of target samples for the retrieval of neighbors in proportion to the region number. If the region number stays the same, the APT generally depends on the sizes of the training sets in the different datasets. For example, the MovieLens dataset is the largest, and thus, it needs more time to retrieve neighbors than the other two datasets.
When the region number is more than 2^5, the APT of Criteo is less than that of Frappe, which is probably because different regions have different numbers of samples. In Criteo, the number of samples in the most similar region is smaller than that in Frappe, although Criteo has more samples in the whole training set. In summary, when considering the performance and prediction efficiency together, we can find the region number that offers a good trade-off between performance and efficiency for a specific dataset. For example, we can choose n = 6 as the best region partition for the Criteo dataset since it has the best performance with high prediction efficiency.
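The efficiency gain from the region partition can be illustrated with a minimal sketch: each training sample is assigned to its most similar region center, and at prediction time only the samples of the region closest to the query are scanned. The sketch assumes that the region centers are already available (for example, from a clustering step) and that cosine similarity is used to compare a sample with a center; the function names and toy data are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_center(emb, centers):
    """Index of the region center most similar to emb."""
    return max(
        range(len(centers)),
        key=lambda i: cosine_similarity(emb, centers[i]),
    )


def build_regions(train, centers):
    """Group (embedding, label) pairs by their most similar center."""
    regions = [[] for _ in centers]
    for emb, label in train:
        regions[nearest_center(emb, centers)].append((emb, label))
    return regions


def candidates_for(query_emb, regions, centers):
    """Samples to scan for a query: only those of its most similar region."""
    return regions[nearest_center(query_emb, centers)]


# Hypothetical centers and training samples (k = 2 regions).
centers = [[1.0, 0.0], [0.0, 1.0]]
train = [
    ([1.0, 0.1], 1),
    ([0.8, 0.3], 1),
    ([0.1, 1.0], 0),
    ([0.2, 0.7], 0),
]
regions = build_regions(train, centers)
scan = candidates_for([0.9, 0.2], regions, centers)
print(len(scan), "of", len(train), "training samples are scanned")
```

With k regions of comparable size, a query scans about 1/k of the training set plus k center comparisons, which matches the observation that the retrieval cost shrinks as the region number grows. Regions of unequal size explain why the cost also depends on which region a query falls into.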

Embedding Size.
The size of the second-order feature interaction embedding may have an impact on the performance. As shown in Figure 4, we evaluate our RFM model and the baselines in the Frappe dataset with different embedding sizes. To highlight the sensitivity, we use a small step of 16 and show a small range of embedding sizes around 256, where RMSE changes. We find that RFM achieves the best performance compared to the other methods for all tested embedding sizes. Among them, RFM performs best when the embedding size is 256. We only show the RMSE for different embedding sizes in the Frappe dataset since the other two datasets have a similar trend.

Top-t Similar Neighbors.
The hyperparameter t determines the number of neighbors used as the local information for fusion. Figure 5 shows the RMSE of our model with regard to different top-t values.
As shown in Figure 5, the impact of the top-t value differs across datasets. The curve in Figure 5(a) is convex overall with local oscillations. A too small t cannot introduce enough local knowledge from neighbors, and a too large t may introduce noisy neighbors. Thus, t = 6 is the optimal value. Figure 5(b) shows that RMSE increases as t grows. This is because the number of fields in MovieLens is only three, and neighbors contain less feature interaction information. When the number of neighbors increases, it may introduce more noisy data. By contrast, RMSE decreases as t grows in Figure 5(c), since Criteo has the most fields, and its neighbors contain the richest useful interaction information for the following fusion. The curves in Figures 5(b) and 5(c) become horizontal near the end due to the limitation introduced by the similarity threshold.

The Similarity Threshold.
The similarity threshold r is a key hyperparameter for filtering out noisy neighbors. We investigate the influence of r on RMSE. Figure 6 shows the result.
As shown in Figure 6, if the similarity threshold is too low, it will introduce more noisy data and decrease the effectiveness of our model. If the similarity threshold is too high, it will filter out some valuable knowledge and decrease the effectiveness as well. In Frappe and MovieLens, 0.8 is the best similarity threshold, which offers a good trade-off between filtering out noisy data and introducing valuable knowledge. The RMSEs for thresholds below 0.8 are stable since we always select the same top-t neighbors. For the curve of Criteo, 0.2 is the best similarity threshold since the sample similarities are relatively small.

Conclusion
How to predict click-through rate (CTR) accurately is an important problem in many recommendation scenarios. In this work, we proposed a novel solution called retrieval-based factorization machine (RFM), which aims to predict CTR by combining global knowledge learned from the FM model with neighbor-based local information. We conducted experiments on three public datasets to evaluate RFM, and the experimental results show that our RFM model outperformed existing models with a simple and efficient architecture. The results also indicate that using local information properly can enhance the overall performance of CTR prediction tasks.
More generally, the idea of fusing global and local information in this paper can be applied in other domains, including some dense data tasks. There are two interesting directions for future study. One is exploring strategies to retrieve effective neighbors and filter out noisy data more efficiently and accurately. The other is how to better combine the local information with the original FM model to predict human click behavior more effectively.

Data Availability
The datasets used in this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.