Attention with Long-Term Interval-Based Deep Sequential Learning for Recommendation



Introduction
Due to the increasing abundance of information on the Web, helping users filter information according to their preferences is increasingly necessary, which emphasizes the importance of personalized search and recommendation [42][43][44][45]. Traditional methods for providing personalized content, such as item-item collaborative filtering [33], did not take into account the dynamics of user behaviors, which have recently been recognized as important factors. For example, to predict a user's next action, such as the next product to purchase, profiling both the long-term preferences and short-term intents of the user is required, and modeling the user's behaviors as sequences provides key advantages here. Nonetheless, modeling sequential user behaviors with a temporal dimension raises even more challenges than modeling them without it. How to identify the correlation and dependence among actions is one of the difficult issues. This problem has been studied extensively, and many methods based on the Markov process assumption, such as the Factorizing Personalized Markov Chain [32] and the Hierarchical Representation Model [41], have been designed and adopted in different tasks [11,15]. These methods usually focus on factor models, which decompose the sparse user-item interaction matrix into low-dimensional matrices with latent factors. However, for modeling sequential information, it is often not clear how to integrate the dynamics of user intents into the framework of factor models.
Recently, neural-network-based algorithms have received much attention from researchers [6,9,18,30,48]. For instance, many kinds of graph neural networks (GNNs), instead of matrix factorization (MF) based algorithms [36], have been proposed to learn graph embeddings due to their ability to learn from non-Euclidean spaces [30]. Many kinds of recurrent neural networks (RNNs) have been proposed to model user behaviors due to their powerful descriptive ability for sequential data [14,29,39,47,52]. For example, Hidasi et al. [14] propose an approach based on a number of parallel RNNs with rich features to model sequential user behaviors. Wu et al. [47] endow both users and movies with a long short-term memory (LSTM) autoregressive model to predict users' future behaviors. Furthermore, to better utilize temporal information, Zhu et al. [52], Neil et al. [29], and Vassøy et al. [39] introduce time intervals between sequential actions into RNN cells to update and forget information, rather than only considering the order of actions.
Despite the success of the abovementioned RNN-based methods, several limitations make it difficult to apply them to the wide variety of applications in the real world. One inherent assumption of these methods is that the importance of historical behaviors decreases over time (e.g., equation (15) in [52]), which is also an intrinsic property of RNN cells such as gated recurrent units (GRU) and long short-term memory (LSTM). However, this assumption does not always hold in practice, where sequences may have complex cross-dependence [46]. For example, a user's online actions are not straightforward but contain much noise and randomness. See Figure 1 for an illustration, which shows a real sequence of items clicked by a user on one of the largest online shopping websites. We can conjecture that the user intended to buy a T-shirt in honor of LeBron James and finally bought Item 6. But the user also viewed items of a different kind (shoes, as Items 3 and 4). Clearly, Items 1 and 2 are more important than Items 3 and 4 for predicting the final deal, although the former two items occur earlier than the latter two in the temporal dimension. This example illustrates the difficulty of analyzing sequential user behaviors, where the simple assumption of time interval-based correlation between actions is not enough.
In this paper, we are inspired by the attention mechanism proposed for natural language processing [1,49], which has achieved remarkable progress in the past few years. An attention mechanism introduced into deep networks provides the ability to focus on portions of the input data or features to fulfill the given task. Similarly, we expect that a trained attention mechanism helps identify the important correlated actions in sequential user behaviors to make predictions. However, the existing attention mechanism is inefficient at modeling sequential user behaviors. Hence, we design a new attention mechanism specifically for our purpose.
Specifically, we propose a network featuring Attention with Long-term Interval-based Gated Recurrent Units (ALI-GRU) for modeling sequential user behaviors to predict a user's next action. The network is depicted in Figure 2. We adopt a series of bidirectional GRUs to process the sequence of items that the user has accessed. The GRU cells in our network include not only normal GRUs but also time interval-based GRUs, where the latter reflect the short-term information of time intervals. In addition, the features extracted by the bidirectional GRUs are used as the input of the attention model, and the attention distribution is calculated at each timestamp rather than as a single vector as in the Seq2Seq model [1,46]. Therefore, this attention mechanism is able to consider long-term correlation along with short-term intervals. Our designed attention mechanism is detailed in Section 4.
We have performed a series of experiments using both well-known public datasets (LastFM and CiteULike [52]) and a dataset we collected from real-world data. Extensive results show that our proposed ALI-GRU outperforms the state-of-the-art methods by a significant margin on these datasets. Moreover, ALI-GRU has been deployed online, and the results of an online A/B test further demonstrate its practical value in comparison with a well-optimized baseline in a real-world e-commerce search engine.
This paper makes the following contributions: (i) First, we propose a bidirectional time interval-based GRU to model the long- and short-term information of user actions for better capturing the temporal dynamics between actions. The time interval-based GRU is able to effectively extract the short-term dynamics of user intents as driven signals of the attention function and to refine the long-term memory with contextual information. (ii) Second, we design a new attention mechanism to encode long- and short-term information and identify complex correlations between actions, which attends to the driven signals at each time step along with the embedding of contextual information. This mechanism is less affected by the noise in historical actions and robustly extracts the important correlated information between sequential user behaviors to make a better prediction. (iii) Third, we conduct a series of experiments on two well-known public datasets and a large-scale dataset constructed from a real-world e-commerce platform. Extensive experimental results show that our proposed ALI-GRU obtains significant improvements over state-of-the-art RNN methods. In addition, ALI-GRU has been deployed, and the results of an online A/B test further demonstrate its practical value in a real-world e-commerce search engine. The remainder of this paper is organized as follows. Section 2 discusses related work. The problem of modeling sequential user behaviors is formulated in Section 3, followed by a detailed description of our proposed ALI-GRU in Section 4. Experimental results are reported in Section 5, and concluding remarks are given in Section 6.

Related Work
We give a brief overview of related work in two aspects: the modeling of sequential user behaviors and the attention mechanism.

Modeling Sequential User Behaviors.
Due to the significance of user-centric tasks such as personalized search and recommendation, modeling sequential user behaviors has attracted great attention in both industry and academia. Most of the pioneering work relies on model-based Collaborative Filtering (CF) to analyze the user-item interaction matrix. There have been a variety of such algorithms, including Bayesian methods [28] and matrix factorization (MF) methods [31,50]. Due to the characteristics of sequential information, several CF works take temporal dynamics into account, often based on the assumption of Markov processes [11,21,32]. For the task of sequential recommendation, Rendle et al. [32] propose the Factorizing Personalized Markov Chain (FPMC) to combine matrix factorization of the user-item matrix with Markov chains. He and McAuley [11] further integrate similarity-based methods [20] into FPMC to tackle the problem of sequential dynamics. The major problems with the abovementioned work are that these methods independently combine several components, rely on low-level hand-crafted features of users or items, and have difficulty handling long-term behaviors. In contrast, along with the development of deep neural networks, Lei et al. [22] and Zheng et al. [51] employ deep learning to learn effective representations of users/items automatically. Furthermore, with the success of recurrent neural networks (RNNs) in the past few years, a number of works have attempted to utilize RNNs [14,25,47]. For example, Liu et al. [25] jointly incorporate contextual information, such as weather, into the RNN architecture to improve modeling performance. The insight behind the success of RNN-based solutions in modeling sequential user behaviors is that RNNs have a well-demonstrated ability to capture patterns in sequential data.
Recent studies [10,29,39,52] also indicate that time intervals within a sequential signal are a very important clue for updating and forgetting information in RNN architectures. Zhu et al. [52] design several time gates in LSTM units to improve modeling performance. He et al. [10] embed items into a "transition space," where users are modeled as translation vectors operating on item sequences. Liu et al. [26] employ adaptive context-specific input matrices and adaptive context-specific transition matrices to capture, respectively, external situations and how the lengths of time intervals between adjacent behaviors in historical sequences affect the transition of global sequential features. But, in practice, there is complex dependence and correlation between sequential user behaviors, which requires a deeper analysis of the relations among behaviors rather than simply modeling their presence, order, and time intervals. To summarize, how to design an RNN architecture that models sequential user behaviors effectively is still a challenging open problem.

Attention Mechanism.
The attention mechanism is now a commonly adopted ingredient in various deep learning tasks such as machine translation [1,27], image captioning [24], question answering [38], and speech recognition [5], and it has been shown to be effective at capturing the contribution and correlation between different components in a network. The success of the attention mechanism is mainly due to the reasonable assumption that human beings do not tend to process an entire signal at once; instead, they only focus on selected portions of the entire perception space when and where needed [17]. To avoid the limitation of normal networks that the entire source must be encoded into one hidden layer, an attention-based network contains a set of hidden representations that scales with the size of the source. The network learns to assign attention weights to perform a soft selection of these representations.
With the development of the attention mechanism, recent research has started to leverage different attention architectures to improve the performance of related tasks [1,3,34,40,46,49]. For example, Bahdanau et al. [1] conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of the basic encoder-decoder architecture; therefore, they design a model that automatically searches for the parts of a source sentence that are relevant to predicting a target word. Yang et al. [49] propose a hierarchical attention network at the word and sentence levels, respectively, to capture the contributions of different parts of a document. Vaswani et al. [40] utilize a multihead attention mechanism to improve performance. Wang et al. [46] propose a coverage strategy to combat the misallocation of attention caused by the memorylessness of the traditional attention mechanism. Nevertheless, most previous work calculates the attention distribution according to the interaction of every source vector with a single embedding vector of contextual or historical information (such as the translated words in a sentence), which may lead to information loss caused by early summarization and to noise caused by incorrect previous attention. In particular, Shen et al. [35] propose an attention-based language understanding method without any other network structure (e.g., RNN). In [35], the input sequence is processed by directional (forward and backward) self-attentions to model context dependency and produce context-aware representations for all tokens. Then, a multidimensional attention computes a vector representation of the entire sequence.
Indeed, the attention mechanism is very important to the task of modeling sequential user behaviors. However, to the best of our knowledge, only a few works concentrate on this paradigm. Chen et al. [2] incorporate the attention mechanism into a multimedia recommendation task with a multilayer perceptron. Song et al. [37] propose a recommender system for online communities based on a dynamic-graph-attention neural network. They model dynamic user behaviors with a recurrent neural network and context-dependent social influence [23] with a graph-attention neural network, which dynamically infers the influencers based on users' current interests. In this paper, we investigate an effective attention-based solution for better modeling sequential user behaviors.

Problem Formulation
We start our discussion with the definitions of some notations. Let U be a set of users and let I be a set of items in a specific service, such as products on online shopping websites. For each user u ∈ U, his/her historical behaviors are given by H^u = ((i^u_1, t^u_1), (i^u_2, t^u_2), …, (i^u_{N_u}, t^u_{N_u})), where N_u denotes the number of the user's actions and (i^u_k, t^u_k) denotes the interaction between user u and item i^u_k at time t^u_k; an interaction has different forms in different services, such as clicking, browsing, and adding to favorites. The objective of modeling sequential user behaviors is to predict the conditional probability of the user's next item, p(i^u_{N_u+1} | H^u, t^u_{N_u+1}), for a given user u.
We take an RNN as the basic model, which generates the conditional probability in multiple steps sequentially. At step k, the k-th item i^u_k is vectorized into x_k and then fed into the RNN units through a nonlinear transformation, e.g., a multilayer perceptron. Then, it updates the hidden state of the RNN units, i.e., h_k = RNN(x_k, h_{k−1}), as well as the output of the RNN units. The representations of the hidden state and output are trained to predict the next item, vectorized as x_{k+1}, given h_k. To train the RNN, we aim to maximize the likelihood of the historical behaviors of a set of users U, i.e., ∏_{u∈U} ∏_k p(i^u_{k+1} | h^u_k), where i^u_{k+1} is the target item for the given user u. In other words, we aim to minimize the negative logarithmic likelihood, that is, the objective function J(θ) = − Σ_{u∈U} Σ_k log p(i^u_{k+1} | h^u_k),

where θ is the set of parameters in the RNN model. Fulfilling this learning requires us to design an effective RNN architecture, including the inner functions of the RNN cells and the overall network structure, which together approximate a highly nonlinear function for obtaining the probability distribution of the next item. In this process, the RNN usually suffers from the complex dependency problem, especially when we deal with user actions that contain much noise and randomness. The attention mechanism is a possible solution, which constructs a pooling layer on top of the RNN cells at each step to characterize the dependence between the current intent and all of the historical actions. We describe our designed network architecture with the attention mechanism in the next section.
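The recurrence h_k = RNN(x_k, h_{k−1}) and the negative log-likelihood objective can be sketched in a few lines of plain Python. The scalar hidden state, the hand-picked weights, and the toy candidate-item scores below are illustrative assumptions, not the paper's configuration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One GRU step with a scalar hidden state; the weights are made-up toy values.
def gru_step(x, h_prev,
             Wz=0.5, Uz=0.3, Wr=0.4, Ur=0.2, Wh=0.6, Uh=0.1):
    z = sigmoid(Wz * x + Uz * h_prev)               # update gate
    r = sigmoid(Wr * x + Ur * h_prev)               # reset gate
    h_cand = math.tanh(Wh * x + Uh * (r * h_prev))  # candidate activation
    return (1 - z) * h_prev + z * h_cand            # h_k = RNN(x_k, h_{k-1})

def nll_next_item(scores, target):
    # negative log-likelihood -log p(target | h) for a softmax over item scores,
    # computed with the log-sum-exp trick for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[target]

h = 0.0
for x in [0.2, -0.1, 0.7]:        # toy 1-D item vectors x_k
    h = gru_step(x, h)
loss = nll_next_item([h, 0.0, -h], target=0)  # toy scores for 3 candidate items
```

Training sums this per-step loss over all users and steps and minimizes it with respect to the parameters θ.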

ALI-GRU
As illustrated in the left part of Figure 2, our designed network features an attention mechanism with long-term interval-based gated recurrent units for modeling sequential user behaviors. This network architecture takes the sequence of items as the raw signal. There are four stages in our network. The embedding layer maps items to a vector space to extract their basic features. The bidirectional GRU layer is designed to capture the information of both the long-term preferences and the short-term intents of the user; it consists of normal GRUs and time interval-based GRUs (see Figure 3). The attention function layer reflects our carefully designed attention mechanism, which is illustrated in the right part of Figure 2. Finally, an output layer integrates the attention distribution and the extracted sequential features and utilizes normal GRUs to predict the conditional probability of the next item.

Embedding Layer.
The purpose of the embedding layer is to map the raw data of items into a rectified vector space, where the vectorized representations of items still keep the semantics of the items; e.g., semantically relevant items have a small distance in the vector space. Usually, items can first be represented as one-hot vectors and then processed by several fully connected layers [52]. If, however, the number of items is too large, a pretrained encoding network is useful for processing the items, which encodes not only basic properties, such as the category of items, but also crowdsourcing properties, such as the sales of items [4]. In this paper, we adopt these two strategies for different datasets, respectively.
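The one-hot-plus-fully-connected strategy amounts to selecting a row of the layer's weight matrix, which a minimal sketch makes explicit (the 3-item vocabulary and 2-dimensional matrix `W` are made-up examples):

```python
# One-hot encoding followed by a linear (fully connected, no bias) layer.
def one_hot(index, num_items):
    v = [0.0] * num_items
    v[index] = 1.0
    return v

def embed(x, W):
    # x: one-hot row vector; W: num_items x dim weight matrix.
    # Multiplying a one-hot vector by W simply selects the matching row of W.
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*W)]

W = [[0.1, 0.2],   # hypothetical 3-item vocabulary, 2-D embeddings
     [0.3, 0.4],
     [0.5, 0.6]]
vec = embed(one_hot(1, 3), W)   # embedding of item 1
```

In practice the rows of `W` are learned jointly with the rest of the network, so semantically related items end up close together.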

Bidirectional GRU Layer with Time-GRU.
This layer is designed to extract driven signals from the input sequence and to refine the long-term memory with contextual information. We now detail our method for these two targets.
In previous work on natural language processing tasks, the attention function is driven by a single vector of input [1,27,40].
That model works well because of the relatively stable syntax and semantics of the input words. However, sequential user behaviors contain much noise and randomness, which makes the simple model problematic. We propose a new network structure with time-GRU to extract the short-term dynamics of user intents as the driven signal of the attention function. The structure of time-GRU in comparison with normal GRU is shown in Figure 3, where the black lines denote the network links of normal GRU and the red lines denote the new links of time-GRU. The normal GRU equations are as follows:

z_N = σ(W_z I_N + U_z h_{N−1} + b_z),
r_N = σ(W_r I_N + U_r h_{N−1} + b_r),
h̃_N = tanh(W_h I_N + U_h (r_N ⊙ h_{N−1}) + b_h),
h_N = (1 − z_N) ⊙ h_{N−1} + z_N ⊙ h̃_N,

where I_N denotes the N-th sequence item vector, h_{N−1} denotes the (N − 1)-th hidden state vector, and h̃_N is the candidate activation. z_N represents the update gate, which decides how much the unit updates its activation. r_N is the reset gate, which controls how much the last state contributes to the current activation. σ represents the sigmoid nonlinearity, tanh represents the tanh nonlinearity, and ⊙ is element-wise multiplication. Weight parameters W_z, W_h, W_r and U_z, U_h, U_r connect the different inputs and gates; parameters b_z, b_h, b_r are biases. The above equations imply that normal GRU is good at capturing general sequential information. Since GRU was originally designed for NLP tasks, it does not consider the time intervals within the inputs, which are very important for modeling sequential user behaviors. To include the short-term information, we augment the normal GRU with a time gate T_N:

T_N = σ(W_t Δt_N + b_t), with W_t < 0,

where Δt_N is the time interval between adjacent actions. The constraint W_t < 0 encodes the simple assumption that a smaller time interval indicates a larger correlation. Moreover, we generate a time-dependent hidden state h^t_N in addition to the normal hidden state h_N; that is,

h^t_N = (1 − z_N ⊙ T_N) ⊙ h^t_{N−1} + (z_N ⊙ T_N) ⊙ h̃_N,

where we utilize the time gate as a filter to modify the update gate z_N so as to capture short-term information more effectively.
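A minimal sketch of a single time-GRU step, in plain Python with scalar states. The parameter values, and the exact way the time gate T_N filters the update gate in the time-dependent state, are illustrative assumptions based on the description above:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# One scalar-state time-GRU step; all weight values here are made-up.
def time_gru_step(x, dt, h_prev, ht_prev,
                  Wz=0.5, Uz=0.3, Wr=0.4, Ur=0.2, Wh=0.6, Uh=0.1,
                  Wt=-0.8, bt=1.0):
    z = sigmoid(Wz * x + Uz * h_prev)               # update gate
    r = sigmoid(Wr * x + Ur * h_prev)               # reset gate
    h_cand = math.tanh(Wh * x + Uh * (r * h_prev))  # candidate activation
    T = sigmoid(Wt * dt + bt)                       # time gate; Wt < 0, so a
                                                    # smaller interval opens it wider
    h = (1 - z) * h_prev + z * h_cand               # normal hidden state
    zt = z * T                                      # time-filtered update gate
    ht = (1 - zt) * ht_prev + zt * h_cand           # time-dependent hidden state
    return h, ht
```

With these weights, an item followed after a short gap (small `dt`) influences the time-dependent state `ht` much more than the same item after a long gap, while the normal state `h` is unaffected by `dt`.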
In addition, we want to utilize contextual information to extract long-term information with as little information loss as possible. Recent methods usually construct a bidirectional RNN and add or concatenate the two output vectors (forward and backward) of the bidirectional RNN. A bidirectional RNN outperforms a unidirectional one but still suffers from embedding loss, since the temporal dynamics are not sufficiently considered. Instead, we propose to combine the output of the forward normal GRU (h_N in equation (6)) with all the outputs of the backward GRU at different steps (the output of the backward GRU at step k is denoted by h←_k in Figure 2). Specifically, we produce concatenated vectors H_k = [h_N ; h←_k] for k = 1, …, N − 1, which serve as the contextual long-term information.

Attention Function Layer.
The attention function layer is responsible for linking and analyzing the dependence and contribution over the driven signals and the contextual long-term information provided by the previous layers. Unlike previous attention mechanisms, we do not simply summarize the contextual long-term information into individual feature vectors, e.g., using p(i_1, h_{N−1}) to calculate the attention weight of item i_1 with respect to the hidden state at the (N − 1)-th step [46]. Instead, we attend to the driven signals at each time step along with the embedding of contextual information. Specifically, as shown in the right part of Figure 2 and as already discussed in the last subsection, we use H_k ∈ R^{2d}, where d is the dimension of the GRU states, to represent the contextual long-term information, and h^t_k ∈ R^d denotes the short-term intent reflected by item i_k. We then construct an attention matrix A ∈ R^{(N−1)×(N−1)}, whose elements are calculated by A_{j,k} = w(H_j, h^t_k), where the attention weight w(H_j, h^t_k) = v^T tanh([H_j ; h^t_k]) is adopted to encode the two input vectors and v are the weight parameters. There is a pooling layer, for example, average or max pooling, along the direction of long-term information, and then a Softmax layer normalizes the attention weights of each driven signal. Let a_k be the normalized weight on h^t_k; then the attended short-term intent vectors, together with the item embeddings, form the output g(i_k, h^t_k) ∈ R^{4d} to the next layer, where i_k is the embedded vector of the item at the k-th step.
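The matrix-form attention can be sketched as follows for a toy sequence with N − 1 = 3 steps and d = 2, using average pooling along the long-term direction. The scoring function v^T tanh([H_j ; h^t_k]) and all numeric values are illustrative assumptions:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def score(v, ctx, sig):
    # attention weight v^T tanh([ctx ; sig]) -- an assumed form of w(H_j, h^t_k)
    return sum(vi * math.tanh(u) for vi, u in zip(v, ctx + sig))

h_fwd_last = [0.5, -0.2]                       # forward GRU output h_N
h_bwd = [[0.1, 0.3], [0.4, -0.1], [0.2, 0.2]]  # backward GRU outputs per step
H = [h_fwd_last + hb for hb in h_bwd]          # contextual vectors H_k in R^{2d}
h_t = [[0.3, 0.1], [0.0, 0.5], [-0.2, 0.4]]    # short-term driven signals h^t_k
v = [0.2, -0.3, 0.1, 0.4, 0.25, -0.15]         # weight parameters (made up)

# attention matrix A: long-term direction j (rows) vs. driven-signal direction k
A = [[score(v, H[j], h_t[k]) for k in range(3)] for j in range(3)]
pooled = [sum(A[j][k] for j in range(3)) / 3 for k in range(3)]  # average pooling
a = softmax(pooled)                            # normalized weights a_k
attended = [sum(a[k] * h_t[k][d] for k in range(3)) for d in range(2)]
```

Because every driven signal is scored against every contextual vector before pooling, no single summary vector has to carry all of the history.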
We want to emphasize the insight of our carefully designed attention mechanism described above, which differs from existing methods in reducing the loss of contextual information caused by early summarization. Furthermore, since the driven signals are attended to the long-term information at different steps, the attention can capture the trending change of the user's preferences, making it more robust and less affected by the noise in historical actions.

Output Layer.
Given g(i_k, h^t_k) produced by the attention function layer, we use a layer of normal GRUs to produce the embedding vector (h^o_k in Figure 2), which is expected to contain the contextual long-term information about all of the user's historical actions with respect to the single item, as well as the short-term intents. The embedding vector is then decoded to produce the final result. For example, we use a Softmax function after a fully connected layer to obtain the probability distribution over the items in the next action. If the number of candidate items is too large, we use a slightly different decoding function, which is detailed in Section 5.3.

Experiments
In this section, we first describe the datasets used and several state-of-the-art approaches that were compared as baselines. Then, we report and discuss the experimental results on the different datasets.

Datasets.
To verify our proposed ALI-GRU, we conduct a series of experiments on two well-known public datasets (LastFM (http://www.dtic.upf.edu/∼ocelma/MusicRecommendationDataset/lastfm-1K.html) and CiteULike (http://www.citeulike.org/faq/data.adp)). Additionally, we also perform offline and online experiments on real data from one of the largest online shopping websites. Table 1 shows the statistics of LastFM and CiteULike: (i) LastFM contains <user id, timestamp, artist id, song id> tuples collected from the Last.fm API (https://www.last.fm/api/). It represents the whole listening habits (till 5 May 2009) of 1000 users. We extract tuples <user id, song id, timestamp> from the original dataset to conduct experiments, where each song id represents an item and each tuple represents the action that the user user id listens to the song song id at time timestamp. (ii) CiteULike consists of tuples <user id, paper id, timestamp, tag>, where each tuple represents that the user user id annotates the paper paper id with tag at time timestamp. One user annotating one research paper (i.e., item) at a certain time may have several records, in order to distinguish different tags.
We merge these into one record and extract tuples <user id, paper id, timestamp> to construct the dataset as in [52].

Compared Approaches.
We compare ALI-GRU with the following state-of-the-art approaches for performance evaluation: (i) Factorized Sequential Prediction with Item Similarity Models (Fossil) [11]. This is a state-of-the-art factorized sequential prediction method based on Markov processes. Fossil also considers the similarity of explored items to those already consumed/liked by the user, which achieves a certain success in handling the long-tail problem. We have used the implementation provided by the authors (https://drive.google.com/file/d/0B9Ck8jw-TZUEeEhSWXU2WWloc0k/view). (ii) Basic GRU/Basic LSTM [7]. This method directly uses normal GRU/LSTM as the primary network. For a fair comparison, we set the network to use the same embedding layer and the same decoding function as our method. (iii) Session RNN [13]. Hidasi et al. propose an RNN-based method to capture contextual information according to sessions of user behaviors. In our experiments, we use a commonly adopted method described in [16] to identify sessions so as to adopt this baseline. (iv) Time-LSTM [52]. This method utilizes LSTM with time gates to model the pattern of sequential user behaviors. (v) SV1. This approach is designed to verify the effectiveness of our designed attention mechanism. SV1 is identical to ALI-GRU, with the only difference being that SV1 uses the attention mechanism provided in [1,46]. (vi) SV2. This approach is designed to verify the effectiveness of our proposed time-GRU for generating driven signals according to short-term information. Compared to ALI-GRU, the only difference is that SV2 uses the single item at each step (the embedded vector) to attend to the contextual information.
All RNN-based models are implemented with the open-source deep learning platform TensorFlow (https://www.tensorflow.org/). Training was done on a single NVIDIA Tesla P40 GPU with 8 GB of graphics memory.

Experiments on LastFM and CiteULike.
We first evaluate our method on two well-known public datasets for the task of sequential recommendation.

Datasets.
In this experiment, we use the same datasets as those adopted in [52], i.e., LastFM and CiteULike. Table 1 presents the statistics of these two datasets. Both datasets can be formulated as a series of tuples <user_id, item_id, timestamp>. Our target is to recommend songs in LastFM and papers in CiteULike for users according to their historical behaviors.
For a fair comparison, we follow the segmentation into training and test sets described in [52]. Specifically, 80% of the users are randomly selected for training, and the remaining users are used for testing. For each test user u with N_u historical behaviors, there are N_u − 1 test cases, where the k-th test case is to perform recommendation at time t^u_{k+1} given the user's previous k actions, with ground truth i^u_{k+1}. The recommendation can also be regarded as a multiclass classification problem. For more details, please refer to [52].
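The test-case construction above (a user's behavior sequence yields one case per prefix, each predicting the next item) can be sketched as:

```python
def make_test_cases(actions):
    # actions: one test user's item ids in chronological order;
    # the k-th case: first k actions as input, the (k+1)-th item as ground truth
    return [(actions[:k], actions[k]) for k in range(1, len(actions))]

# a user with 3 behaviors yields 2 test cases (item ids are made up)
cases = make_test_cases([7, 3, 9])
```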

Implementation.
Following the method in [52], we use one-hot representations of items as inputs to the network and one fully connected layer with 8 nodes for embedding. The length of the hidden states of the GRU-related layers, including both normal GRU and time-GRU, is 16. A Softmax function is used to generate the probability prediction of the next items. For training, we use the AdaGrad [8] optimizer, a variant of Stochastic Gradient Descent (SGD), with a minibatch size of 16 and an initial learning rate of 0.001 for all layers. The training process takes about 8 hours.

Evaluations.
In the test stage, following the evaluation method in [52], we select the 10 items with the top probabilities as the final recommendations. We use Recall@10 to measure whether the ground-truth item is in the recommendation list. Recall@10 is defined as Recall@10 = n_hit / n_testcase, where n_hit is the number of test cases in which the ground-truth item i_g is in the recommendation list and n_testcase is the number of all test cases. We further use MRR@10 (Mean Reciprocal Rank) to consider the rank of the ground truth in the recommendation list.
This is the average of the reciprocal ranks of i_g in the recommendation list, where the reciprocal rank is set to 0 if the rank is above 10.
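Both metrics can be computed from ranked recommendation lists in a few lines; the toy lists and targets below are illustrative:

```python
def recall_at_k(ranked_lists, targets, k=10):
    # fraction of test cases whose ground-truth item appears in the top-k list
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k=10):
    # mean reciprocal rank; the reciprocal rank is 0 when the target is beyond rank k
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        if t in ranked[:k]:
            total += 1.0 / (ranked.index(t) + 1)
    return total / len(targets)

ranked = [[1, 2, 3], [4, 5, 6]]   # toy top-3 lists for two test cases
targets = [2, 9]                  # second target is missed entirely
```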

Overall Performance.
The results of the sequential recommendation tasks on LastFM and CiteULike are shown in Table 2. It can be observed that our approach performs the best on both LastFM and CiteULike for all metrics, which demonstrates the effectiveness of our proposed ALI-GRU. Specifically, ALI-GRU obtains significant improvements over Time-LSTM, the best baseline, by 4.70% and 6.55% on average for Recall@10 and MRR@10, respectively. This owes to the superiority of introducing the attention mechanism into RNN-based methods, especially in capturing the contribution of each historical action.

Performance of Cold-Start.
Cold-start refers to the lack of enough historical data for a specific user, which often decreases the quality of recommendations. We analyze the influence of cold-start on the LastFM dataset, and the results are given in Figure 4. In this figure, test cases are counted separately for different numbers of historical actions, where a small number corresponds to cold-start. We can observe that, for cold users with only 5 actions, ALI-GRU performs slightly worse than the state-of-the-art methods.
This is because ALI-GRU considers short-term information as driven signals, which averages the source signal to some extent and leads to less accurate modeling for cold users. As the number of historical actions increases, ALI-GRU achieves significantly better performance than the baselines, which indicates that the bidirectional GRU and the attention mechanism can better model long-term preferences for making recommendations.

Offline Experiments.
We have collected a large-scale dataset from a real-world e-commerce website for further performance evaluation. ALI-GRU has also been deployed online, and the results of an online A/B test are reported in the next section.

Dataset.
User behaviors in this dataset are randomly sampled from the logs of clicking and purchasing over seven days (a week at the beginning of July 2017) on a real-world e-commerce website. The dataset is again formulated as a series of tuples <user_id, item_id, timestamp>.
We focus on the task of personalized search for e-commerce websites. We therefore define positive cases as those purchasing behaviors led by the e-commerce search engine mentioned above, while negative cases are clicks without purchases (if there is no purchase around the click within 5 actions). In total, we have 24,282,032 positive cases, 84,322,922 negative cases, 30,602,427 users, and 10,808,463 items. We randomly select 80% of the users for training, and the remaining users are used for testing. For each positive or negative case i_k in the sequence, our target is to predict whether the user would purchase i_k according to his/her historical behaviors, which is a typical binary classification problem.

Implementation.
Since the number of items in this dataset is very large, it is inconvenient to employ one-hot representations as inputs to the RNN-based models. Instead, we use pretrained embedding vectors of items as inputs and additionally use two fully connected layers, both with 128 nodes, to re-embed the item vectors. We have also followed the wide & deep learning approach in [4] to convert the outputs of the final fully connected layer, whose size is 48, into representations of the corresponding items. For a fair comparison, all RNN-based approaches employ the pretrained item representations as inputs. The hidden state size of the GRU-related layers is 128. We finally use the sigmoid function to predict whether the user would purchase i_k. For training, the loss function is cross-entropy, and the AdaGrad optimizer is employed with a minibatch size of 256 and an initial learning rate of 0.001 for all layers. The entire training process takes about 50 hours.

Evaluations.
In the test stage, we use the Precision-Recall of positive cases to measure performance. Moreover, AUC (Area Under the ROC Curve) is also adopted, which is widely used in imbalanced classification tasks [12]; the larger the AUC value, the better the performance. Table 3 shows the AUC results measuring the overall performance. We observe that all RNN-based methods except Basic-GRU outperform Fossil, which is based on matrix factorization with Markov processes; this indicates the advantage of RNNs for modeling sequential data. Furthermore, to examine the capabilities of the different RNN-based approaches from different views, we also report the Precision-Recall curves shown in Figure 5, make comparisons, and summarize our findings as follows.
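For reference, AUC can be computed directly via its rank-statistic interpretation: the probability that a randomly chosen positive case is scored above a randomly chosen negative case (ties counting half). A pure-Python sketch, suitable only for small data:

```python
def auc(labels, scores):
    """AUC via the probabilistic (Wilcoxon-Mann-Whitney) identity.
    labels: iterable of 0/1; scores: matching predicted probabilities."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative cases")
    # count positive-over-negative wins, with ties worth 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

At the scale of this dataset one would instead sort the scores once and use the rank-sum form (or a library routine), which is O(n log n) rather than quadratic.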

Basic GRU versus Session RNN versus Time-LSTM.
Session RNN and Time-LSTM achieve significant improvements compared to Basic GRU, which is consistent with the previous results on the public datasets. This is due to the limitation of Basic-GRU/Basic-LSTM in modeling complex long-term sequential data. Compared to Session RNN, Time-LSTM achieves better performance in the high-precision range (precision larger than about 0.73), which owes to the advantage of short-term intents for predicting highly confident items. On the contrary, Session RNN outperforms Time-LSTM in the low-precision range (precision lower than about 0.73), since Session RNN introduces a session view to better model contextual information and thus benefits from recalling items based on the user's long-term preferences.

Session RNN versus Time-LSTM versus ALI-GRU.
By adopting a time gate, which is strong at modeling short-term dynamics, and a bidirectional RNN, which is advantageous for modeling long-term information, ALI-GRU better analyzes the complex dependence among items and user intents, together with a novel matrix-form attention mechanism that further enhances performance. ALI-GRU outperforms Session RNN and Time-LSTM by as much as 10.96% and 8.53% in AUC (Table 3), respectively. Observing the Precision-Recall curves, we find that ALI-GRU beats Session RNN and Time-LSTM over the entire range, and the improvement is more significant in the high-precision range. The superior performance of ALI-GRU across various datasets and views demonstrates its efficacy in handling long-term sequential user behaviors with dynamic short-term intents.
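To make the role of attention over the bidirectional hidden states concrete, the following is a generic dot-product attention sketch. It is a stand-in for intuition only: the paper's matrix-form attention has its own parameterization, which is not reproduced here.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, q):
    """Dot-product attention over hidden states H: (T, d), e.g. the
    (bidirectional) GRU outputs for T historical actions, queried by a
    candidate-item vector q: (d,). Returns per-action weights and the
    attended summary of the history. Illustrative stand-in only."""
    weights = softmax(H @ q)        # one weight per historical action
    return weights, weights @ H     # weighted sum of hidden states
```

Under this view, the attention weights are exactly the per-item importances visualized in the heat maps of Figure 6.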

SV1 and SV2 versus Others.
We also show the results of SV1 and SV2 for ablation analyses (Figure 5). Observing the curve of SV1, we find that the previous attention mechanism with bidirectional GRU achieves only a slight improvement over Time-LSTM, which indicates the limitation of previously studied attention mechanisms in capturing the dynamic importance of items in sequential user behaviors. On the contrary, SV2 outperforms both Session RNN and Time-LSTM consistently, especially in the low-precision range. This suggests that our proposed matrix-form attention mechanism with bidirectional GRU has a superior capacity for distinguishing item importance when modeling the user's long-term preferences. Nevertheless, the curve of SV2 drops considerably when precision exceeds about 0.82, where it is merely comparable to SV1 and Time-LSTM.
This is because user behaviors and intents are dynamic with a certain randomness, and a single item is not robust enough for calculating the attention distribution and capturing short-term intents. Last but not least, ALI-GRU yields a consistent performance boost over both SV1 and SV2. This demonstrates the advantages of our carefully designed matrix-form attention with the long-term interval-based GRU framework for modeling sequential user behaviors.

Case Study and Insights.
We present three cases in Figure 6 for a comprehensive study that gives some insights into our proposed approach. Each case consists of one user's historical items ordered by click time and shows the attention heat map for each item, as well as the final item (click or purchase) with the prediction and ground truth.
Case A. The user clicked items of several classes, such as watches, handbags, and dresses, and finally purchased a watch that had been clicked long before. We make a few observations: (1) ALI-GRU gives higher weights to most watches than to the other items, which is consistent with the user's contextual intent (purchasing a watch). This suggests that our proposed approach has the capacity to capture the user's real intents from historical behaviors. (2) The 1st, 3rd, and 5th watches, which are the same as or similar to the finally purchased watch, have higher weights than the other watches, especially the 1st watch, even though it was clicked earliest, a long time before. More interestingly, we observe that the 6th watch is for women, and the user is probably a woman (according to the dresses he/she clicked), yet the 6th watch has the lowest weight among all the watches. These observations indicate that ALI-GRU successfully distinguishes the user's current intent of purchasing a men's watch.

Case B. If items were inherently important, repeated, or low-frequency, models without an attention mechanism might work well, since such models could automatically assign low weights to irrelevant items and vice versa. However, the importance of items and user intents is highly dependent on context and is only consistent to a certain degree. In Case B, the user finally purchased a coat hanger, which belongs to a class he/she had never clicked. Nevertheless, ALI-GRU looks at the context of his/her recent behaviors, conjectures that the intent is possibly something related to laundry, and correctly figures out that this is a positive case.
Case C. This prediction is incorrect according to the ground truth. Observing the user's historical behaviors and attention distribution, we find that ALI-GRU chooses to ignore the various items before the first suit; these actions had a long time interval (about two days) from the later actions. Furthermore, ALI-GRU conjectures that the user wants to buy something for formal wear. Therefore, ALI-GRU predicts a negative case for purchasing a USB cable, which the user in fact finally purchased. In such cases, there exist abrupt yet decisive user intents, which remain a great challenge left for future exploration.

Online Test.
An online test with real-world e-commerce users is carried out to study the effectiveness of our proposed method. In particular, we integrate ALI-GRU into the e-commerce search engine mentioned above, which serves billions of clicks per day. A standard A/B test is conducted online. Users of the search engine are randomly divided into multiple buckets, and we randomly select two buckets for the experiments. For users in bucket A, we use the existing highly optimized ranking solution of the search engine, which performs Learning to Rank (LTR) and Reinforcement Learning (RL) with several effective algorithms such as wide & deep learning and CF prediction. For users in bucket B, we further integrate the results produced by ALI-GRU. Specifically, for a given user, his/her sequential behaviors (clicked items and timestamps) are collected from the entire service, and the user's intent vector is predicted by ALI-GRU in real time. When the user issues a query, we combine the calculated user intent vector with all the retrieved items to calculate purchasing probabilities, similar to the method used in the offline experiments. Finally, we integrate the purchasing probability into the existing ranking strategy.
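A standard way to implement such bucketing is to hash each user id into a fixed number of buckets, so every user deterministically receives the same experience across sessions. The sketch below illustrates this common scheme; the search engine's actual assignment logic and bucket ids are not public, so all names and numbers here are assumptions.

```python
import hashlib

def bucket(user_id, n_buckets=100):
    """Deterministically map a user id to one of n_buckets via a hash,
    so assignment is stable across sessions and servers."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def assign(user_id, control_bucket=7, treatment_bucket=42):
    """Hypothetical bucket ids for the two experimental groups."""
    b = bucket(user_id)
    if b == control_bucket:
        return "A"           # existing LTR/RL ranking only
    if b == treatment_bucket:
        return "B"           # ranking augmented with ALI-GRU scores
    return "unsampled"       # users outside the experiment
```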
The measures for the online A/B test include Gross Merchandise Volume (GMV), user Click-Through Rate (uCTR), Click Conversion Rate (CVR), Per Customer Transaction (PCT), and Unique Visitor Value (UV_Value), which are all frequently used metrics in e-commerce. The test was performed within one week in July 2017. Comparative results are given in Table 4, where the absolute values are omitted for business confidentiality. The results show that ALI-GRU achieves better performance on all the metrics. As expected, uCTR and CVR are improved, which means that users are more likely to click the reranked items and there is a higher probability of purchasing these items.

(Figure 6: Case study of predicting whether the user purchases the next item according to historical sequential behaviors. Each case consists of one user's clicked items ordered by click time and shows the attention heat map for each item and the final item (clicked or purchased) with the prediction and ground truth. Darker color in the heat map indicates a higher attention weight.)

More interesting is the improvement in PCT and UV_Value, which is due to the increase in the number of transactions per purchasing user. This result suggests that our model brings a kind of recommendation functionality into the search engine, as in Case B of Figure 6. In summary, our proposed ALI-GRU consistently improves upon the high-quality baseline of one of the largest online e-commerce platforms, which has been optimized for several years. Such improvements are very important for e-commerce search engine systems and have significant business value. ALI-GRU was adopted into the search engine before this paper was prepared.
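For concreteness, these metrics can be computed from per-visitor aggregates as sketched below. The field names and the exact metric definitions (e.g., uCTR as the fraction of visitors who click, PCT as revenue per purchasing customer) are our assumptions for illustration; production definitions may differ.

```python
def ecommerce_metrics(visitors):
    """Compute common e-commerce metrics from per-visitor aggregates.
    `visitors`: list of dicts with keys 'clicks', 'purchases', 'revenue'
    (one dict per unique visitor; schema is an assumption)."""
    uv = len(visitors)
    clickers = sum(1 for v in visitors if v["clicks"] > 0)
    clicks = sum(v["clicks"] for v in visitors)
    purchases = sum(v["purchases"] for v in visitors)
    buyers = sum(1 for v in visitors if v["purchases"] > 0)
    gmv = sum(v["revenue"] for v in visitors)
    return {
        "GMV": gmv,                                   # total merchandise volume
        "uCTR": clickers / uv if uv else 0.0,         # visitors who click / visitors
        "CVR": purchases / clicks if clicks else 0.0, # purchases per click
        "PCT": gmv / buyers if buyers else 0.0,       # revenue per purchasing customer
        "UV_Value": gmv / uv if uv else 0.0,          # revenue per unique visitor
    }
```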

Conclusions
Modeling user behaviors as sequential learning plays an important role in predicting future user actions, such as in personalized search and recommendation. However, most RNN-based methods assume that the importance of historical behaviors decreases over time and fail to consider the cross-dependence within sequences, which makes them difficult to apply to real-world scenarios. To address these problems, we propose a novel and efficient approach called Attention with Long-term Interval-based Gated Recurrent Units (ALI-GRU) for better modeling of sequential user behaviors. We first propose a bidirectional time interval-based GRU to identify the complex correlations between actions and to capture both long-term preferences and short-term intents of users as driving signals. Then, we design a new attention mechanism to attend to the driving signals at each time step for predicting the next user action. Empirical evaluations on two public datasets for the sequential recommendation task show that ALI-GRU achieves better performance than state-of-the-art solutions. Specifically, ALI-GRU outperforms Session RNN and Time-LSTM by as much as 10.96% and 8.53% in terms of AUC. In addition, online A/B tests in a real-world e-commerce search engine further demonstrate its practical value. Since GRU computations cannot be parallelized across time steps, training the model is time-consuming; in the future, we will adopt a parallel approach to address this problem.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare no conflicts of interest.