Knowledge Graph-Based Enhanced Transformer for Metro Individual Travel Destination Prediction

Accurate and timely destination prediction of subway passengers is of great significance in improving urban residents’ travel efficiency, alleviating urban traffic pressure, and recommending the proper location-based service. Although some individual travel destination prediction methods have been proposed, the prediction performance is poor due to the large difference in travel locations of different individuals, the difficulty of evaluating the individual travel intention, the sparsity of individual travel trajectory data, and other problems. To solve these problems, this paper proposes a knowledge graph-based enhanced Transformer method (KG-Trans) for the metro individual travel destination prediction task (MITD-Pre), which contains three main modules: (1) The knowledge graph (KG) module constructs a multilayer individual travel KG from top to bottom, which accurately describes the travel individuals and their travel intentions. By analyzing the association relationship between nodes in the KG, the relationship between travel individuals can be naturally established. The learned similar travel regularity can solve the problem of sparse travel trajectories of some individuals. (2) The enhanced Transformer module extracts the dynamic and hierarchical features from the long-term sequential travel trajectory data. (3) The classifier module introduces the cross-entropy loss to constrain the uniqueness of the predicted subway travel station. The experimental results show that the proposed method obtains a higher destination prediction accuracy than the previous individual travel destination prediction methods.


Introduction
As an important part of public transport, the urban rail transit produces a large amount of spatiotemporal trajectory data in real time, which contains rich spatiotemporal location information and reflects the travel mode of passengers.
This gives us an opportunity to deeply explore individual travel patterns and regularity. Traffic prediction is a basic and important problem in the field of transportation. Most existing traffic prediction methods focus on traffic flow, speed, and so on [1][2][3]. However, the development of information technology and various intelligent devices now provides strong data and technical support for individual travel destination prediction. The real-time prediction of the travel destination of each individual who stays in the subway station is of great significance for the tracking of individuals, service recommendations, and the construction of the smart city. It is bound to become an important social demand in the era of big data.
At present, a few methods have been proposed in the field of travel destination prediction, such as the Markov model [4], Bayesian model [5], and Gaussian mixture model [6]. These methods predict the travel destination according to the general mobility characteristics of individuals. However, they ignore the differences in individual behaviors between users [7,8] and the problems of individual travel data, such as the high spatial complexity and sparse historical travel trajectory. Therefore, their prediction results are unsatisfactory. In addition, a major challenge remains in the task of individual travel destination prediction: how to accurately grasp the individual travel intention, which may be affected by time, location, and other factors. For example, when individuals travel to the same place, their travel intentions may be different on weekdays and weekends. These challenges are beyond the previous methods.
A knowledge graph (KG) is an advanced carrier of common-sense knowledge and plays an important role in many practical applications. The emergence of the KG provides a new perspective to comprehensively describe individual travel patterns. It has been applied in many studies; e.g., entity portrait and law prediction tasks are achieved by utilizing relationship reasoning and knowledge aggregation. So, the KG provides a new means of accurately quantifying individual travel patterns in public transport. It effectively breaks through the expression limitations of traditional methods based on traffic big data.
Inspired by the KG, this paper integrates deep learning prediction and the KG to achieve accurate travel destination prediction tasks. Specifically, we construct an individual travel KG based on the historical travel data of individuals and then conduct the portrait analysis based on the KG to accurately grasp the individual travel intention. At the same time, for the historical travel trajectory data with the long-term time series and the long-term time dependence, Transformer is used to learn the dynamic and hierarchical characteristics in the sequence data to achieve the final prediction tasks.
This study mainly aims to integrate the KG into the individual travel destination prediction model. The main contributions of this study are summarized as follows:
(i) An individual travel KG is constructed, and we propose a novel individual travel destination prediction method based on such KG, which aims to accurately analyze the individual travel patterns and intentions
(ii) We analyze the travel groups having similar travel trajectories to handle the sparse historical trajectories of some individuals, which is obviously different from the traditional individual travel destination prediction methods
(iii) An enhanced Transformer module is proposed to extract the dynamic and hierarchical features in the history of travel trajectory data
(iv) Experimental results show that the proposed method effectively exploits KG to analyze the subway card data and obtains satisfactory performance

Related Work
In this section, we review several related types of research about the KGs and the individual travel destination predictions.  [13], the music domain graph MusicBrainz [14], and the geographic domain graph GeoNames [15].

Knowledge Graph.
With the development of smart transportation, KG research in the transportation field has become increasingly popular. Zhou and Chen [16] combine the urban KG with the deep spatiotemporal convolution neural network to solve the problem of traffic congestion. Zeng et al. [17] use the KG to mine the relationships between objects and model the causal relationships of the equipment failures of railway trains to ensure the operational safety of the high-speed railway. Muppalla et al. [18] use the KG as an abstraction layer to annotate the traffic incidents collected through various methods. Liu et al. [19] learn the urban traffic characteristics extracted from the urban multisource heterogeneous data and construct a KG to mine the urban mobility patterns. Sun et al. [20] semimanually construct a microblog traffic event KG by integrating multiple types of open-source data and use this traffic KG together with target detection methods to identify traffic events in microblogs and solve traffic problems. Liang et al. [21] use the multilevel planning theory to construct an individual travel KG to accurately identify different types of public transport passengers so as to obtain refined public transport travel characteristics and meet the travel needs of different passengers. Zhang et al. [22] integrate the knowledge of interregional flow, events, and weather to enhance the prediction of population inflow and outflow in each region of the city.
As an advanced knowledge carrier, the KG has an extremely important position in the portrayal of individual travel. Therefore, we construct an individual travel KG and use it to solve individual travel destination prediction problems.

Individual Travel Destination Prediction.
With the development of urbanization, people's travel patterns are gradually diversified. Understanding human behaviors and modeling individual travel behaviors are helpful to explain some complex socioeconomic phenomena, which is of great value in location-based services, traffic planning, public safety, and so on. Traditional trajectory prediction methods mostly use machine learning methods, such as the hidden Markov model [4], mixed hidden Markov model [23], Bayesian inference [5], and Gaussian mixture model [6]. Based on the research of public transport smart card data, Zhao et al. [24] predict the individual daily travel capacity, where the travel chain is defined as a set of travel start time, starting point, and destination. Wang et al. [7] extract a variety of features from the subway card swiping data set to predict the travel destination of passengers who have entered the subway station but not yet left it. Li et al. [25] improve the prediction of individual travel by clustering the group travel pattern. Wang et al. [26] design a new movement feature, i.e., a time shift tensor, to consider the user's transformation pattern in the time dimension and propose the attention Markov model. Mo et al. [27] analyze the passenger activity pattern based on the public transport card swiping data in Hong Kong and propose an input-output hidden Markov model to predict the time and location of an individual's next trip at the same time.
In recent years, the recurrent neural network (RNN) has achieved excellent performance in modeling sequence data. Wu et al. [28] propose a new robust location prediction model that considers individual preference and social interaction, which alleviates the impact of the randomness of location movement and improves the prediction performance. De Brebisson et al. [8] predict the taxi destination by using a multilayer perceptron and a bidirectional recurrent neural network. Lv et al. [29] regard the trajectory as a two-dimensional image to model the trajectory from different perspectives and apply Convolutional Neural Networks (CNNs) to extract multiscale two-dimensional trajectory features for accurate destination prediction. Zhang et al. [30] apply Surprisal-Driven Zoneout (SDZ) to the RNN, which improves the robustness of the destination prediction model and reduces the training time. Based on the Long Short-Term Memory (LSTM) model, Li et al. [31] combine the extracted deep spatiotemporal features with the original features to predict the taxi destination. Xu et al. [32] use an adaptive attention network to model different extracted features of locations and add a time gate and a distance gate to the LSTM to capture the spatiotemporal relationship between consecutive locations.
Although some individual travel destination prediction methods have been proposed, some common problems still seriously affect the prediction effect; for example, it is difficult to grasp the travel intentions of different individuals and handle the sparseness of historical trajectory data of some individuals.

Construction of Individual Travel KG
In this section, firstly, we preprocess the subway card swiping data set to clean out the dirty data, the duplicate redundant data, and so on. Then, the individual travel KG is constructed and displayed visually.

Data Preprocessing.
The original data set consists of the passenger card swiping records collected by the Beijing Metro automatic fare collection (AFC) system in July, August, September, November, and December 2015. Each record contains 11 items, including card ID, subway route, subway station, date, card type, and transaction type. The card ID is the unique identifier of the intelligent transportation card, which is used to identify a unique passenger.
Due to the large volume, the dirty data interference, the missing key items, the coupling of travel records, and other problems in the data set, the card swiping records of each travel individual are scattered in the large data set, which makes it difficult to form a complete travel chain and greatly hinders the mining of passenger travel patterns. Therefore, the original data must be preprocessed to extract the complete travel chain, so as to build a good data foundation for the construction of individual travel KGs, travel law mining, and travel destination prediction.
First, we remove the repeated card swiping records in the original subway data set, as well as the records with the same card swiping time and the same card swiping station. Moreover, because the traffic card type and other information contribute little to the subsequent destination prediction task, we also filter out this information to avoid the interference of redundant information. So, we retain 7 necessary items, including card ID, boarding line, boarding station, alighting line, alighting station, boarding time, and alighting time. In addition, since the research objects are passengers taking the subway as their normal transportation tool, we also remove the passengers who have fewer than 30 card swiping records per month on average and those whose data volume is abnormal. After these operations, 150-400 travel records in five months for each passenger are retained.
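The filtering described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the tuple layout of a record, the function name `preprocess`, the use of (card ID, station, time) as the duplicate key, and the upper bound of 400 records (taken from the retained range) are all assumptions.

```python
from collections import defaultdict

# Assumed record layout (the seven retained items), as a tuple:
# (card_id, board_line, board_station, alight_line, alight_station,
#  board_time, alight_time)

def preprocess(records, n_months=5, min_avg_per_month=30, max_total=400):
    """Drop duplicate swipes, then keep passengers with a plausible record count."""
    seen = set()
    by_card = defaultdict(list)
    for rec in records:
        card_id, _, board_station, _, _, board_time, _ = rec
        # Assumption: a swipe is a duplicate if card, station, and time all match.
        key = (card_id, board_station, board_time)
        if key in seen:
            continue
        seen.add(key)
        by_card[card_id].append(rec)

    min_total = min_avg_per_month * n_months  # 30 per month over 5 months = 150
    return {
        card_id: sorted(recs, key=lambda r: r[5])  # order the travel chain by boarding time
        for card_id, recs in by_card.items()
        if min_total <= len(recs) <= max_total
    }
```

With the default parameters, a passenger is kept only if the deduplicated travel chain has between 150 and 400 records, which matches the range reported above.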

KG Construction.
Accurately grasping the individual's travel intention is the major challenge in the task of individual travel destination prediction, and this travel intention is affected by many factors, for example, time and station. To address this challenge, we construct the individual travel KG from the passengers' travel location, time, date, and so on, so as to accurately analyze the travel individuals and grasp the corresponding travel intention.
In this KG, there are five types of entities: card ID, date, travel date attribute (whether working day or not), route, and subway station and times. The corresponding relationships are divided into five categories as shown in Table 1.
We analyzed the travel data of 8000 individuals in five months. Specifically, after data preprocessing, we chose all the historical travel records of 8000 individuals from tens of millions of people. Then, we extracted the travel record knowledge, and the steps are shown in Algorithm 1.
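As a rough illustration of the knowledge extraction for a single travel record, the sketch below turns one record into (head, relation, tail) triples. The relation names, the function name `extract_triples`, and the time format `"%Y/%m/%d %H:%M"` are hypothetical, since the relationship categories are defined in Table 1. The weekday test is also a simplification of checking a real calendar; known holidays can be passed in explicitly.

```python
from datetime import datetime

def extract_triples(record, holidays=frozenset()):
    """Turn one travel record into (head, relation, tail) triples.

    Relation names are illustrative placeholders, not the ones in Table 1.
    """
    card_id, b_line, b_station, a_line, a_station, b_time, _a_time = record

    # Keep only the date part of the boarding time (assumed time format).
    day = datetime.strptime(b_time, "%Y/%m/%d %H:%M").date()

    # Working day = 1, nonworking day = 0. Monday..Friday is a simplification;
    # public holidays must be supplied through `holidays`.
    is_workday = 1 if day.weekday() < 5 and day not in holidays else 0

    # Stations are written as "line*station", the OD as "origin-destination".
    origin = f"{b_line}*{b_station}"
    destination = f"{a_line}*{a_station}"
    od = f"{origin}-{destination}"
    date = day.isoformat()

    return [
        (card_id, "travels_on", date),
        (date, "has_date_attribute", is_workday),
        (date, "has_route", od),
        (od, "starts_at", origin),
        (od, "ends_at", destination),
    ]
```

The travel frequency of a card ID is not shown here; under the same assumptions it would simply be the number of records of that card ID in the chosen time frame.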

Visualization of KG.
Through the above knowledge extraction steps, we get the structured data and then visually display the graph using Neo4j. Taking the travel record of Card No. 787931 in July as an example, our constructed graph is shown in Figure 1.

Methodology
In this section, we propose a knowledge graph-based enhanced Transformer for the MITD-Pre, namely, KG-Trans. The model framework is shown in Figure 2. Firstly, we identify and extract the travel individuals with similar routes based on the constructed individual travel KG. Then, the same class of data is sent to the Transformer to train the model. Finally, the prediction results are compared to the real values stored in the KG to construct the loss function.
Journal of Advanced Transportation

Relationship Analysis between Travel Individuals Based on KG.
Due to the large behavioral differences between individuals and the sparseness of the travel records of some individuals, the accuracy of individual travel destination prediction is seriously affected. In view of this phenomenon, we improve the effectiveness of the individual travel destination prediction through the "group effect." Here, we define a group as "people with similar routes." For example, for a group of commuters who live in Xierqi, Beijing, and work in Zhongguancun, Beijing, the subway routes on weekdays are very similar. This paper discovers the individuals having similar routes by analyzing the KG.
ALGORITHM 1: Knowledge extraction.
Step 1: get the frequency of travel. The frequency of travel of an individual (ID) is the total number of travel records of the ID within a specified time frame in the data set.
Step 2: get the travel date. In the individual travel card swiping record, the travel time is expressed as "year/month/day: hour: minute." We only take the date information and ignore the hour information.
Step 3: get the date attribute. For the travel date obtained in Step 2, we check the calendar to determine whether it is a working day. The working day is marked as 1, and the nonworking day is marked as 0.
Step 4: get the origin-destination (OD) records. The storage format of one complete travel record is "ID, boarding time, boarding route, boarding station, alighting time, alighting route, and alighting station." Then, we extract the travel OD as "boarding line * boarding station - alighting line * alighting station."
Step 5: get the subway station. As shown in Step 4, the subway station is expressed in the form of "line * station."

A KG has many nodes and edges that contain complex information and rich semantics. We aim to find the correlation between nodes in the graph so that we can infer the correlation between the individuals. For example, some passenger routes share many nodes; as shown in Figure 3(a), we can judge that the two passenger routes are very similar. Other passenger travel routes have few or no shared nodes; as shown in Figure 3(b), the similarity between the two passenger routes is considered relatively weak. In this paper, after analyzing the five-month travel data of 8000 individuals in detail and considering the impact of classification accuracy on the prediction effect, we require the members of the same category to satisfy the following rules:
(i) For members in the same category, the difference in travel frequency recorded by card swiping should be less than 50
(ii) For members in the same category, they should have more than 70% similar routes between the subway stations
Then, on the constructed graph, we first divide the travel records of 8000 people into five length intervals, 150-200, 200-250, 250-300, 300-350, and 350-400 (named group 1 to group 5), according to the travel chain length. The node similarity between the members of each length interval is calculated according to the definition of route similarity, and finally, we obtain p classes of travel groups (p ≥ 5). In these p travel groups, the members of each group are "route similar members" to each other. In this way, by feeding each class of members into the model for training, we can alleviate the problem of sparse historical travel routes of some travel individuals by exploiting their similarity to other members of the same group.
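The two membership rules could be checked along the following lines. This is only a sketch under stated assumptions: route similarity is taken here to be the Jaccard index over the distinct OD routes of two passengers, and the groups are formed greedily by comparing each passenger with the first member of every existing group. The paper does not spell out either choice, and all function names are hypothetical.

```python
def route_similarity(routes_a, routes_b):
    """Share of distinct OD routes that two passengers have in common.

    Assumption: Jaccard index over the sets of OD strings.
    """
    a, b = set(routes_a), set(routes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def same_category(chain_p, chain_q, max_freq_gap=50, min_similarity=0.7):
    """Rule (i): frequency gap below 50. Rule (ii): more than 70% similar routes."""
    freq_ok = abs(len(chain_p) - len(chain_q)) < max_freq_gap
    return freq_ok and route_similarity(chain_p, chain_q) > min_similarity

def group_passengers(travel_chains):
    """Greedily assign each card ID to the first group whose first member matches."""
    groups = []
    for card_id, chain in travel_chains.items():
        for group in groups:
            if same_category(travel_chains[group[0]], chain):
                group.append(card_id)
                break
        else:  # no existing group matched, so open a new one
            groups.append([card_id])
    return groups
```

Here `travel_chains` maps a card ID to its list of OD strings, so the travel frequency is simply the length of that list. In the setting described above, such a grouping would be applied within each of the five length intervals.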

Transformer-Based Individual Travel Destination Prediction.
The Transformer is an effective method to process sequence data. Its multihead attention mechanism and stacking layer learn the dynamic and hierarchical characteristics of the sequence data. Therefore, the Transformer can predict the traffic flow with long-term time series and long-term time dependence very well. Considering that the card swiping data of subway passengers have the properties of long-term time series and long-term time dependence, the Transformer is naturally selected as an important module in the individual travel prediction model in this paper. The basic structure of the Transformer used in our model is shown in Figure 2. Its core module is the multihead attention layer. Firstly, the input of the model is the card swiping records of passengers with similar routes, i.e., $R = \{X_1, X_2, \ldots, X_m\}$, where $m$ denotes the number of passengers and $X_i = \{x_0, x_1, \ldots, x_n\}$ represents the historical trajectory sequence of the $i$-th passenger. The multihead attention (MH) layer adopts different linear mappings to project the input sequence elements to the query, key, and value, i.e., a tuple $(Q, K, V)$. The output is the corresponding weighted sum of values. The weight assigned to each value is calculated through the compatibility function of the query and the corresponding key. The application of attention can be expressed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where $\mathrm{Attention}(\cdot)$ calculates the attention of the input data and $\mathrm{softmax}(\cdot)$ is an activation function. To establish a single-head attention module, each node can have three subspaces, namely, the query subspace $Q \in \mathbb{R}^{N \times d_k}$, key subspace $K \in \mathbb{R}^{N \times d_k}$, and value subspace $V \in \mathbb{R}^{N \times d_k}$, where $d_k$ is the dimension of queries, keys, and values.
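To make the weighted-sum reading of attention concrete, the following is a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product form, with $Q$, $K$, and $V$ given as lists of row vectors; the function names are illustrative and the code is not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for row-major lists of vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

When two keys are equally compatible with a query, their values receive equal weights, so the output row is the average of the two value vectors.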
In the global encoder, the input features are projected into the high-dimensional subspace and the learnable mapping is realized through the feedforward neural network, which can be expressed as
$$Q = XW_q, \quad K = XW_k, \quad V = XW_v,$$
where $X$ is the input feature and $W_q$, $W_k$, and $W_v$ are the learnable parameters. The multihead attention network uses $h$ feedforward neural networks to linearly project $Q$, $K$, and $V$, which achieves a multihead mechanism. In this case, the model can attend to the information of different representation subspaces from different stations, which can be expressed as
$$\mathrm{MH}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O},$$
where $\mathrm{MH}(\cdot)$ calculates the multihead attention of the input data, $\mathrm{Concat}(\cdot)$ concatenates the input data, and the $i$-th head $\mathrm{head}_i$ can be expressed as
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$$
where the mapping matrices are $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$. The model output is denoted by $x_{n+1}$, which is the travel destination of an individual. Finally, we use the softmax layer to give each station a probability $P$:
$$P(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{z}\exp(x_j)},$$
where $x_i$ is the feature representation of the $i$-th station and $z$ is the total number of stations. The station with the largest probability is the final prediction result of the proposed model.
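The final step, turning per-station scores into probabilities and picking the most likely station, might look like the sketch below. The function names and the representation of stations as a list of labels are assumptions made for illustration.

```python
import math

def station_probabilities(scores):
    """Softmax over the z station scores, so the probabilities sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_destination(scores, stations):
    """Return the station with the largest probability and that probability."""
    probs = station_probabilities(scores)
    best = max(range(len(stations)), key=lambda i: probs[i])
    return stations[best], probs[best]
```

Because softmax is monotonic, the predicted station is the one with the largest score; the probabilities matter mainly for the loss and for ranking the candidate stations.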

Loss Function.
The individual travel destination prediction problem is different from the previous OD prediction and taxi destination prediction problems. The performance of OD or taxi destination prediction tasks is evaluated by measuring the error of the predicted longitude and latitude. In contrast, an individual destination prediction result is either correct (1) or incorrect (0). Therefore, the cross-entropy loss function is selected as the loss function in this paper.
Through the current card swiping records of a travel individual, we want to predict the travel destination of the individual. This problem can be regarded as a probability problem in which we give a probability to each station and the station with the largest probability is the destination prediction result. Cross-entropy mainly reflects the distance between the actual output (probability) and the expected output (probability). Furthermore, the smaller the value of cross-entropy, the closer the two probability distributions are. Assuming that the probability distribution $p$ is the expected output, the probability distribution $q$ is the actual output, and the cross-entropy $H(p, q)$ measures the distance between $p$ and $q$, we have
$$H(p, q) = -\sum_{x} p(x)\log q(x).$$
When training the deep learning network, given the input data and labels, the real probability distribution $p(x)$ is determined.
Therefore, the formula of the cross-entropy commonly used in deep learning is formulated as follows:
$$L = -\sum_{i=1}^{z} p(x_i)\log q(x_i),$$
where the sum runs over the $z$ candidate stations.
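As a small numerical illustration, the sketch below evaluates the cross-entropy for one prediction, assuming that the expected output is a one-hot distribution over the stations (1 for the true destination, 0 elsewhere). The function names and the clipping constant `eps` are illustrative choices.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_x p(x) * log q(x), with q clipped away from zero."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

def destination_loss(probs, true_index):
    """Cross-entropy between a one-hot label and the predicted station probabilities."""
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(probs))]
    return cross_entropy(one_hot, probs)
```

With a one-hot label, only the term of the true destination remains, so the loss reduces to the negative log of the probability assigned to that station and shrinks as that probability approaches 1.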

Experiment Settings.
In this paper, the proposed destination prediction model is compared with five baseline prediction methods: Markov, LSTM, GRU, CNN, and FNN.
Markov [4]: the Markov model is a statistical analysis model, which is widely used in speech recognition, automatic part of speech tagging, sequence classification, sequence prediction, and other applications.
LSTM [33]: Long Short-Term Memory (LSTM) is a special version of the RNN that performs better on longer sequences. It alleviates the problems of gradient vanishing and gradient explosion in the process of long sequence training.
GRU [34]: the Gated Recurrent Unit (GRU) is an effective variant of the above LSTM network. It has a simpler structure and better effects than the above LSTM network and still solves the long-term dependency problem in the RNN. Therefore, it is an important mainstream network at present.
CNN [35]: the Convolutional Neural Network (CNN) is a feedforward neural network. For some sequence processing problems, the effect of the one-dimensional Convolutional Neural Network is comparable to that of the RNN, while the computational cost is usually much lower.
FNN [36]: the feedforward neural network (FNN), also known as a multilayer perceptron (MLP), contains multiple fully connected hidden layers. The complex mapping from input space to output space is realized by aggregating multiple simple nonlinear functions.
We conducted all experiments on a GPU workstation with a 2080 Ti GPU and 11 GB of memory. During the training process, the number of epochs is set to 200; the batch size is set to 32; the learning rate is set to 0.001; the number of self-attention heads in the Transformer is set to 8; the dimension of the input vector is set to 6; and the dropout rate is set to 0.3. All experiments in this paper use 3 historical observations to predict the next point.
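The "3 historical observations to predict the next point" setting can be read as a sliding window over each travel chain. The sketch below builds such samples; it assumes that an observation is one entry of the passenger's trajectory sequence, and the function name `make_samples` is hypothetical.

```python
def make_samples(trajectory, history=3):
    """Pair every `history` consecutive observations with the one that follows."""
    return [
        (trajectory[i:i + history], trajectory[i + history])
        for i in range(len(trajectory) - history)
    ]
```

A travel chain with n observations therefore yields n - 3 training samples, and chains with 3 or fewer observations yield none.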

Evaluating Indicator.
To fairly evaluate the experimental performance, we adopt the evaluation indicator Hit@n that is used in the KG representation learning method TransE [37]. During the testing procedure, the prediction results are sorted in descending order, and we check whether the first n results contain the correct option. If so, we increase the hit value by 1; otherwise, the procedure proceeds to the next cycle. In other words, except for Hit@1, we do not require the first result to be right, as long as the correct result appears among the first n results. The final accuracy ACC is calculated as follows:
$$\mathrm{ACC} = \frac{\mathrm{hit}}{\mathrm{hit} + \mathrm{FP}},$$
where hit is the number of travel stations that are predicted correctly and FP is the number of incorrectly predicted results.
Table 2 shows the destination prediction results of the above five compared methods on the card swiping data set of Beijing Metro. The data set is described in Section 3.1 in detail. It can be seen that our KG-Trans is clearly superior to the other methods in all indicators. Group 1-Group 5 in the experimental data set are ordered from fewer to more travel records per individual (the detailed description is in Section 4.1). The experimental results show that the prediction of KG-Trans becomes more accurate as the number of travel records increases.
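The Hit@n indicator and the ACC ratio defined in this subsection could be computed as in the following sketch. It assumes that each prediction is a mapping from station to probability; the function name `hit_at_n` is hypothetical.

```python
def hit_at_n(predictions, true_destinations, n):
    """ACC = hit / (hit + FP), where a hit means the truth is among the top n stations."""
    hit = fp = 0
    for scores, truth in zip(predictions, true_destinations):
        # Sort stations by predicted probability, largest first, and keep the top n.
        top_n = sorted(scores, key=scores.get, reverse=True)[:n]
        if truth in top_n:
            hit += 1
        else:
            fp += 1
    total = hit + fp
    return hit / total if total else 0.0
```

By construction, Hit@n can only stay the same or increase as n grows, since the top n stations always include the top n - 1.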

Prediction Performance of KG-Trans.
For the destination prediction problem solved in this paper, our data set has long travel sequences of passengers. Although LSTM can mine the long-term sequence characteristics hidden in the data, its gradient vanishes and affects the prediction results when the sequence length exceeds a certain limit. Moreover, the passenger card swiping data are sequence data, but the time intervals between records in the sequence have no obvious regularity. For the feature learning of such data, LSTM still has difficulty capturing the time correlation.
In Table 2, we can see that the prediction effect of GRU is the worst. We believe the reason is that GRU omits the middle layer compared to LSTM. Although this operation reduces parameters and lowers the risk of over-fitting, the middle layer plays a key role in the long-term sequence feature extraction. Therefore, GRU performs worst in our passenger travel prediction scenario. The prediction effect of CNN is similar to that of LSTM. The receptive field of CNN has a fixed size, and it processes local information very effectively. It can capture some local specific features, while its ability to capture global features is worse than that of LSTM.
Apart from our model, FNN has the best experimental results. FNN is a fully connected network, and its computational complexity is very large, which results in low training efficiency. In addition, FNN can only learn the high-order combined features, while the low-order features are not modeled.
Since the Markov model depends heavily on the data of the previous time step, it learns the characteristics of long-term time series poorly, so its prediction of individual travel destinations is not good.
Compared with the above methods, with the help of the accurate portrait analysis of passengers in our individual travel KG, we can identify and extract the passengers with similar routes and put them into different deep learning models for training. The KG addresses the problem of sparse individual travel trajectories by exploiting the characteristics of similar passenger routes. In addition, the Transformer has the multihead self-attention mechanism and the stacking layer, which give it stronger structural flexibility and allow it to capture a wider range of time correlation. The Transformer achieves a very good prediction effect for the individual travel data series with long-term time series and long-term time dependence.

Conclusion
In this paper, a knowledge graph-based enhanced Transformer for the metro individual travel destination prediction method is proposed. The constructed individual travel KG is used to accurately analyze the travel individuals. Then, the Transformer's outstanding sequential learning ability is used to capture the sequence information in the individual travel chain. The test results on the AFC card swiping data set of Beijing Metro show that our method can well learn the regularity and characteristics of passenger travel records and that its prediction performance is greatly improved compared with previous studies. In future work, our model should also consider the influence of more external factors (the weather, the road network, and traffic events) and build more levels of traffic KG to improve the prediction accuracy. Moreover, in terms of individual travel destination prediction, in addition to the next travel location, the next travel time will also be the focus of our next research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.