MDST-DGCN: A Multilevel Dynamic Spatiotemporal Directed Graph Convolutional Network for Pedestrian Trajectory Prediction

Pedestrian trajectory prediction is an essential but challenging task. Social interactions between pedestrians have an immense impact on trajectories. A better way to model social interactions generally achieves a more accurate trajectory prediction. To comprehensively model the interactions between pedestrians, we propose a multilevel dynamic spatiotemporal digraph convolutional network (MDST-DGCN). It consists of three parts: a motion encoder to capture the pedestrians' specific motion features, a multilevel dynamic spatiotemporal directed graph encoder (MDST-DGEN) to capture the social interaction features of multiple levels and adaptively fuse them, and a motion decoder to produce the future trajectories. Experimental results on public datasets demonstrate that our model achieves state-of-the-art results in both long-term and short-term predictions for both high-density and low-density crowds.


Introduction
The task of pedestrian trajectory prediction is to predict pedestrians' future trajectories given their historical trajectories in the scenario. Pedestrian trajectory prediction plays a notable role in many applications, such as automatic driving [1] and robot navigation [2][3][4][5]. To predict an accurate trajectory, considering only the historical trajectory of the target pedestrian is not enough. Other pedestrians' influences on the target pedestrian, which are called "social interaction features," can often help make a better prediction. With a longer prediction horizon and denser crowds, the temporal correlations in the trajectories between current and previous time steps grow weaker and the impact of interactions on pedestrians' motion grows stronger.
To model social interactions, traditional methods use rule-based functions [6][7][8][9][10]. Rule-based methods can capture only simple interactions, whereas data-driven methods use neural networks to automatically extract social interaction features from the data and can therefore exploit them more effectively. Many data-driven methods obtain social interaction features through pooling [11][12][13][14] or attention mechanisms [1][15][16][17][18][19][20]. Graph convolutional neural networks have developed rapidly in recent years, and the graph structure is naturally suitable for directly describing the interactions between pedestrians. As a result, graph convolutional neural networks [21][22][23][24][25] have achieved excellent results in pedestrian trajectory prediction.
Although there are many graph convolutional neural network-based methods, they do not make full use of the graph structure. For example, Social-BiGAT [21] only uses the graph representation as the pooling mechanism on the states of the recurrent neural networks. The newer methods STGAT [22] and Social-STGCNN [23] constructed spatiotemporal graphs to model social interactions and achieved excellent results in predictions.
However, they ignore a crucial point: even if the social interactions with nearby pedestrians and with distant pedestrians are of the same type, they result in different actions of the target pedestrian. As shown in Figure 1, at time steps $t_1$ and $t_2$, when the target pedestrian marked with the red circle avoids the nearby pedestrians and the distant pedestrians, respectively, his avoidance movements are different. The former is a sudden avoidance producing a trajectory with high curvature, while the latter is an early avoidance producing a trajectory with low curvature. Moreover, as the prediction horizon increases, pedestrians far from the target pedestrian may become more important. From time step $t_1$ to $t_2$, pedestrian B has little impact on the target pedestrian, but over the total period from $t_1$ to $t_3$, their merging is the main factor affecting the target pedestrian's trajectory. In other words, the influence of nearby pedestrians is mainly sudden and short-term, while faraway pedestrians have long-term effects on the target pedestrian's movement tendency.
Most previous methods [21][22][23][24][25] use a single graph to model these two types of influences and tend to capture "average social interaction features." However, these two types of influences are better modeled separately at different levels of a multilevel graph. Besides, many methods [23,25] build an undirected graph to model social interactions. However, social interactions between pedestrians are nonsymmetrical. Therefore, a digraph is more suitable for social interactions. Other methods [24] build a directed graph by predefined rules, such as inserting edges from all people inside the view area. But predefined rules are incomplete. For example, a pedestrian may slow down to wait for his companion without looking at him. Thus, a data-driven way to build a directed edge is preferable.
To address the limitations of these works, we propose a multilevel dynamic spatiotemporal directed graph representation to model the interactions between pedestrians comprehensively. In our graph, different levels model interactions of pedestrians at different distance ranges. As shown in Figure 1, whether there is a spatial edge from a pedestrian to the target pedestrian at a level depends on whether their distance is within the corresponding distance range. Over time, the spatial edge between two pedestrians may break at one level and form at another. Even if the edge remains at the same level, the influence of the neighbour also changes dynamically over time. To process the multilevel graph, we propose a multilevel dynamic spatiotemporal digraph convolutional network (MDST-DGCN). At each level of the graph, we use a node aggregator architecture to generate social interaction embeddings by sampling and aggregating features from a node's spatial neighbourhood, as in GraphSAGE [26]. Because social interactions are location independent, we perform an aligning operation before aggregating features, which improves performance significantly.
Through the orderly use of sampling, aligning, and aggregating, the aggregator architecture becomes a naturally data-driven way to describe a directed edge. For each level of the graph, after the spatial interactions are captured, an LSTM [27] is used to capture the temporal correlations of interactions. MDST-DGCN then fuses the interaction features of all levels adaptively. Through modelling social interactions at different levels, our multilevel dynamic spatiotemporal digraph convolutional network (MDST-DGCN) can fully extract pedestrians' social interaction features.
In summary, our contribution is twofold. First, we propose a dynamic spatiotemporal graph with a multilevel concept that separates pedestrian nodes by their distance to the target pedestrian. Because a neighbour's effect on the trajectory varies with this distance, partitioning pedestrian distances into levels may aid the extraction of social interaction features. Second, we create an aggregator based on GraphSAGE that converts the original static adjacency graph structure into a dynamic directed graph structure by sampling, aligning, and aggregating, reducing the effect of individual coordinates on the model and the overfitting phenomenon. We verified the performance of the model on the general pedestrian trajectory datasets.
The experimental results show that our model has achieved state-of-the-art results in both long-term and short-term predictions for both high-density and low-density crowds.

Pedestrian Trajectory Prediction.
Pedestrian trajectory prediction has become a focal task in recent years, and corresponding solutions have been springing up. Comprehensively modelling the interactions between pedestrians is a crucial point to obtain better prediction results.
Traditionally, researchers created hand-crafted functions [3, 6-10] to predict trajectories, but hand-crafted functions are limited, so they are unable to model all types of social interactions. Recently, deep learning-based methods have become popular because they can learn to model various interactions from data. Some researchers designed their methods based on pooling mechanisms [11][12][13][14] to capture dependencies between pedestrians. S-LSTM [11] introduces a "social" pooling layer which allows the LSTMs of spatially proximal sequences to share their hidden states with each other. Group-LSTM [12] adjusts the pooling layer by dropping the information of pedestrians who are moving coherently with the target pedestrian. MX-LSTM [13] has a pooling layer which exploits the Vislet information. These three pooling methods only consider the pedestrians in the local area and fuse their features with equal weight, while SGAN [14] introduces a pooling module that considers all pedestrians in a computationally efficient way and adaptively selects their features with a max-pooling operation.
As the graph structure is naturally suitable for directly describing the interactions between pedestrians, graph convolutional neural networks are introduced to this task. Social-BiGAT [21] replaced the pooling mechanisms with the graph attention network, which also works on the hidden states of LSTMs. In other words, Social-BiGAT did not model the whole duration of the crowds' interactions as a spatiotemporal graph but only used the graph attention network to capture the spatial social interactions. Social-STGCNN [23] and STGAT [22] both constructed spatiotemporal graphs to model social interactions. However, the graph of Social-STGCNN is a complete undirected graph. It does not conform to the asymmetry of pedestrian interactions. Zhang et al. [24] built a directed graph by inserting edges from all people inside the view area. However, all of these graphs model all the social interactions at only one level. Instead, we build a multilevel dynamic spatiotemporal directed graph to overcome their limitations.

Graph Convolutional Neural Network.
Graph convolutional neural networks are an emerging topic in deep learning research, and they provide a practical approach to processing graph data with nongrid structures. We can divide graph convolutional neural networks into spectral approaches [30][31][32] and spatial approaches [26,33,34]. Spectral approaches work with a spectral representation of the graphs, while spatial approaches define convolutions directly on the graph, operating on groups of spatially close neighbours. The learned filters of spectral approaches depend on the Laplacian eigenbasis, which depends on the graph structure. Thus, a model trained on a specific structure cannot be directly applied to a graph with a different structure. However, the graph used to model pedestrians' social interactions changes with time. Thus, spectral approaches are not suitable for pedestrian trajectory prediction, and our approach belongs to the spatial approaches.
In fact, our approach follows the methodology of GraphSAGE [26]. However, our graph is a multilevel dynamic spatiotemporal directed graph, while GraphSAGE can only process a fixed spatial graph without multiple levels. ST-GCN [34] built a dynamic spatiotemporal graph to automatically learn both the spatial and temporal patterns of human actions to recognize skeleton-based actions. Social-STGCNN [23], which is a variant of ST-GCN that builds a single-level undirected graph to model all the social interactions, has achieved excellent results in pedestrian trajectory prediction.

Problem Definition.
Given the historical trajectories of all pedestrians in the scenario, the task of trajectory prediction is to predict their future trajectories simultaneously. The notations $p_1, p_2, \ldots, p_N$ represent the $N$ pedestrians in the scenario. The position of a specific pedestrian $p_i$ at time step $t$ is denoted as $X_i^t$. Given the positions at the observed time steps $t \in [1, T_{obs}]$, our goal is to predict the positions of pedestrians at any future time step $t \in [T_{obs} + 1, T_{obs} + T_{pred}]$. The first-order difference trajectory of a pedestrian $p_i$ is defined as $\Delta X_i^t = X_i^t - X_i^{t-1}$.

Overall Model.
As shown in Figure 2(b), MDST-DGCN consists of three parts: a motion encoder, a multilevel dynamic spatiotemporal directed graph encoder (MDST-DGEN), and a motion decoder. The motion encoder is used to capture the pedestrian-specific motion features, and the MDST-DGEN is used to capture the social interaction features. We construct a multilevel dynamic spatiotemporal digraph processed by the MDST-DGEN to model the social interactions between pedestrians. After the motion features and social interaction features are extracted, they are fed into the motion decoder to predict future trajectories.

Graph Construction.
We construct a multilevel dynamic spatiotemporal directed graph to model the multilevel social interactions between pedestrians. The nodes of the graph are the pedestrians in the scenario. Given the hyperparameter level distance list $\{d_1, d_2, \ldots, d_K\}$, we construct a graph with $K$ levels. At each time step, if the distance from node $v_j$ to node $v_i$ is more than $d_{k-1}$ and less than $d_k$, a spatial edge from $v_j$ to $v_i$ exists in the $k$th ($k \in [1, K]$) level. Specifically, in the 1st level, a spatial edge exists when the distance is less than $d_1$. For each node at all levels, we add a spatial self-loop edge. Figure 2(a) shows how to build a two-level spatial graph with the level distance list $\{d_1, +\infty\}$ at a certain time step. In addition to spatial edges, there are temporal edges, which connect the same pedestrian in consecutive frames. If there is only one level and $d_1 = +\infty$, the graph degrades into a complete graph, which has the same structure as that of STGAT. At time step $t$, the attribute of node $v_i^t$ is the position $X_i^t$ of pedestrian $p_i$.
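The level assignment rule above can be sketched for a single time step as follows. This is a minimal illustration, not the authors' implementation: the function name `build_level_edges` is hypothetical, and the handling of a distance that falls exactly on a boundary (assigned here to the farther level) is an assumption, since the text only specifies strict inequalities.

```python
import math


def build_level_edges(positions, level_distances):
    """Return one set of directed edges (j, i), meaning v_j -> v_i, per level.

    positions: list of (x, y) tuples, one per pedestrian, at one time step.
    level_distances: increasing upper bounds d_1 < ... < d_K; the last one
        may be math.inf.
    Assumption: a distance exactly equal to a bound goes to the farther level.
    """
    n = len(positions)
    levels = [set() for _ in level_distances]
    for i in range(n):
        # Every node gets a self-loop at every level.
        for level in levels:
            level.add((i, i))
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(positions[i], positions[j])
            lower = 0.0
            for k, upper in enumerate(level_distances):
                if lower <= dist < upper:
                    levels[k].add((j, i))
                    break
                lower = upper
    return levels


# Four pedestrians on a line, with the level distance list {1, 5, +inf}.
levels = build_level_edges([(0, 0), (0.5, 0), (3, 0), (10, 0)],
                           [1.0, 5.0, math.inf])
```

In this example the pedestrian at 0.5 is a first-level neighbour of the pedestrian at the origin, the one at 3 is a second-level neighbour, and the one at 10 is a third-level neighbour.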

Motion Encoder.
The motion encoder is used to extract pedestrian-specific motion features. The input is the first-order difference trajectory $\{\Delta X_i^t \mid t \in [1, T_{obs}]\}$. The motion encoder is composed of a linear layer and an LSTM. The linear layer transforms $\Delta X_i^t$ into a higher-dimensional vector, which is then fed into the LSTM to get a motion feature vector. For each pedestrian $p_i$, the process can be formulated as
$$h_{mo}^{t}(i) = \mathrm{LSTM}_{mo}\left(h_{mo}^{t-1}(i),\ \mathrm{Linear}\left(\Delta X_i^t;\ W_{en}\right);\ W_{mo}\right).$$
Here, $W_{en}$ denotes the trainable weights of the linear layer, $W_{mo}$ denotes the trainable weights of the LSTM ($\mathrm{LSTM}_{mo}$), and the hidden states of $\mathrm{LSTM}_{mo}$ at the previous time step and the current time step of pedestrian $p_i$ are denoted as $h_{mo}^{t-1}(i)$ and $h_{mo}^{t}(i)$, respectively. Finally, the motion encoder obtains each pedestrian's motion feature vector $h_{mo}^{T_{obs}}(i)$, which is denoted as $h_{mo}(i)$ in the following sections.
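As a small illustration of the encoder input, the sketch below computes a first-order difference trajectory from absolute positions and inverts it again. The helper names are hypothetical. Note that $T$ positions yield $T - 1$ offsets here; how the offset at the first observed time step is obtained is not stated in the text, so this sketch simply omits it.

```python
def first_order_difference(trajectory):
    """Offsets between consecutive positions of one pedestrian.

    trajectory: list of (x, y) positions ordered by time step.
    """
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]


def positions_from_offsets(start, offsets):
    """Rebuild absolute positions by accumulating offsets from `start`."""
    positions = [start]
    for dx, dy in offsets:
        x, y = positions[-1]
        positions.append((x + dx, y + dy))
    return positions


observed = [(0, 0), (1, 2), (3, 3)]
offsets = first_order_difference(observed)          # [(1, 2), (2, 1)]
restored = positions_from_offsets(observed[0], offsets)
```

The same accumulation step is what turns predicted position offsets back into predicted positions.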

MDST-DGEN.
MDST-DGEN is a crucial component of our model. It processes the multilevel dynamic spatiotemporal directed social graph to obtain the social interaction features. If the graph is of K levels, MDST-DGEN will have K DGCN-LSTMs to process each level of it and an MSFM to fuse the features extracted from each level. In our implementation, K DGCN-LSTMs share the weights, so increasing the number of levels does not increase the parameters of the model.

DGCN-LSTM.
After building the multilevel graph, each level of the graph is fed into a DGCN-LSTM. A DGCN-LSTM consists of a node aggregator architecture to process the spatial edges and an LSTM to process the temporal edges. We follow the design of GraphSAGE [26], which processes graphs by sampling and aggregating. Our node aggregator architecture generates embedding by sampling, aligning, and aggregating features from a node's spatial neighbourhood at each level.
3.6.1. Sampling. Because the number of pedestrians differs between scenes, we bring the number of neighbours of every node to a fixed number $m$ by uniform sampling, so that all nodes of different graphs can be processed in parallel. Here, if there is an edge from node $v_j$ to node $v_i$, $v_j$ is a neighbour of $v_i$. We denote the $m$ neighbours of any node $v$ as the neighbourhood set $N(v)$.
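A minimal sketch of fixed-size neighbour sampling is given below. It assumes, in the spirit of GraphSAGE, that a neighbourhood smaller than $m$ is padded by drawing existing neighbours uniformly with replacement, and that a larger one is subsampled without replacement; the function name `sample_neighbours` and this exact policy are assumptions, not details confirmed by the text.

```python
import random


def sample_neighbours(neighbours, m, rng=random):
    """Return exactly m neighbour indices drawn uniformly from `neighbours`.

    neighbours: non-empty list of node indices (every node has a self-loop,
        so its neighbourhood is never empty).
    rng: the `random` module or a `random.Random` instance.
    """
    if len(neighbours) >= m:
        # Enough neighbours: subsample without replacement.
        return rng.sample(neighbours, m)
    # Too few neighbours: keep all of them and pad with uniform repeats.
    return list(neighbours) + rng.choices(neighbours, k=m - len(neighbours))


rng = random.Random(0)
padded = sample_neighbours([3, 7], 5, rng)   # length 5, only 3s and 7s
```

Padding with repeated neighbours does not change the result of an element-wise max aggregation, because duplicating an element never changes a maximum.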

3.6.2. Aligning. For the node $v_i$, its attribute is the pedestrian's position $X_i^t$, and the attributes of its neighbourhood set can be denoted as $\{X_j^t \mid \forall v_j \in N(v_i)\}$. Social interaction is location independent, so we design an aligning operation to make the node aggregator architecture more generalizable. After aligning, the aligned attributes of any node $v_i$'s neighbourhood set can be denoted as $\{X_j^t - X_i^t \mid \forall v_j \in N(v_i)\}$. Intuitively, the alignment operation moves the origin of coordinates to the position of node $v_i$.

3.6.3. Aggregating. After the aligning, we aggregate the aligned attributes of $v_i$'s neighbourhood set to obtain the new feature embedding of $v_i$, which is given by
$$\mathrm{MAX}\left(\left\{ f\left(X_j^t - X_i^t\right) \mid \forall v_j \in N(v_i) \right\}\right),$$
where MAX is the max operator that takes the elementwise max of the transformed attribute vectors $\{f(X_j^t - X_i^t) \mid \forall v_j \in N(v_i)\}$ and $f$ is a trainable linear mapping that converts a low-dimensional vector to a high-dimensional one. We implement the max operator with a max-pooling layer. Through the orderly use of sampling, aligning, and aggregating, our model meets the requirement of a directed graph that the relation between two nodes is asymmetric.
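The aligning and aggregating steps can be sketched in plain Python as below. The fixed `weight` and `bias` stand in for the trainable mapping $f$ (the bias term is an assumption), and the function names are hypothetical. The example illustrates two properties discussed above: the embedding is unchanged when the whole scene is translated, and it differs depending on which of two pedestrians is the target.

```python
def align(target_pos, neighbour_positions):
    """Express neighbour positions relative to the target node's position."""
    tx, ty = target_pos
    return [(x - tx, y - ty) for x, y in neighbour_positions]


def linear(vec, weight, bias):
    """Stand-in for f: one output per (row of weight, bias entry)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def aggregate(target_pos, neighbour_positions, weight, bias):
    """Element-wise max over f applied to the aligned neighbour attributes."""
    transformed = [linear(rel, weight, bias)
                   for rel in align(target_pos, neighbour_positions)]
    return [max(column) for column in zip(*transformed)]


W = [[1, 0], [0, 1], [1, 1]]   # maps 2-D offsets to 3-D features
b = [0, 0, 0]

emb = aggregate((0, 0), [(1, 0), (0, 2)], W, b)           # [1, 2, 2]
emb_shifted = aggregate((5, 5), [(6, 5), (5, 7)], W, b)   # same scene, moved
emb_i = aggregate((0, 0), [(1, 0)], W, b)   # i as target, j as neighbour
emb_j = aggregate((1, 0), [(0, 0)], W, b)   # j as target, i as neighbour
```

Because the relative offset changes sign when the roles of target and neighbour are swapped, `emb_i` and `emb_j` differ, which is the asymmetry a directed edge requires.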
After the spatial edges are processed, an LSTM is used to process the temporal edges as follows:
$$h_g^{t}(i) = \mathrm{LSTM}_g\left(h_g^{t-1}(i),\ \mathrm{MAX}\left(\left\{ f\left(X_j^t - X_i^t\right) \mid \forall v_j \in N(v_i) \right\}\right);\ W_g\right),$$
where $W_g$ denotes the trainable weights of the LSTM ($\mathrm{LSTM}_g$) and the hidden states of $\mathrm{LSTM}_g$ at the previous time step and the current time step are denoted as $h_g^{t-1}(i)$ and $h_g^{t}(i)$, respectively. Finally, the DGCN-LSTM obtains each pedestrian's social interaction feature vector $h_g^{T_{obs}}(i)$ at a certain level, and in the following sections, we denote $h_g^{T_{obs}}(i)$ of the $k$th level as $H_g^k(i)$.

MSFM.
There are $K$ levels in our graph, so there are $K$ DGCN-LSTMs, and the social interaction feature vectors of node $v_i$ obtained by them can be denoted as $\{H_g^1(i), H_g^2(i), \ldots, H_g^K(i)\}$. We use an MSFM to fuse the social interaction feature vectors of node $v_i$ from all levels. The MSFM computes the weighted sum of $\{H_g^1(i), H_g^2(i), \ldots, H_g^K(i)\}$:
$$H_g(i) = \sum_{k=1}^{K} \alpha_i^k H_g^k(i).$$
The fusion weight $\alpha_i^k$ is computed from the motion feature vector $h_{mo}(i)$ of pedestrian $p_i$ and $H_g^k(i)$; in that computation, $(\cdot)^T$ represents transposition. Here, $H_g^k(i)$ is the corresponding social interaction feature vector at level $k$, the fusion weight $\alpha_i^k$ is a scalar, and $H_g(i)$ is the final fused social interaction feature vector.
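The weighted-sum fusion can be sketched as follows. The exact formula for the fusion weights is not recoverable from the text, so this sketch assumes one plausible choice: a dot product between the motion feature vector and each level's feature vector, normalized with a softmax over the levels. Both the scoring rule and the name `fuse_levels` are assumptions made for illustration only.

```python
import math


def fuse_levels(h_mo, level_features):
    """Fuse K per-level feature vectors into one by a weighted sum.

    h_mo: motion feature vector of one pedestrian.
    level_features: list of K vectors, each with the same length as h_mo.
    Assumed scoring: dot(h_mo, H_k), followed by a softmax over the K levels.
    """
    scores = [sum(a * b for a, b in zip(h_mo, h_k)) for h_k in level_features]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]     # one scalar weight per level
    dim = len(level_features[0])
    fused = [sum(a * h_k[d] for a, h_k in zip(alphas, level_features))
             for d in range(dim)]
    return alphas, fused


alphas, fused = fuse_levels([1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]])
```

With this scoring rule, the level whose feature vector agrees more with the pedestrian's own motion feature receives the larger weight, and the weights always sum to one.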

Motion Decoder.
The motion decoder is used to predict future trajectories based on the motion features and the fused social interaction features.
There are two types of motion decoders: motion decoders without noise and motion decoders with noise. The former makes the whole model deterministic, and the latter makes it stochastic. For the deterministic type, we only concatenate $H_g(i)$ and $h_{mo}(i)$ as the initial hidden state of an LSTM, and we train the model with the L1 loss. For the stochastic type, we concatenate $H_g(i)$, $h_{mo}(i)$, and a noise vector $z$ sampled from a standard Gaussian distribution to form the initial hidden state of an LSTM. Moreover, we train the whole model with the variety loss proposed by SGAN [14] to encourage it to produce diverse samples. At the first prediction time step $T_{obs} + 1$, the decoder gets $\Delta X_i^{T_{obs}}$ as the initial input and predicts the next position offset. The predicted position offsets are denoted as $\{\Delta X_i^t \mid t \in [T_{obs} + 1, T_{obs} + T_{pred}]\}$. In the formulation of the stochastic motion decoder, $L_{de}$ and $W_{pred}$ are the trainable weights of the corresponding linear layers, concat denotes the concatenation operation, and $W_{de}$ denotes the trainable weights of the LSTM ($\mathrm{LSTM}_{de}$).
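The idea behind the variety loss can be sketched as follows: several candidate predictions are generated for the same input, and only the one closest to the ground truth contributes to the loss. The sketch assumes a summed squared error as the per-sample distance, and the function names are hypothetical; it illustrates the selection principle rather than the training code.

```python
def squared_error(pred, truth):
    """Summed squared distance between two trajectories of equal length."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, truth))


def variety_loss(samples, truth):
    """Loss of the best candidate among several sampled predictions."""
    return min(squared_error(sample, truth) for sample in samples)


truth = [(0, 0), (2, 2)]
samples = [
    [(0, 0), (1, 1)],   # misses the final position
    [(0, 0), (2, 2)],   # matches the ground truth
]
loss = variety_loss(samples, truth)   # 0
```

Because only the best candidate is penalized, the remaining candidates are free to cover other plausible futures, which is what encourages diverse samples.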

Datasets.
We evaluate our method on three commonly used datasets: ETH [35], UCY [36], and a high-density pedestrian dataset, the pedestrian walk path dataset [37], which is referred to as PEDWALK in the rest of the article. ETH and UCY contain 1536 pedestrians' real-world trajectories, while PEDWALK contains the manually labeled trajectories of 12684 pedestrians, with coordinates provided in pixels. The image size of PEDWALK is 1920 × 1080 pixels. ETH and UCY consist of a total of five unique scenes: ETH and HOTEL (from ETH) and ZARA1, ZARA2, and UNIV (from UCY). For ETH and UCY, we follow the leave-one-out evaluation methodology in SGAN [14], training on 4 scenes and testing on the remaining one. For PEDWALK, we use 70% of its total frames for training and leave the remaining 30% for evaluation. The interval of trajectory sequences of ETH and UCY is 0.4 seconds, while the interval of trajectory sequences of PEDWALK is 0.8 seconds. We take 8 ground truth positions as observation and predict the trajectories of the following 12 time steps. This means that, for ETH and UCY, we observe for 3.2 seconds and predict the following 4.8 seconds (short-term prediction), while for PEDWALK, we observe for 6.4 seconds and predict the following 9.6 seconds (long-term prediction).

Metrics.
There are two commonly used metrics: average displacement error (ADE) and final displacement error (FDE). ADE is the average L2 distance between the ground truth and the predicted trajectory over all the predicted time steps, and FDE is the distance between the predicted final position and the actual final position at the end of the prediction period $T_{obs} + T_{pred}$. For stochastic models, similar to prior work [14,22], 20 samples are generated and the sample closest to the ground truth is selected to compute ADE and FDE. After checking the code of SGAN, STGAT, and Social-STGCNN, we find that there are two different ways to select the closest sample: selecting the closest trajectory of each pedestrian across samples, as used by Social-STGCNN [23], and selecting the closest whole sample, as used by SGAN [14] and STGAT [22]. A sample includes all pedestrians' trajectories in the scenario for a total duration of $(T_{obs} + T_{pred})$ time steps. Following the tradition of SGAN and STGAT, we select the closest sample to compute the ADE and FDE of MDST-DGCN-S.
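The two metrics and the two selection strategies can be sketched as below. The function names are hypothetical, and averaging over pedestrians is an assumption about how per-scene errors are combined. The example is constructed so that each of two samples is exact for one pedestrian and off by 2 for the other, which makes the difference between the strategies visible.

```python
import math


def ade(pred, truth):
    """Average L2 distance over all predicted time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """L2 distance at the final predicted time step."""
    return math.dist(pred[-1], truth[-1])


def closest_sample_ade(samples, truth):
    """Pick one whole sample (all pedestrians together) with the lowest error."""
    def scene_error(sample):
        return sum(ade(p, t) for p, t in zip(sample, truth)) / len(truth)
    return min(scene_error(sample) for sample in samples)


def closest_trajectory_ade(samples, truth):
    """Pick the best trajectory independently for every pedestrian."""
    best = [min(ade(sample[i], truth[i]) for sample in samples)
            for i in range(len(truth))]
    return sum(best) / len(truth)


truth = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]          # two pedestrians
sample_a = [[(0, 0), (1, 0)], [(0, 3), (1, 3)]]       # exact for pedestrian 0
sample_b = [[(0, 2), (1, 2)], [(0, 1), (1, 1)]]       # exact for pedestrian 1
per_sample = closest_sample_ade([sample_a, sample_b], truth)        # 1.0
per_trajectory = closest_trajectory_ade([sample_a, sample_b], truth)  # 0.0
```

Selecting per trajectory can never give a larger error than selecting per sample, so results obtained with the two strategies are not directly comparable.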

Model Configuration and Training Details.
For the motion encoder, the output dimension of the linear layer is 32 and the hidden state dimension of $\mathrm{LSTM}_{mo}$ is 64. For the MDST-DGEN, the output dimension of $f$ and $\mathrm{LSTM}_g$ is 64. We implement $f$ with a convolution layer. To process nodes in different scenarios in parallel, the fixed neighbour number $m$ needs to be larger than the maximum number of pedestrians in a sample. The most crowded scene in PEDWALK contains 133 pedestrians, and the most crowded scene in ETH and UCY contains 57 pedestrians, so we set $m$ to 135 for PEDWALK and 60 for ETH and UCY. For the motion decoder, the output dimension of $\mathrm{Linear}_h$ is 32, the hidden state dimension of $\mathrm{LSTM}_{de}$ is 64, and the output dimension of $\mathrm{Linear}_{pred}$ is 2. For MDST-DGCN-S, the dimension of the noise vector $z$ is half of the hidden state dimension.
Our implementation is based on the PyTorch library. The model is trained on one NVIDIA GeForce GTX 1080Ti graphics card for 200 epochs. To calculate the variety loss with less GPU memory usage, we generate only 5 possible output predictions for each scene. In training, we use a batch size of 32 and the Adam optimizer with a learning rate of 0.0001. $\{1, 5, +\infty\}$ is the default level distance list for ETH and UNIV, and $\{150, +\infty\}$ is the default level distance list for PEDWALK.
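For reference, the hyperparameters stated in this section can be collected in a single configuration object, as sketched below. The dictionary layout and key names are hypothetical; only the values come from the text. The level distances are taken to be in metres for ETH and UNIV and in pixels for PEDWALK, which is an inference from how the datasets are described.

```python
import math

# Values are taken from the text; the structure and key names are illustrative.
CONFIG = {
    "motion_encoder": {"linear_out": 32, "lstm_hidden": 64},
    "mdst_dgen": {"f_out": 64, "lstm_hidden": 64},
    "motion_decoder": {"linear_h_out": 32, "lstm_hidden": 64,
                       "linear_pred_out": 2},
    "fixed_neighbours_m": {"PEDWALK": 135, "ETH_UCY": 60},
    "level_distances": {
        "ETH_and_UNIV": [1, 5, math.inf],   # as listed in the text
        "PEDWALK": [150, math.inf],
    },
    "training": {"epochs": 200, "batch_size": 32, "optimizer": "Adam",
                 "learning_rate": 1e-4, "variety_loss_samples": 5},
}


def noise_dim(config):
    """Noise vector size: half of the decoder's hidden state dimension."""
    return config["motion_decoder"]["lstm_hidden"] // 2


def num_levels(config, dataset):
    """Number of graph levels K implied by a level distance list."""
    return len(config["level_distances"][dataset])
```

With these values, the noise vector has 32 dimensions, and the graph has three levels for ETH and UNIV and two levels for PEDWALK.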

Quantitative Evaluation.
To validate the proposed MDST-DGCN, we present the prediction performance for both short-term trajectory prediction on ETH and UCY and long-term trajectory prediction on PEDWALK, and we present the prediction performance for various pedestrian densities. We also conduct an ablation study to validate the effects of our multilevel graph and the aligning operation.

As Table 1 shows, MDST-DGCN-D outperforms all deterministic methods and some stochastic methods on ETH and UCY. As Table 2 shows, on PEDWALK, MDST-DGCN-D even outperforms stochastic methods including STGAT. This shows that our model performs well in capturing interaction features, and we think there are three reasons. First, PEDWALK has many more pedestrians in a scene than ETH and UCY, so it has more interaction types and more frequent interaction activities in a sample. Second, high density limits the randomness of pedestrian movement. Third, the prediction horizon on PEDWALK is 9.6 s, while it is 4.8 s on ETH and UCY. When the prediction horizon is short, many movement decisions occur in the observation period and continue into the prediction stage, so many useful cues exist in pedestrians' motion features and it is not necessary to infer from interaction information. High density and long-term prediction enhance the impact of interactions on trajectory prediction, and high density reduces the effect of multimodality.

As Tables 1 and 2 show, when the best sample of 20 predictions is selected to calculate ADE and FDE, MDST-DGCN-S outperforms all methods on PEDWALK and achieves ADE and FDE comparable with STGAT on ETH and UCY. The reasons why MDST-DGCN-S is not better than STGAT on ETH and UCY are the same three reasons given above. When the best trajectory of 20 predictions is selected, MDST-DGCN-S outperforms Social-STGCNN in ADE, but Social-STGCNN gets better FDEs in several subdatasets. This is mainly because errors accumulate when an LSTM is used in our model.
Table 2 presents the results on PEDWALK for various pedestrian densities. We use samples with the specified densities to make the comparison. As the density increases, the performance of each method decreases. Both MDST-DGCN-D and MDST-DGCN-S outperform other methods for various pedestrian densities. When the density is low, such as $10 \le d \le 30$, the performance gap between SGAN and other methods is much smaller, which means that when crowds are sparse, the effects of interactions are smaller and models get fewer useful cues to infer pedestrians' future movements, but multimodality works better. This phenomenon also confirms our previous reasoning.

Table 3 presents the ADEs and FDEs of MDST-DGCN-D with different level distance lists. The level distance list $\{+\infty\}$ means that MDST-DGCN-D models all social interactions at the same level, which is similar to STGAT and Social-STGCNN. Details about the level distance list are presented in the "Graph Construction" section. As shown in Table 3, modelling social interactions with a multilevel graph improves the performance. On UNIV, the level distance list $\{1, +\infty\}$ helps MDST-DGCN-D achieve the largest improvement. This is mainly because UNIV has a higher pedestrian density than the other four subdatasets, so more people walk within one meter, the social comfort distance.

As shown in Table 3, the aligning operation improves the performance on ETH, HOTEL, and UNIV, but it reduces the performance on ZARA1 and ZARA2. Because ZARA1 and ZARA2 are collected in the same place and have the same coordinate system, when either one is used as the test set, the other is in the training set, and the model without aligning can overfit to the shared coordinates.

Qualitative Evaluation.
We compare the predicted trajectories of MDST-DGCN-D and STGAT in Figure 3. Figure 3(a) shows that the target pedestrian is walking in the same direction as a nearby pedestrian A, and he will finally gather with a faraway pedestrian B. Both STGAT and MDST-DGCN-D successfully predict the merging phenomenon. However, MDST-DGCN-D succeeds in predicting that the target pedestrian maintains his relative position with nearby pedestrian A, while STGAT does not. Thus, MDST-DGCN-D obtains more accurate predictions. As shown in Figure 3(b), two pedestrians in a group are changing their directions in advance to avoid collisions with the pedestrians standing in the distance. For the target pedestrian, MDST-DGCN-D assigns a weight of 0.72 to the social interaction feature of the third level, which helps avoid possible collisions with distant pedestrians. In contrast, STGAT successfully predicts the group behaviour but fails to predict the early collision avoidance behaviour. All predictions in Figure 3 indicate that a multilevel graph structure can model social interactions more accurately and comprehensively. We also visualize the trajectory distributions of MDST-DGCN-S and STGAT in Figure 4. As shown in Figure 4, in all three samples (pedestrian avoidance, pedestrian following, and pedestrians walking in a group), our model outperforms STGAT.
We count the distribution of the fusion weight ($\alpha$) on PEDWALK, which shows that the social interaction features of the first level and second level are of different importance in a sample. The distribution of the fusion weight ($\alpha$) is shown in Figure 5.

Conclusions
In this article, we propose a multilevel dynamic spatiotemporal directed graph representation to model the interactions between pedestrians and introduce MDST-DGCN to process the multilevel graph. Experimental results indicate that our multilevel graph structure can model social interactions more accurately and comprehensively and show that MDST-DGCN outperforms most of the state-of-the-art methods.

Data Availability
Previously reported ETH and UCY data were used to support this study and are available at https://doi.

Conflicts of Interest
The authors declare that they have no conflicts of interest.