HAZMAT Vehicle Reidentification in Road Tunnels Based on the Fusion of Appearance and Spatiotemporal Information

Vehicles transporting hazardous material (HAZMAT) pose a severe threat to highway safety, especially in road tunnels. Vehicle reidentification is essential for identifying and warning abnormal states of HAZMAT vehicles in road tunnels. However, there is still no public dataset for benchmarking this task. To this end, this work releases a real-world tunnel HAZMAT vehicle reidentification dataset, VisInt-THV-ReID, including 10,048 images with 865 HAZMAT vehicles and their spatiotemporal information. A method based on multimodal information fusion is proposed to realize vehicle reidentification by fusing vehicle appearance and spatiotemporal information. We design a spatiotemporal similarity determination method for vehicles based on the spatiotemporal law of vehicles in tunnels. Compared with other reidentification methods based on multimodal information fusion, i.e., PROVID, Visual + ST, and Siamese-CNN, experimental results show that our approach significantly improves the vehicle reidentification recognition precision.


Introduction
Hazardous materials (HAZMAT) could endanger the health and safety of people, environment, and property. With the increasing demand of HAZMAT, trafc accidents occurred frequently during HAZMAT transportation, and especially, a risk increase is generally observed in the presence of tunnels [1][2][3], which makes it of great importance to tighten regulation for vehicles transporting HAZMAT in tunnels.
HAZMAT vehicle reidentifcation (ReID) methods face the following challenges in tunnel scenes: (1) the strong refection of the tank of a HAZMAT vehicle can cause large diferences in its appearance under the uneven lighting conditions of a tunnel; (2) it is difcult to distinguish the HAZMAT vehicles with the same vehicle type efectively, due to their close appearance. However, there still remains a research gap both in HAZMAT vehicle data and in specialized algorithms. Tis motivates us to focus on the study of HAZMAT vehicle reidentifcation in tunnels.
Vehicle ReID aims to determine whether a vehicle image captured in nonoverlapping cameras belongs to the same vehicle in trafc monitoring scenarios. Existing methods mainly perform research on vehicle ReID based on the vehicle appearance [4]. However, due to the special and complex tunnel environment containing dim illumination and limited viewing feld, it is more challenging for the tunnel vehicle ReID problem than that in open road scenes [5,6]. Tus, large fuctuation can be seen by merely conducting tunnel vehicle ReID based on the appearance information. As shown in Figure 1, the red, green, and blue lines in each subfgure are RGB channel color histograms for each image. Vehicles for the second and third images may have similar appearance features, whereas they are actually two diferent IDs. From such instance, we can see that in real-world applications, it is extremely sensitive to environmental changes to merely perform vehicle ReID via appearance information.
To address the above problem, except for appearance information, the spatiotemporal information is further leveraged to improve vehicle ReID performance in recent works [7][8][9]. Tis is inspired by the fact that the vehicle movements follow some implicit motion pattern according to the trafc rules. However, due to the randomness of vehicle motion, it is difcult to accurately model the spatiotemporal motion laws of vehicles in the open road. But the trafc rules of vehicles in tunnels are more distinct than in the open road, such as vehicles are expected to move in one fxed direction within limited speed, and no U-turns. It leads to the urgent need for a special spatiotemporal model tailored to the tunnel scene.
Terefore, to realize HAZMAT vehicle ReID in tunnel scenes, this work proposes a vehicle ReID method based on the fusion of vehicle appearance and tunnel spatiotemporal information. For vehicle appearance modeling, a deep residual network (i.e., Resnet50 [10]) is chosen as a feature extractor to model the complex appearance variation of tunnel vehicle. Meanwhile, to capture the spatiotemporal cues between cameras and vehicles, we develop a novel spatiotemporal similarity metric to model the between-vehicle structure correlation as well as the camera-vehicle topological relationship.
Furthermore, the extracted appearance representation and the spatiotemporal model are combined to efciently encode the appearance variation and movement pattern for the tunnel vehicles. Moreover, to evaluate the HAZMAT vehicle ReID problem in the tunnel scenes, we construct and release a realworld HAZMAT Vehicle ReID dataset, named by VisInt-THV-ReID, containing 10,048 images of 865 HAZMAT vehicles collected from four high-resolution cameras. Tese images were captured by 4 cameras in the tunnel. Each camera monitors a space with a range of 150 meters and takes around 3 pictures of vehicles with far, middle, and near distances, respectively. Each vehicle is attached by the camera mileage and the picture shooting time. According to the spatial coordinate transformation method [11], we infer the spatial positions of vehicles in tunnel from the perspective of camera monitoring and obtain their temporal information by comparing timestamps of monitoring cameras. We use the vehicle ReID to determine whether the HAZMAT vehicles are exiting the tunnel within a normal time. If one vehicle passes the tunnel more than once, we identify the HAZMAT vehicle with a diferent vehicle ID for each time in the dataset. More attention is paid to the driving condition of the HAZMAT vehicle each time when it passes through the tunnel. Te proposed method is evaluated to be efective through exhaustive experiments on the VisInt-THV-ReID dataset.
Te main contributions of this work are summarized as follows: (i) We extend the scenarios of vehicle ReID task to the challenging problem of HAZMAT vehicle ReID in tunnel scenes and propose a method that fuses both appearance modeling and spatiotemporal mining for more precise vehicle ReID. (ii) We design a spatiotemporal metric approach based on the movement law of vehicles in road tunnels which brings in the description of between-vehicle structure correlation as well as the camera-vehicle topological relationship. (iii) We build a real-world tunnel HAZMAT vehicle ReID dataset, named as VisInt-THV-ReID. As far as we know, the released VisInt-THV-ReID is the frst HAZMAT vehicle ReID dataset captured in the tunnel scenes, which is crucial for the promotion of automatic regulation of HAZMAT transportation. Exhaustive experiments demonstrate that the proposed method can generate a state-of-the-art performance.
Te rest of this work is organized as follows: Te review related works are presented in Section 2. Section 3 details the proposed HAZMAT vehicle ReID method. In Section 4, we execute experiments for the evaluation of the proposed approach on VisInt-THV-ReID. Finally, we conclude this work in Section 5.

Related Work
Vehicle ReID in trafc monitoring scenarios can be seen as a part of multicamera tracking. Given an image of a vehicle in a specifc area, the task is to fnd its image as captured under  other cameras. Tis work studies vehicle ReID with spatiotemporal information fusion in tunnel scenes. We introduce related work from the aspects of vehicle ReID in tunnel scenes and multimodal information fusion.

Vehicle ReID Methods in Tunnels.
Vehicle ReID in tunnel scenes is challenging due to low resolution, dim light, and dramatic changes in vehicle appearance. A vehicle is detected and tracked by each camera in road tunnels, and a detected vehicle is matched with the previous camera. Frías-Velázquez et al. [6] proposed a probabilistic framework based on a two-step strategy that reidentifes vehicles in road tunnels. Tey built a Bayesian model that fnds the optimal assignment between vehicles of connected groups based on descriptors such as trace transform signatures, lane changes, and motion discrepancies. Rios-Cabrera et al. [12] presented an integrated solution to detect, track, and identify vehicles in a tunnel surveillance application, taking into account practical constraints, such as real-time operation, imaging conditions, and decentralized architecture. AdaBoost [13] cascade is used for vehicle detection, and a comprehensive confdence score integrates the information of all stages of the cascade. Jelača et al. [14] proposed a real-time tracking method of multiple nonoverlapping cameras in a road tunnel monitoring scene, using AdaBoost for vehicle detection. Te vehicle detector and a Kalman flter of average optical fow are used for tracking. Te ReID algorithm applies the projection feature similarity of a radon transform between vehicle images. Chen et al. [15] proposed a spatiotemporal successive dynamic programming algorithm to identify vehicles between pairs of cameras. Tey extracted features based on Harris corner detection and OpponentSIFT descriptors, considering color information [16]. Zhu et al. [5] proposed a synergistically cascaded forest model to gradually construct the linking relationships between vehicle samples with increasing alternative random forest and extremely randomized forest layers.
Te abovementioned methods generally focus on the extraction of hand-designed features of vehicle images, which can only show good performance in specifc scenes. Tese manual features are susceptible to the interference of a complex tunnel environment, and they are difcult to improve the precision of ReID.

Methods Using Multimodal Information.
As a vehicle is far from cameras and the illumination is insufcient, the image resolution is low. Due to their similarity, it is impractical to efectively identify HAZMAT vehicles without special markings only by appearance. Recent work on vehicle ReID has improved the model by combining multidimensional information of vehicle attributes such as type, color, time, and space information with appearance features.
To reidentify vehicles based on fusion diferent appearance information, Liu et al. [17] designed a network using BOW-SIFT [18], BOW-CN [19], and GoogLeNet [20] to extract texture, color, and semantic features, respectively. Handmade features are fused with the vehicle type and color features obtained through deep learning. Liu et al. [21] proposed PROVID, which makes full use of appearance features, license plates, camera locations, and semantic information to carry out a progressive search from coarse to fne in the feature domain and from near to far in physical space.
To reidentify vehicles based on spatiotemporal information, Zhong et al. [7] proposed a vehicle pose guide model using a spatiotemporal probability model based on the Gaussian distribution to predict the spatiotemporal motion of vehicles. A convolution neural network (CNN) was used to predict the driving direction of a vehicle and the results of visual appearance, and then, the driving direction and spatiotemporal models were fused. Shen et al. [8] proposed a two-stage framework incorporating complex spatiotemporal information to efectively regularize ReID results. A candidate visual-spatiotemporal path was generated by a chain Markov random feld model with a deeply learned potential function. A Siamese-CNN + Path-LSTM model takes the candidate path and pairwise queries to generate a similarity score. Jiang et al. [9] proposed an approach with a multibranch architecture and a reranking strategy using the spatiotemporal relationship among vehicles from multiple cameras.

Method
3.1. Overview. Typically, a tunnel surveillance system consists of a series of cameras C � C 0 , C 1 , C 2 , . . . , C M , with nonoverlapping visual receptive felds. A i �→ denotes the 2048dimensional appearance feature vector obtained from the i -th vehicle image through the image appearance feature extraction network, and S i → denotes the spatiotemporal feature vector of the i-th vehicle collected by the camera. Te spatiotemporal features involved are velocity v i , timestamp t i , and space position l i of the tunnel.
We use P a (i, j) to represent the similarity of the appearance feature vectors of vehicles i and j from upstream and downstream cameras and P st (i, j) to represent the similarity of the spatiotemporal features of the vehicle pairs. P(i, j) is the probability that vehicle pairs are identical after fusing multimodal information. Te inputs of the proposed model are vehicle image pairs (i, j) and their spatiotemporal features ( S i → , S j → ) involved velocity, timestamp, and space position in the tunnel. Te output is the probability P(i, j) of whether the pair of vehicle images is the same vehicle.
Te framework of the proposed method has three parts, as shown in Figure 2.
(1) Similarity calculation of vehicle appearance features.
Resnet50 [10] is used as the feature extractor to obtain a 2048-dimensional appearance feature vector of a vehicle. (2) Based on the spatiotemporal movement law of HAZMAT vehicles, we calculate the theoretical distance and the actual distance of the vehicle pairs. Te tunnel spatial discrepancy ε ij is used to evaluate Computational Intelligence and Neuroscience the diversity between the theoretical distance and the actual distance. (3) Similarity calculation of multimodal information fusion. Based on parts 1 and 2, the spatiotemporal and appearance similarity of the input vehicle image pairs are summed with a weight. We rerank the vehicle similarity of fusion information.

Appearance Features of Vehicle ReID.
Te vehicle appearance feature extraction network is shown in Figure 3. We use Resnet50 as the feature extraction backbone network and adjust each image to 256 × 128 pixels. Given an input image x i with label y i , the predicted probability of x i being recognized as class y i is encoded with a softmax function, represented by p(y i | x i ). ID prediction p(y i | x i ) is used to calculate ID loss [22]. Te model outputs ReID feature A i �→ which is used to calculate triplet loss [23]. Te output dimension of the full connection layer is changed to the number of vehicle IDs in the training dataset. Te ID loss treats the training process of vehicle ReID as an image classifcation problem [24], i.e., each identity is a distinct class. In the testing phase, the output of the pooling layer or embedding layer is adopted as the feature extractor. Te identity loss is then computed by the cross-entropy.
where N represents the number of training samples within each batch.
Te triple loss for feature extraction can reduce the intraclass distance of positive pairs and increase the interclass distance of negative pairs. Given a triplet (x a , x p , x n ), including an anchor image x a , a positive x p , and negative x n , the triplet loss is formulated as follows: where α is a margin and usually set to 0.3. N is the number of training samples within each batch. f(•) stands for the appearance feature extractor.
In this work, we use ID loss and triplet loss together for optimizing the model. For image pairs in the embedding space, ID loss mainly optimizes the cosine distances while triplet loss focuses on the Euclidean distances. Te feature vectors of the two losses are inconsistent in the embedding space. To address this problem, the BNNeck [22] is applied for more efective loss computation. BNNeck adds a batch normalization (BN) layer before the classifer FC layers of the model. Te feature before the BN layer is denoted as A i �→ . We let A i �→ pass through the BN layer to acquire a normalized feature a i → . In the training stage, the feature A i �→ is used to compute the triplet loss. Te feature a i → is used to compute the ID loss. Finally, the triplet loss and ID loss are combined to optimize the model. To train the ReID model, we combine ID loss and triple loss as follows: In the test stage, the appearance features ( A i �→ , A j �→ ) for input image pairs (i, j) are generated using the vehicle   Computational Intelligence and Neuroscience appearance feature extraction network. We use the cosine distance to measure the similarity between features and is expressed as follows:

Vehicle Spatiotemporal Features. Te motion of the vehicle is limited by its speed and spatiotemporal motion.
Te time that the vehicle travels through a pair of cameras should be within a reasonable range. In a highway tunnel monitoring system, the driving speed of a vehicle is within the range of 10-80 km/h. Te time interval of vehicle movement is afected by the camera installation position and the topological relationship of the tunnel and cameras. We analyze the motion law of the vehicle time interval between cameras in the VisInt-THV-ReID dataset. For each pair of cameras, the vehicle space interval can be modeled as a random variable that follows a certain distribution [6,7]. In order to derive the spatiotemporal similarity probability distribution of the vehicle, we propose a feature called spatial discrepancy. We introduce the spatial discrepancy by considering Figure 4(a). Tis fgure shows the spatiotemporal graph that relates vehicle i observed in upstream camera with another vehicle j observed in downstream camera. Te motion variables involved are velocity v i of vehicle i, timestamp t i , and space position l i of the tunnel. Te state vector S i → expresses the spatiotemporal state of vehicle i.
To construct the spatiotemporal similarity relationship between the vehicle pairs, we calculate the theoretical distance and the actual distance of the vehicle pairs and defne the indicator ε ij to calculate the diversity of the distances. According to the constant acceleration model, the theoretical distance of the vehicle is calculated as follows according to the upstream and downstream cameras of the tunnel: Te actual distance between the current position of the vehicle collected by the upstream and downstream cameras is expressed as follows: Te spatial discrepancy ε ij evaluates the ftness between the displacement estimate s ij and the actual distance l ij as stated in Figure 4(a). Te tunnel spatial discrepancy is expressed as follows: which is used to evaluate the diversity between the theoretical distance and the actual distance. Te spatial discrepancy ε ij is evaluated by the vehicle spatiotemporal features involving velocity, timestamp, and space position.
To maintain the consistency of the data structure of the multimodal data fusion, we maintain the consistency of the spatiotemporal similarity discriminant method with the appearance feature discriminant method and use the chord function to represent the spatiotemporal similarity probability distribution of the vehicle. Te P st (i, j) is defned as follows: As shown in Figure 4(b), P st (i, j) increases as ε ij tends to 0. Based on P st (i, j), we can determine candidate matching vehicles according to the spatiotemporal similarity in tunnels.

Vehicle ReID by Fusing Image and Tunnel Spatiotemporal
Information. To make full use of the vehicle appearance and spatiotemporal information, we establish a multimodal information strategy. Te vehicle ReID probability is defned as follows: where the weight coefcient, λ ∈ (0, 1), is used to fuse the spatiotemporal and appearance similarity.  Computational Intelligence and Neuroscience

VisInt-THV-ReID Dataset.
We verifed the efectiveness of the proposed method on the VisInt-THV-ReID (Te dataset is open-sourced at the following website: https:// github.com/jialei-bjtu/VisInt-THV-ReID) dataset, which is collected from four cameras deployed in Taijia Expressway Linxian No. 3 tunnel in Shanxi province, China, providing high-defnition video data of 6 million pixels and spaced at 300 meters. We collected video data for 10 hours daily over 3 days, from November 26 to 28, 2020, from 10:00 to 20:00. We annotated 10,048 pictures of 865 HAZMAT vehicles with their spatial position, speed, and timestamp information. To the best of our knowledge, this is the frst opensource HAZMAT vehicle ReID dataset. Te sample dataset is shown in Figure 5.
To mark the spatiotemporal and speed information of a vehicle, we must transform its spatial coordinates. Perspective transformation is used to transform the vehicle driving area under the camera vision to a fxed-size rectangle [11], as shown in Figure 6.
Te position (x i , y i ) of a vehicle in the camera feld of view in the tunnel is calculated as follows: where x i is the lateral distance of the vehicle from the left wall of the tunnel, y i is its longitudinal distance from the current camera installation position, (x o , y o ) is the lower midpoint of the vehicle object detection box in the image, and T is the transformation matrix defning the mapping between the original region and the transformation region.
Using the image sequence taken by the surveillance camera, the speed of vehicle i in the tunnel can be obtained as follows: where f is the frame rate of the monitoring camera, the spatial position vector l i obtained by the camera at time t i is (x i , y i ), and the spatiotemporal vector of vehicle i is We trained and tested the model on the VisInt-THV-ReID dataset, whose 10,048 images of 865 HAZMAT vehicles were divided into training, query, and test sets at a 10 : 1 : 9 ratio. Te training set had 433 HAZMAT vehicles and 4980 images. Tere were 432 HAZMAT vehicles in the query and test sets, with 432 vehicle images in the query set and 4636 in the test set.

Experimental Settings.
Te mAP [21] and cumulative matching characteristic (CMC) curve [25] were used to evaluate the performance of the proposed method on the VisInt-THV-ReID dataset. Te average precision for a query image is calculated as follows: where n is the number of images in the test set, N gt is the number of ground truths, P(k) is the current precision result of the k-th query image, and gt(k) is an indicator function. When the matching result of the k-th query image is correct, gt(k) � 1, and gt(k) � 0 when it is incorrect. Te mAP is calculated as follows: where Q is the number of pictures in the query dataset. Te CMC curve shows the probability that the correct matching image of the vehicle appears in the candidate lists. Te CMC of the k-th position is as follows: (v j ,t j ,l j )  Computational Intelligence and Neuroscience where gt(q, k) is an indicator function, which equals 1 when the ground truth of the q query image appears before the k position. We also used Rank-1, Rank-5, Rank-10, and Rank-20 in the feld of ReID to evaluate the model. Te above results show that the multimodal information fusion method is superior to the use of appearance or spatiotemporal information alone and verify the efectiveness of the proposed multimodal information fusion method. Table 2 shows the recognition precision of three baseline methods, PROVID [21], Visual + ST [7], and Siamese-CNN [8], comparing to that of Visual + ST-COS on the VisInt-THV-ReID dataset.

Appearance Feature Extraction and STR Spatiotemporal Fusion (PROVID).
Te method of PROVID extracts the appearance features of HAZMAT vehicles by the Resnet50 network and uses the STR method to measure the spatiotemporal relationship [21]. Te STR is defned as follows: where T i and T j are the timestamps for the vehicles i and j captured by the cameras. T max is the maximum time interval of vehicles passing through the tunnel. δ(C i , C j ) is the actual distance between the current position of the vehicles collected by the upstream and downstream cameras, and D max is the global maximum distance between any vehicles. We set D max as the length of the tunnel.

Visual + ST.
Te method of Visual + ST extracts the appearance features of HAZMAT vehicles with the Resnet50 network and uses a spatiotemporal model based on the Gaussian distribution to predict the probability of vehicles [7]. P stG (i, j) presents the similarity of the spatiotemporal features of vehicle pairs, and it is defned as follows: where ε ij is the tunnel spatial discrepancy as defned in equation (7).

Siamese-CNN.
Te method of Siamese-CNN uses a Resnet50 network to extract the appearance features of HAZMAT vehicles, and a multilayer perception network is applied to obtain their spatial and temporal relationships [8].
Te spatiotemporal branch computes the spatiotemporal compatibility. Given the timestamps (t i , t j ) and the positions (l i , l j ) of vehilces, the input features of the branch are calculated as their time diference ∆t(t i , t j ) and spatial diference ∆d(l i , l j ). Te scalar spatiotemporal compatibility is obtained by feeding the concatenated features, [∆t(t i , t j ), ∆d(l i , l j )] T , into a multilayer perception with two fully connected layers. Te outputs of the two branches are concatenated and input into a 2 × 1 fully connected layer with a sigmoid function to obtain the fnal compatibility between the two states. Siamese-CNN takes all visual, spatial, and temporal information into consideration. Te results show that the proposed method achieves the best performance. It improves mAP and Rank-1 by 9.7% and 4.2%, respectively, compared with PROVID. Tis indicates that the STR spatiotemporal measurement method is not accurate enough to express the spatiotemporal information of vehicles in road tunnels. Compared with Siamese-CNN,   Te bold values in Table 1 are the best values from the same column of data. Te bold values in Table 2 are the best values from the same column of data.  the proposed method improves mAP and Rank-1 by 17.5% and 3.0%. Since Siamese-CNN uses a multilayer perception network to train the spatial and temporal information of vehicles, the difculty of model training is decreased and the precision is not ideal. Compared with Visual + ST, the proposed method improves mAP and Rank-1 by 8.9% and 3.7%, respectively. Tis shows that the proposed cosine spatiotemporal model can more accurately express the spatiotemporal state of a tunnel compared with Gaussian distribution. Te CMC curves of all methods are shown in Figure 7.

Parameter Analysis.
We experimented with the parameters of λ in the interval of 0.1-0.9. Te best fusion result is achieved when λ equals 0.35. Te comparison results of the parametric experiments are shown in Table 3. It can be observed from the table that a larger λ would cause appearance features to dominate vehicle identifcation, while a smaller λ causes spatiotemporal information to dominate. Table 3 shows that λ can have an important efect on the fusion results, and λ is relatively insensitive to the results in the interval 0.3-0.7.

Conclusion and Future Work
In this study, we presented a vehicle ReID method based on the fusion of vehicle appearance and tunnel spatiotemporal information for the task of HAZMAT vehicle ReID in road tunnels. Te proposed method was evaluated on the VisInt-THV-ReID dataset. Tis study could play a role in promoting HAZMAT vehicle monitoring and trafc safety management in road tunnels. Our future work has two aspects. Based on vehicle ReID research, we will study multicamera vehicle tracking technology to collect vehicle trajectories. In addition, we will use the time-to-collision (TTC) to indirectly evaluate safety and study a tunnel accident risk prediction model based on the trafc fow state.

Data Availability
Te data that support the fndings of this study are openly available in GitHub at https://github.com/jialei-bjtu/VisInt-THV-ReID. Te bold values in Table 3 are the best values from the same column of data.
Computational Intelligence and Neuroscience 9