Truck-Lifting Prevention System Based on Vision Tracking for Container-Lifting Operation

Truck-lifting accidents are common in container-lifting operations. Previously, operation sites needed to assign workers for observation and guidance; however, with the development of automated equipment in container terminals, an automated accident detection method is required to replace these manual observers. Considering the progress of vision detection and tracking algorithms, this study designed a vision-based truck-lifting prevention system. The system uses a camera to detect and track the movement of the truck wheel hubs during the operation to determine whether the truck chassis is being lifted. The hardware of the system is easy to install and is compatible with most container-lifting equipment. The accident detection algorithm combines convolutional neural network detection, traditional image processing, and a multitarget tracking algorithm to calculate the displacement and posture of the truck during the operation. The experiments show that the measurement accuracy of the system reaches 52 mm and that it can effectively distinguish the trajectories of different wheel hubs, meeting the requirements for detecting lifting accidents.


Introduction
Container terminals are facilities that provide storage and distribution services for container transportation. With the sustained growth of global maritime trade, the development focus of container terminals has moved to automation and unmanned operations. Terminals with a high level of automation are called automated container terminals (ACTs). The advantages of ACTs are obvious: they use automated equipment to replace on-site workers, which improves operation efficiency, reduces operating costs, and improves worker safety [1].
In the terminal operation process, containers need to be transferred between various storage areas and transfer equipment. These transfer operations are called container-lifting operations and are performed by container-lifting equipment, such as rail-mounted gantry cranes (RMGs) [2]. The truck-lifting accident is an accident that occurs in container-lifting operations; an example is shown in Figure 1. When the container is lifted while the container lock pins are not released, the truck is lifted together with the container, which can damage the container and the truck and endanger on-site workers.
In traditional container terminals, the container-lifting operation requires on-site workers to confirm whether the lock pin is fully released. However, the ACT requires a reduction in the number of on-site workers, and an automated accident detection method is required to prevent accidents.
Truck-lifting prevention can be considered a target detection and tracking problem: the characteristics of the truck must be detected and recognized and then used to calculate the displacement of the truck during the operation process. Existing solutions for truck-lifting prevention are based on laser scanners, such as the laser radar-based truck chassis positioning technology proposed by Chao-feng [3]. The laser scanner scans the contour of the target and restores it to a 3D model; by analyzing the geometry of the model, the system calculates the size and position of the target [4]. This technology has high detection accuracy and is not affected by weather or light conditions. However, it relies on a high-precision laser scanner, which is expensive [5].
With the development of image sensors and computer vision algorithms, vision-based measurement (VBM) technology has become widespread in recent years. It consists of only a camera and an image processing device, which makes its hardware cost much lower than that of the laser scanner solution. VBM technology has been widely used in industrial measurement; one of its typical applications is automated inspection for product quality control [6]. In container terminals, vision-based detection technology is used in many applications [7], particularly in the recognition of complex features such as container numbers [8] and container corner castings [9].
In addition to the lower equipment cost, vision-based detection technology has two further advantages. One is that it can achieve high measurement accuracy through noncontact measurement [10], because vision-based detection uses CMOS or CCD cameras with high image resolution to obtain image information. The other is its ability to recognize complex features, which stems from convolutional neural network (CNN) technology [11].
CNNs can recognize and classify complex features in images, such as face features [12] and tumors [13]. Different recognition CNNs are also straightforward to retrain for new targets. Compared with previous classifiers (such as SVMs), CNNs achieve a higher detection rate, higher detection accuracy, and shorter calculation time [14, 15].
Nevertheless, the detection accuracy of CNNs is not perfect. The detection result of a CNN is the area with the highest probability of containing the target, so there is generally some deviation between the detected result and the target. Traditional image processing, in contrast, has pixel-level accuracy and can achieve higher precision under the premise of successful detection.
Vision-based target tracking technology has been used in several applications, such as ship recognition and tracking based on video information [16, 17] and vehicle tracking based on aerial videos [18]. These technologies are usually based on detection-based tracking (DBT) [19], mainly because of the excellent target detection ability of CNNs. The tracking principle of DBT is to use a CNN to detect the target in an image and then use an association algorithm to link the same target across different frames [20]. This approach has achieved good tracking results, shifting the main problem of vision tracking from detection to association.
This study proposes a truck-lifting prevention system based on vision-based detection and tracking algorithms to provide a low-cost, easy-to-retrofit automated accident detection system for container-lifting operations. The system combines CNN detection with traditional image processing algorithms for target detection and uses a DBT multitarget tracking algorithm. It calculates the displacement of the truck wheel hubs and determines whether an accident has occurred. Because it uses cameras to capture operation information, the system supports real-time remote monitoring. Moreover, it can switch to manual monitoring when the accident detection algorithm fails, a function that laser scanner solutions cannot provide.

System Design and Control Principle
This system uses cameras as information capture devices, which makes it suitable for installation on most container-lifting equipment. Figure 2 shows the installation on a rail-mounted container gantry crane (RMG), a typical piece of container-lifting equipment in a container terminal, and Figure 3 shows the actual installation of the cameras. At the operation site, container trucks operate only on the truck road; therefore, the cameras were installed on the RMG leg to capture image information of the trucks. Because the container truck has a long chassis, two sets of cameras were installed on the RMG leg to cover the whole area. The lifting prevention process is shown in Figure 4. When the operation starts, the cameras capture images of the side of the truck and send them to the image processing unit (IPU) for calculation. In the IPU, the wheel hubs of the truck are detected first, and then the movement trajectory of each wheel hub during the operation is tracked to determine whether the truck has been lifted. When an accident is detected, the IPU sends the accident information to the automated crane control system (ACCS), which stops lifting the container spreader by controlling the programmable logic controller (PLC).

The reliability of the equipment was also considered. In the traditional operation process, the container-lifting operation is guided by on-site workers. We did not install a backup system because it would add extra cost and complicate the communication systems; when the system fails, falling back to the traditional method is considered acceptable. Because the camera is installed at a low position to capture images of the wheels, its lens can easily be wiped clean when it is stained.
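The decision step of this flow can be illustrated with a small sketch. All names and the threshold value below are illustrative placeholders, not the authors' actual code: the IPU compares each tracked hub's vertical displacement against a limit and tells the ACCS whether to stop the spreader.

```python
# Minimal sketch of the IPU decision step: compare each tracked wheel
# hub's vertical rise against a pixel limit and emit a stop signal.
# Names, data shapes, and the default limit are illustrative only.

def check_lifting(track_start_y, track_current_y, limit_px):
    """Return True if a hub has risen more than limit_px pixels.

    Image y grows downward, so a lifted hub has a *smaller* y value.
    """
    return (track_start_y - track_current_y) > limit_px

def ipu_step(tracks, limit_px=12):
    """tracks: {hub_id: (start_y, current_y)} -> 'STOP' or 'CONTINUE'."""
    lifted = [hub_id for hub_id, (y0, y1) in tracks.items()
              if check_lifting(y0, y1, limit_px)]
    return "STOP" if lifted else "CONTINUE"
```

In a real deployment the returned signal would be forwarded to the ACCS, which halts the spreader through the PLC.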

Truck-Lifting Detection Algorithm
There are several types of container trucks; therefore, it is difficult to directly recognize the truck chassis and measure its displacement. However, truck tires have standard specifications, and considering that the tires deform under normal loading conditions, we calculated the displacement of the truck chassis by detecting the coordinates of the wheel hubs in the image. Because the work site is an open-air environment, the light conditions are unstable, and the color and contamination conditions of different trucks also vary. We therefore first used neural network detection to obtain a high target recognition rate, then used traditional image algorithms to improve the detection accuracy, and finally used a Deep-Sort-based tracking algorithm to distinguish and track the different wheel hubs.

First Wheel Hub Detection Based on the Modified SSD.
SSD (Single Shot MultiBox Detector) [21] is a feedforward convolutional network. It uses anchor boxes with different aspect ratios and sizes to sample the image, and several feature layers with different receptive fields are used to extract and classify features. Owing to this design, SSD has a higher detection speed than two-stage methods, such as Fast region-based convolutional network (Fast R-CNN) [22], making it suitable for real-time detection.
To achieve the best detection performance, we made some modifications to the SSD network. The original SSD uses VGG-16 [23] as its convolutional backbone, which we replaced with ResNet [24], a newer CNN model that uses deeper networks to extract more feature information. The structure of the modified SSD model is illustrated in Figure 5.

The Second Detection Stage Based on Traditional Image Processing.
The result of SSD detection is not the target itself but the area that contains the target with the greatest probability, so the detection result usually has a positioning error relative to the actual target position. Traditional image processing algorithms, in contrast, have pixel accuracy; however, applied to an entire image, they take too long to satisfy the requirements of real-time detection. Therefore, after SSD detection, we performed a second wheel hub detection based on traditional image processing to improve the detection accuracy.
A flowchart of the second detection is shown in Figure 6. The input data are the wheel hub image detected by the SSD. The result detected by the SSD is defined in (1), where x_0 and y_0 are the center coordinates of the detection result and s_0 and r_0 are its size and aspect ratio, respectively:

D_0 = (x_0, y_0, s_0, r_0). (1)

The first part is the preprocessing operation. We used the single-scale Retinex (SSR) algorithm to enhance the information in the dark areas of the image because the operation site is open air and the light conditions are unstable. SSR was proposed by Jobson et al. [25] and is based on Land's Retinex theory [26].
This enhancement algorithm uses a Gaussian wrap function to convolve the image, and its expression is as follows:

R_i(x, y) = log I_i(x, y) - log[F(x, y) * I_i(x, y)], (2)

where I_i(x, y) is the original color value of point (x, y) on color channel i, R_i(x, y) is the enhanced color value, and F(x, y) is the Gaussian wrap function, calculated as shown in (3):

F(x, y) = λ exp(-(x^2 + y^2)/c^2). (3)

Here, c represents the scale of the Gaussian wrap, that is, the neighborhood size of (x, y) in the convolution operation, and λ is a scale parameter chosen so that (4) holds:

∬ F(x, y) dx dy = 1. (4)

The enhanced image is the merged result of the individual color channels.
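As a concrete illustration of the SSR enhancement, the following is a minimal numpy sketch of one color channel, approximating the 2-D Gaussian wrap with a separable 1-D kernel; the parameter values and function names are our own, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(c, radius):
    """1-D Gaussian exp(-x^2/c^2), normalised so it sums to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (c ** 2))
    return k / k.sum()

def ssr(channel, c=80.0, radius=None):
    """Single-scale Retinex on one colour channel (2-D float array)."""
    if radius is None:
        radius = int(3 * c)
    k = gaussian_kernel(c, radius)
    # Separable row/column convolution approximates the 2-D wrap F(x, y).
    blurred = np.apply_along_axis(lambda row: np.convolve(row, k, mode="same"), 1, channel)
    blurred = np.apply_along_axis(lambda col: np.convolve(col, k, mode="same"), 0, blurred)
    # R_i = log I_i - log(F * I_i); the +1 avoids log(0).
    return np.log(channel + 1.0) - np.log(blurred + 1.0)
```

For a uniformly lit region the blurred value equals the original, so R is near zero; dark regions adjacent to bright ones receive a positive boost, which is the enhancement effect used here.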
Next, we used an adaptive HSV threshold to filter out the wheel hub area in the image. The HSV color space separates colors by hue H, saturation S, and brightness value V. Since the wheel hub area is usually the brighter part of the image, it can be extracted by filtering out the lower-brightness parts. The HSV thresholding is calculated as shown in (5), where T_V(x, y) is the value of pixel (x, y) in the HSV V channel and Th(x, y) is the new pixel value:

Th(x, y) = T_V(x, y), if T_V(x, y) >= Thresh_V; Th(x, y) = 0, otherwise. (5)

Thresh_V is the threshold, calculated from the average pixel value of the image and adjusted by the value ε, as shown in (6).

The second part is the detection of contours in the preprocessed image and the calculation of the largest circle among the contours by Hough circle detection. Owing to the light conditions and lens distortion, wheel hub images near the edge of the frame exhibit some deformation and defects. We used the adaptive method shown in (7) to adjust the threshold of the Hough circle detection accumulator so that the detection can find the largest circles even when their shapes are imperfect. In (7), X is the horizontal resolution of the image, A_0 and A_1 represent the original and adjusted thresholds of the accumulator, respectively, and c is the adjustment ratio.

The second detection result is defined as D_1, as shown in (8):

D_1 = (x_1, y_1, s_1, r_1). (8)

Because the second detection is unstable, when the second detection result D_1 deviates greatly from the first detection result D_0, the second detection should be considered to have failed. Therefore, the final detection result requires re-evaluation, and we used (9) to estimate whether the result of the second detection is suitable as the final detection result, where N_0.95 is the maximum error of the first detection at 95% confidence, calculated by normal fitting.
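The adaptive V-channel thresholding of (5) and (6) can be sketched as follows; since the exact form of (6) is not reproduced here, we assume the adjustment ε is simply added to the image mean, and the function name is ours:

```python
import numpy as np

def adaptive_v_threshold(v_channel, eps=10.0):
    """Adaptive brightness threshold in the spirit of (5) and (6).

    v_channel: 2-D array of HSV brightness values T_V(x, y).
    Assumption: Thresh_V = mean(T_V) + eps (additive adjustment).
    Pixels below Thresh_V are suppressed to 0, keeping the brighter
    wheel-hub region.
    """
    thresh_v = v_channel.mean() + eps
    out = np.where(v_channel >= thresh_v, v_channel, 0.0)
    return out, thresh_v
```

The surviving bright region would then be passed to contour extraction and Hough circle detection as described above.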

Trajectory Tracking Based on the Modified Deep Sort.
Deep Sort [27] is an online multiobject tracking algorithm proposed by Wojke et al. in 2017. As a DBT algorithm, Deep Sort tracks based on detection result data, making it suitable for combination with CNN detection or traditional image processing detection. We modified the tracking process of Deep Sort to improve the tracking speed; the new tracking process is shown in Figure 7.

Deep Sort uses the state vector shown in (10) as the description model of a target. Here, u and v are the center coordinates of the target detection result, c and h represent the aspect ratio and height of the detection result, and u̇, v̇, ċ, and ḣ are the corresponding rates of change, used to predict the target position in the next frame by Kalman filtering, an algorithm that uses a series of measurements observed over time to produce estimates of unknown variables:

x = (u, v, c, h, u̇, v̇, ċ, ḣ)^T. (10)

The predicted result is used to match the detection results in the next frame. The matching algorithm is based on the Kuhn-Munkres algorithm, which uses the IOU value of the prediction result and the detection result as the weight to associate tracking targets. The calculation of the IOU is shown in (11), where Dete is the detection result and Pred is the prediction result; the detection result closest to the prediction result is classified as the same target:

IOU = area(Dete ∩ Pred) / area(Dete ∪ Pred). (11)

To solve the problem of target loss when a target passes behind obstacles, the original Deep Sort uses the Mahalanobis distance and a convolutional appearance descriptor of the target to match detection results with existing trajectories. However, the calculation of the convolutional descriptor takes a long time, which makes the calculation time of Deep Sort much longer than that of Sort [28]. Therefore, we used only the Mahalanobis distance as the standard for trajectory matching.
Journal of Advanced Transportation

The calculation of the Mahalanobis distance is shown in (12), where d_(i,j) is the motion matching value between trajectory i and detection result j, d_j is the detection, y_i is the predicted state of trajectory i, and S_i is the covariance matrix of the observation space in this frame, obtained from the Kalman filter:

d_(i,j) = (d_j - y_i)^T S_i^(-1) (d_j - y_i). (12)
Because the motion of the target is continuous, the Mahalanobis distance can be used to screen the detection results. The screening calculation is given in (13), where t^(1) is the threshold defined by the 0.95 quantile of the chi-square distribution. When d_(i,j) is below the threshold, trajectory i is associated with detection result j:

b_(i,j) = 1[d_(i,j) <= t^(1)]. (13)
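The Mahalanobis gating of (12) and (13) can be sketched in a few lines of numpy. The function names are ours, and the threshold value is the standard 0.95 chi-square quantile for a 4-dimensional measurement (u, v, c, h), taken from statistical tables:

```python
import numpy as np

def mahalanobis_sq(detection, prediction, S):
    """Squared Mahalanobis distance d_(i,j) between a detection d_j and
    a Kalman-predicted track state y_i, with innovation covariance S_i."""
    diff = detection - prediction
    return float(diff @ np.linalg.inv(S) @ diff)

# 0.95 quantile of the chi-square distribution with 4 degrees of freedom.
CHI2_95_4DOF = 9.4877

def gate(detection, prediction, S, t=CHI2_95_4DOF):
    """Admissible association per (13): d_(i,j) <= t^(1)."""
    return mahalanobis_sq(detection, prediction, S) <= t
```

Detections passing the gate would then be assigned to trajectories by the Kuhn-Munkres matching described above; with an identity covariance the distance reduces to the squared Euclidean distance.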

Experiment
The core of this lifting prevention system is the wheel hub detection and tracking method. We used a typical industrial computer configuration to verify our method, with the following specifications: CPU: Intel i7-6700; GPU: Nvidia GeForce GTX 970 (4 GB). The first detection algorithm was implemented in PyTorch [29], and the second detection and tracking algorithms were implemented using OpenCV [30] in a Python environment. The images used in the experiment were captured by a camera, as shown in Figure 2; the image resolution was 1920 × 1080 at 24 fps.

Evaluation of Wheel Hub Detection.
The modified SSD was trained on 3000 images of the side of container trucks. The trucks in these images were driven on the truck road next to the RMG, and the distance between the camera and the trucks was approximately 4-6 m. The first detection result is presented in Figure 8. The performance evaluation of the first detection used 500 test images, and the evaluation of the second detection used 500 images containing only the tire part.
The test results are listed in Table 1. The horizontal error is the distance between the detection result and the center of the wheel hub in the direction of the truck road, and the vertical error is the corresponding distance in the vertical direction; both error values are the 95% confidence values after normal fitting. The actual distance was estimated with reference to the pixel size of the wheel hub.
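One plausible reading of "the 95% confidence value after normal fitting" (also used for N_0.95 in the re-evaluation step) is to fit a normal distribution to the observed errors and take the mean plus 1.96 standard deviations. A minimal sketch under that assumption:

```python
import numpy as np

def error_95(errors):
    """Estimate the 95%-confidence error bound by normal fitting.

    Assumption (ours): fit a normal distribution to the errors and
    return mean + 1.96 * std, the one-sided 95% point.
    """
    errors = np.asarray(errors, dtype=float)
    return errors.mean() + 1.96 * errors.std()
```

The exact fitting procedure in the paper may differ; this only illustrates how such a bound is typically derived.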

Evaluation of Wheel Hub Tracking.
The tracking algorithm evaluation used several videos of trucks passing through the camera area at normal speed and several videos of trucks during container-lifting operations. The former were used to test the tracking of the horizontal displacement of the truck, and the latter to test the vertical displacement. These videos were taken under normal light conditions during the daytime and at night, and the tracking results are shown in Figures 9 and 10. Table 2 lists the performance of the target tracking algorithm. The tracking error is defined as the distance between the detection result and the prediction result; it is the maximum error at the 95% confidence level after normal fitting.

Discussion.
The experimental results showed a detection error of 6.31 pixels (approximately 52 mm in the experimental environment), and the total tracking rate (including the detection time) reached 10 fps with an average of 2.5 tires per image. Because the maximum vertical displacement in the container-lifting operation was approximately 100 mm, the detection accuracy of this system meets the requirements of truck-lifting prevention. However, the experimental results also revealed some issues.
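The margin stated above can be checked with simple arithmetic: 6.31 px corresponds to about 52 mm, giving roughly 8.2 mm per pixel at this working distance, while the motion to be detected is about 100 mm. The constants and function names below are derived from the reported figures, not from the authors' code:

```python
# Back-of-envelope check of the detection margin reported above.
MM_PER_PX = 52.0 / 6.31  # ~8.24 mm per pixel in this experimental setup

def px_to_mm(px):
    """Convert a pixel displacement to millimetres at the working distance."""
    return px * MM_PER_PX

def detectable(displacement_mm, error_mm=52.0):
    """A displacement is reliably detectable only if it exceeds the
    95%-confidence measurement error."""
    return displacement_mm > error_mm
```

Since 100 mm comfortably exceeds the 52 mm error bound, a lifting event produces a displacement well outside the measurement noise.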
In the detection experiment, certain detection failures were observed, and these failures were concentrated in the second detection. We observed that they were caused by defaced tires and low-light environments, both of which obscure the details of the wheel hub. In this study, we addressed this problem by processing the pixel values in the HSV space, but the experimental results showed that this is not sufficient. In addition, when a tire appeared at the edge of the image, the error of the second detection increased; this is because the camera lenses exhibit some distortion, which deforms the edge regions of the image.

Conclusion
To solve the problem of automated accident prevention in container-lifting operations, this study designed a vision-based truck-lifting prevention system that calculates the displacement of the truck wheel hubs to determine whether the truck is being lifted. The experiments showed that the detection accuracy of this system reaches 6.31 pixels and the average processing rate is 10 fps, which is sufficient to detect a truck-lifting accident in time.
However, certain limitations were also observed. We believe that an algorithm for extracting contour characteristics from tire images under defaced and low-light conditions should be explored. Considering that convolutional neural networks are insensitive to different defacement and light conditions, it may be possible to use convolution operations to extract detailed information from the image and thus avoid the interference of light and defacement. However, complex calculations would increase the calculation time and reduce the efficiency of the system; therefore, this trade-off needs to be resolved.

Data Availability

The experiment data used to support the findings of this study have been deposited in the Google Drive repository (https://drive.google.com/file/d/1mqZrmlnOMwxeLsM9pBxItZ_jMj4qsRrV/view?usp=sharing).

Conflicts of Interest
The authors declare that they have no conflicts of interest.