Extracting Vessel Speed Based on Machine Learning and Drone Images during Ship Traffic Flow Prediction

In water transportation, ship speed estimation has become a key subject of intelligent shipping research. Traditionally, the Automatic Identification System (AIS) is used to obtain ship speed information. However, the transportation environment is becoming increasingly complex, especially in busy waters, which leads to the loss of some AIS data and contributes to a variety of maritime accidents. To make up for this deficiency, this paper proposes a vessel speed extraction framework based on Unmanned Aerial Vehicle (UAV) airborne video. Firstly, YOLO v4 is employed to detect ship targets in UAV images. Secondly, simple online and real-time tracking with a deep association metric (Deep SORT) is applied to track the ship targets. Finally, the pixel displacement of each ship is computed from the bounding box information of its trajectory, and the ship speed is estimated from the mapping relationship between image space and real space. Experiments are conducted on various scenarios. The results verify that the proposed framework performs well, with an average speed measurement accuracy above 93% in complex waters. This paper also paves the way for further prediction of ship traffic flow in water transportation.


Introduction
In recent years, economic development has driven rapid growth of the shipping industry, while the shipping system [1,2] has gradually become intelligent with the help of computer technology. To further ensure the safety of ship navigation, ship speed estimation has become an important topic in the study of intelligent shipping, and the popularity of the ship Automatic Identification System (AIS) [3] makes it one of the main channels for obtaining ship speed information. However, the rapid development of shipping makes the maritime navigation environment more complex [4], and AIS signals are prone to interference, resulting in data loss. Therefore, it is of great significance to propose an auxiliary method for real-time and accurate estimation of ship speed.
At present, ship speed estimation technologies besides AIS include radar technology [5], laser technology [6], sonar technology [7,8], video-based technology [9], and so on. Among them, infrared technology is susceptible to weather. Radar and laser meters can avoid this problem, but they are usually more expensive and require frequent maintenance. Sonar technology takes advantage of low-cost reflection characteristics for velocity measurement, but the measurement time is quite long and the effective distance is small. In contrast, video-based technology does not require complex and expensive hardware. It is not only easy to install and maintain, but also has a larger monitoring range. With the development of computer vision technology in recent years, this method is becoming increasingly popular in intelligent monitoring systems.
Most video-based speed estimation systems are based on fixed cameras [10]. Reference [11] proposed a vehicle speed estimation method for extracting traffic speed information. The authors used deep learning detection and tracking algorithms to capture the vehicle position in the image, and then calculated the relationship between image space and the real world to estimate the vehicle speed. In [12], a novel two-camera-based vehicle speed framework was presented, in which the speed of a vehicle was estimated by calculating the specific geometric relationship between the two camera spaces. Such fixed-camera methods have a limited monitoring range and might not be cost-effective on a busy road segment.
Furthermore, based on UAV video, reference [13] proposed an average speed measurement method in which optical flow vectors are clustered to calculate the average traffic flow speed. In reference [14], a speed estimation method was designed by searching for correspondences between Scale-Invariant Feature Transform (SIFT) features in video frames. These methods have achieved relatively good results in road traffic speed measurement. In addition, some methods [15,16] for measuring ship speed were developed based on the ship motion characteristics in synthetic aperture radar (SAR) images. Reference [17] proposed a method to estimate ship speed based on the relationship between the wavelengths that constitute the Kelvin mode and the ship speed. It can estimate the ship speed accurately when the main features and the turbulent wakes of a ship are clearly displayed in SAR images. However, the applicability of this method may be limited in the case of ship displacement.
In the above studies, UAV remote sensing videos generally have the advantages of high resolution and a large monitoring range [18], and are suitable for estimating ship speed in waterway traffic. However, because ships move slowly, their motion trends are fuzzy and their navigation trajectories are complex [19,20]. There are few studies on ship speed estimation based on UAV remote sensing images, and the real-time performance and continuity of existing methods need to be improved.
To make up for the shortcomings of existing ship speed measurement technology, a multi-ship speed measurement method based on UAV videos is proposed, building on previous research. In our framework, a UAV dataset containing 18,655 images is built, deep learning detection and tracking algorithms are used to obtain a series of ship positions in the images, and the transformation relationship between image space and real space is calculated to provide the basis for ship velocity measurement.

Methodology
2.1. The Framework. The framework contains three parts: the first part is ship detection, the second part is ship tracking, and the third part contains camera calibration, ship trajectory extraction, and ship speed estimation, as shown in Figure 1.
In the first part, images of ships are captured with the UAV camera in a variety of environments. Then, the image annotation tool labelImg is used to annotate the ground truth boxes of the ships, and thus a dataset is formed. Considering that there are many small ships in the images, the YOLO v4 algorithm is employed to train the model. When the loss function has dropped and stabilized, the weights and biases are saved, and the ship detection model is obtained. Finally, the detection model is used to detect different ships and obtain a series of ship targets.
In the second part, the Kalman filter algorithm is first used to predict and update the tracks according to the bounding box information in the previous frame. Then, Intersection-over-Union (IoU) matching and the matching cascade are used to associate the tracks with the bounding boxes in the current frame. Finally, ship tracking is achieved and stable ship positions in the image are obtained.
In the third part, based on the track information obtained in part two, the ship trajectories are extracted by calculating the pixel difference between the tracks in the previous and current frames. Then, the camera intrinsic parameters and the mapping relationship matrix are obtained by camera calibration. Finally, by combining the mapping relationship with an average speed formula, the real distance and the ship speed can be estimated.

Ship Detection.
To detect the ship targets [21,22] in the drone video stably, we focus on deep learning detection algorithms. Because of the high position of the UAV, the drone video has a wide field of view without any target occlusion. However, many small ships sail in the waters, the background in the image is complex, and motion blur is caused by the vibration of the UAV [23]. Therefore, choosing a detection algorithm with high accuracy is critical. The detection algorithms proposed so far are mainly divided into two categories: two-stage algorithms and one-stage algorithms. Representative two-stage algorithms include SPPNet [24] and Faster R-CNN [25], while YOLO [26] and SSD [27] are commonly used one-stage algorithms. The advantage of two-stage algorithms is excellent accuracy, but their computing cost is high for the many devices with limited storage and computing capacity. In contrast, one-stage algorithms offer the best current trade-off between detection speed and accuracy. As a one-stage algorithm, YOLO v4 [26] has obvious advantages in accuracy and efficiency for small targets. Therefore, YOLO v4 is selected for ship detection in this paper. It consists of 5 parts: input, backbone, neck, prediction, and output, as shown in Figure 2.
To achieve a better balance among the input image resolution, the number of convolutional layers, and the number of hyperparameters, CSPDarknet53 is employed as the backbone of YOLO v4. The CSPDarknet53 network consists of 53 convolutional layers and 23 residual blocks and is formed by applying the Cross-Stage-Partial (CSP) strategy to Darknet53. After the image passes through CSPDarknet53, a series of feature maps at different levels is obtained. The neck is composed of the spatial pyramid pooling (SPP) module and the path aggregation network (PANet). The SPP module extracts spatial feature information at different levels by increasing the receptive field and separates the most important contextual features. PANet introduces a bottom-up path to make low-layer information easier to propagate to the top layers. In the prediction stage, multiclass classification and bounding box regression are applied at multiple scales.
To measure the difference between the prediction and the ground truth and to optimize the weights, the value of the loss function needs to be calculated. The loss function in YOLO v4 comprises the bounding box regression loss, the classification loss, and the confidence loss. In bounding box regression, the mean square error (MSE) method is replaced by the Complete-Intersection-over-Union (CIoU) algorithm, which takes the overlapping area, the center-point distance, and the aspect ratio into account simultaneously. Thus, better convergence speed and accuracy can be obtained when training the model. Finally, the fast non-maximum suppression (Fast NMS) method is used to retain the local maximum of the confidence score and obtain the final result.
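The CIoU regression term mentioned above can be illustrated with a minimal sketch. It follows the commonly published CIoU definition (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty); the function names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, not the training code used in this paper.

```python
import math


def ciou(box_a, box_b):
    """Complete IoU of two boxes given as (x1, y1, x2, y2) with positive size."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Overlapping area and plain IoU.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    iou = inter / (wa * ha + wb * hb - inter)

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v) if v > 0 else 0.0

    return iou - rho2 / c2 - alpha * v


def ciou_loss(pred_box, true_box):
    """Regression loss: 0 for a perfect match, larger for worse boxes."""
    return 1.0 - ciou(pred_box, true_box)
```

For two identical boxes the loss is zero; for boxes that overlap only partially, the center-distance term adds a penalty even when the aspect ratios agree.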

Ship Tracking.
Based on the bounding boxes predicted in the detection stage, the Deep SORT [28] tracking algorithm is used to track the ship targets in the UAV images. Deep SORT originates from SORT, which uses a Kalman filter and the Hungarian algorithm to handle the motion prediction and data association problems [29]. Deep SORT consists of four parts: Deep Appearance Descriptor, Kalman filter, Matching Cascade, and Matching Intersection-over-Union, as shown in Figure 3.
In the first part, the ReID network is employed to extract the appearance feature of the target. The ReID architecture is shown in Table 1. It should be noted that the global feature map with 128 dimensions is calculated in dense layer 10 and normalized by L2 normalization at the last layer.
In part two, according to the motion state of the ship targets in the previous frame, the positions of the targets in the current frame are predicted by the Kalman filter using a linear constant-velocity model with the state $(x, y, r, h, \dot{x}, \dot{y}, \dot{r}, \dot{h})$. The state contains the observed variables $(x, y, r, h)$, that is, the bounding box center position $(x, y)$, the aspect ratio $r$, and the height $h$, together with their respective velocities.
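The prediction step of such a constant-velocity Kalman filter can be sketched in a few lines. This is a minimal pure-Python illustration, assuming an 8-dimensional state ordered as position terms followed by velocity terms and a time step of one frame; the helper names and the process noise `Q` are hypothetical and are not taken from the Deep SORT implementation used in the experiments.

```python
def constant_velocity_matrix(dt=1.0, dim=4):
    """State transition F for (x, y, r, h) plus their velocities."""
    n = 2 * dim
    F = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(dim):
        F[i][i + dim] = dt  # position_i += dt * velocity_i
    return F


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def kalman_predict(mean, cov, F, Q):
    """Predict step: mean' = F mean, cov' = F cov F^T + Q."""
    new_mean = [sum(f * m for f, m in zip(row, mean)) for row in F]
    new_cov = matmul(matmul(F, cov), transpose(F))
    new_cov = [[c + q for c, q in zip(row_c, row_q)]
               for row_c, row_q in zip(new_cov, Q)]
    return new_mean, new_cov


# Example: a box centered at (100, 50) moving 3 px right and 1 px up per frame.
F = constant_velocity_matrix(dt=1.0)
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
zeros = [[0.0] * 8 for _ in range(8)]
mean, cov = kalman_predict([100, 50, 2, 20, 3, -1, 0, 0.5], identity, F, zeros)
```

The predicted mean moves each observed variable by its velocity, while the covariance grows, which reflects the increased uncertainty before the next detection is associated.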
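In part three, the matching cascade fuses target motion and appearance information to decrease the uncertainty of the Kalman filter when the target is occluded. The motion metric $d^{(1)}(i,j)$ is the squared Mahalanobis distance between the predicted state and the detection:

$$d^{(1)}(i,j) = (d_j - y_i)^{\mathsf{T}} S_i^{-1} (d_j - y_i)$$

where $d_j$ denotes the motion state of the $j$-th detection box, $y_i$ represents the predicted motion state of the $i$-th tracker, and $S_i$ is the covariance matrix in the observation space. The appearance metric $d^{(2)}(i,j)$ measures the smallest cosine distance between the $i$-th tracker and the $j$-th detection bounding box in the appearance space, as shown in the following formula:

$$d^{(2)}(i,j) = \min\left\{ 1 - r_j^{\mathsf{T}} r_k^{(i)} \;\middle|\; r_k^{(i)} \in R_i \right\}$$

where $r_j$ is the appearance feature descriptor with $\lVert r_j \rVert = 1$, and $R_i$ is a gallery that stores a series of feature descriptors for the $i$-th tracker. The two metrics are combined into the correlation metric $c_{i,j}$:

$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1 - \lambda)\, d^{(2)}(i,j)$$

where $\lambda$ is the weight coefficient.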
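As a small illustration of how the motion and appearance terms of the matching cascade combine, the sketch below computes a squared Mahalanobis distance, a minimum cosine distance over a gallery, and their weighted sum. The function names, the toy inputs, and the value of the weight are assumptions for illustration only; the covariance inverse is passed in directly to keep the example free of matrix inversion code.

```python
def mahalanobis_sq(detection, predicted, cov_inv):
    """Squared Mahalanobis distance (d - y)^T S^-1 (d - y)."""
    diff = [d - y for d, y in zip(detection, predicted)]
    weighted = [sum(c * e for c, e in zip(row, diff)) for row in cov_inv]
    return sum(e * w for e, w in zip(diff, weighted))


def min_cosine_distance(det_feature, gallery):
    """Smallest cosine distance to any stored descriptor.

    All descriptors are assumed to be L2-normalized, so the dot product
    equals the cosine similarity.
    """
    return min(1.0 - sum(a * b for a, b in zip(det_feature, g))
               for g in gallery)


def combined_cost(motion_dist, appearance_dist, lam):
    """Weighted sum of the motion and appearance metrics."""
    return lam * motion_dist + (1.0 - lam) * appearance_dist


# Toy example with a 2-D measurement and 2-D appearance descriptors.
d1 = mahalanobis_sq([2.0, 0.0], [0.0, 0.0], [[0.25, 0.0], [0.0, 1.0]])
d2 = min_cosine_distance([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
cost = combined_cost(d1, d2, lam=0.5)
```

A small weight lets the appearance term dominate the association, which is useful when the motion prediction is unreliable, for example after a long occlusion.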
In the last part, IoU matching computes the intersection over union between the detection bounding box and the box predicted by the Kalman filter; it is applied when there is no occlusion or when a new target appears. A detection is associated with a trajectory if the matching value is greater than $IoU_{min}$, and the mathematical formula is

$$IoU(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ is the detection bounding box, and $B$ represents the prediction bounding box of the candidate trajectory.
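The IoU gate described above can be sketched as follows. The threshold value of 0.3 and the helper names are assumptions chosen for illustration; the source does not state the value of $IoU_{min}$, and a full tracker would pass the surviving pairs to an assignment step rather than return them directly.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def gate_matches(detections, predictions, iou_min=0.3):
    """List (detection index, prediction index, IoU) pairs above the gate.

    iou_min=0.3 is a hypothetical threshold, not a value from the paper.
    """
    pairs = []
    for i, det in enumerate(detections):
        for j, pred in enumerate(predictions):
            score = iou(det, pred)
            if score > iou_min:
                pairs.append((i, j, score))
    return pairs


candidates = gate_matches([(0, 0, 2, 2)], [(0, 0, 2, 2), (10, 10, 12, 12)])
```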

Ship Speed Measurement.
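In this section, we first assume that $T_k$ and $T_{k+FPS}$ are the trajectories in the $k$-th and $(k+FPS)$-th frames. Then, the pixel differences $\Delta u$ and $\Delta v$ along the $x$ and $y$ axes are computed from the center points of the bounding boxes of the trajectories:

$$\Delta u = u_{k+FPS} - u_k, \qquad \Delta v = v_{k+FPS} - v_k$$

where $(u_k, v_k)$ is the bounding box center of $T_k$ in pixel coordinates.
Among camera models, keyhole (pinhole) imaging is relatively simple and commonly used. The keyhole imaging principle is shown in Figure 4, and the related formula is as follows:

$$\frac{1}{f} = \frac{1}{u} + \frac{1}{v}$$

where $f$ represents the focal length, $v$ is the image distance, and $u$ is the object distance (in this formula only; elsewhere $u$ and $v$ denote pixel coordinates). The coordinate systems involved in solving the keyhole imaging parameters include the world coordinate system, the camera coordinate system, the image coordinate system, and the pixel coordinate system. The coordinate definition of points in each coordinate system and the relationships between the coordinate systems are as follows:
(1) World coordinate system. An absolute coordinate system in the three-dimensional world. The position of the object is expressed as $(X_w, Y_w, Z_w)$.
(2) Camera coordinate system. The position of the object is expressed as $(X_c, Y_c, Z_c)$.
(3) Image coordinate system. The position of the image point is expressed as $(x, y)$.
(4) Pixel coordinate system. The position of the image point is expressed as $(u, v)$.
The relationship among the image coordinate system, the camera coordinate system, and the world coordinate system is shown in Figure 5. The conversion from the world coordinate system to the camera coordinate system is

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + t$$

where $R$ is an orthogonal rotation matrix and $t$ is a three-dimensional translation vector. The conversion formula between the image coordinate system and the camera coordinate system is as follows:

$$x = f\,\frac{X_c}{Z_c}, \qquad y = f\,\frac{Y_c}{Z_c}$$

The above relationship is represented in matrix form:

$$Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}$$

The relationship between the image coordinate system and the pixel coordinate system is shown in Figure 6.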
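The transformation between the image coordinate system and the pixel coordinate system is

$$u = \frac{x}{d_x} + u_0, \qquad v = \frac{y}{d_y} + v_0$$

where $(u_0, v_0)$ is the origin of the image coordinate system expressed in the pixel coordinate system, and $d_x$ and $d_y$ are the scale factors between the two coordinate systems in the $x$ and $y$ directions. The above formula is represented in matrix form:

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

Combining the transformations between the coordinate systems gives

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K M \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}, \qquad K = \begin{bmatrix} a_x & 0 & u_0 \\ 0 & a_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad M = \begin{bmatrix} R & t \end{bmatrix}$$

with $a_x = f/d_x$ and $a_y = f/d_y$.
Figure 5: The relationship diagram among the image, the camera, and the world coordinate system.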
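Here, $(a_x, a_y, u_0, v_0)$ are the intrinsic parameters and $K$ is called the intrinsic parameter matrix; $M$ is called the camera's extrinsic parameter matrix, and $R$ and $t$ are the extrinsic parameters. Based on the value of $(\Delta u, \Delta v)$, the mapping relationship between two-dimensional and three-dimensional space is calculated by camera calibration [30]. Since there is no rotation or translation, the equation of the spatial mapping relationship is as follows:

$$Z_C \begin{bmatrix} \Delta u \\ \Delta v \\ 0 \end{bmatrix} = K \begin{bmatrix} \Delta X_w \\ \Delta Y_w \\ \Delta Z_w \end{bmatrix}$$

that is, $\Delta X_w = Z_C\,\Delta u / a_x$ and $\Delta Y_w = Z_C\,\Delta v / a_y$, with $\Delta Z_w = 0$ for a ship that stays at a constant distance from the camera. Here $Z_C$ is the distance from the camera to the object, $K$ is the conversion matrix, and $(\Delta X_w, \Delta Y_w, \Delta Z_w)$ denotes the displacement in real space. The speed of the ship can then be estimated according to the following equation:

$$\Delta L_i^{(t+FPS)} = \sqrt{\Delta X_w^2 + \Delta Y_w^2}, \qquad v_i^{(t+FPS)} = \frac{\Delta L_i^{(t+FPS)}}{\Delta t}$$

where $\Delta L_i^{(t+FPS)}$ and $v_i^{(t+FPS)}$ represent the real distance and the ship speed in frame $t + FPS$, respectively, and $\Delta t$ is the time interval between the two frames. The ship speed estimation process is shown in Figure 7.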
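The mapping from a pixel displacement to a real-world speed can be sketched as below, assuming a camera that looks straight down with no rotation or translation, so that only the intrinsic scale terms and the camera-to-object distance matter. The function names and the numeric values in the example are hypothetical and serve only to show the arithmetic.

```python
import math


def pixel_to_world_displacement(du, dv, a_x, a_y, z_c):
    """Convert a pixel displacement to metres for a fronto-parallel scene.

    Uses dX = Z_c * du / a_x and dY = Z_c * dv / a_y, where a_x and a_y are
    the focal length expressed in pixels along each axis.
    """
    return z_c * du / a_x, z_c * dv / a_y


def speed_mps(du, dv, a_x, a_y, z_c, dt):
    """Average speed in metres per second over the interval dt (seconds)."""
    dx, dy = pixel_to_world_displacement(du, dv, a_x, a_y, z_c)
    return math.hypot(dx, dy) / dt


# Hypothetical numbers: focal length of 1000 px, camera 100 m from the ship.
example_speed = speed_mps(du=30, dv=40, a_x=1000, a_y=1000, z_c=100, dt=1.0)
metres_per_pixel = 100 / 1000  # Z_c / a_x, the ground size of one pixel
```

In this setting the ratio $Z_C / a_x$ plays the role of a ground sampling distance, that is, the number of metres covered by one pixel.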

Procedure
In this part, we introduce our dataset for ship detection in airborne video and the parameter values we set. Our dataset originates from UAV video (FPS equal to 23.98) taken by a DJI Genie 4 Pro in Shanghai, China. The dataset comprises 18,655 images, including three scenes with four shooting angles and two lighting conditions, and the image resolution is 3840 × 2160. The UAV flying height and focal length are 500 m and 24 mm, respectively. Dataset images and a schematic diagram of the shooting angles are shown in Figure 8. Some images of our dataset have been published on the website: https://github.com/NZII. The experiments are run on a computer with an Intel i7-11800H @ 4.6 GHz processor and 6 GB of memory.
For ship detection and tracking, the YOLO v4 and Deep SORT algorithms are used, and the parameters of the original models are retained to accelerate the convergence of our model. Then, according to the video information, the interval of speed measurement is set to 1 s. Next, according to the UAV flying height, the transformation matrix K is obtained, as shown in Equation (25); that is, 1 pixel in the image is approximately equal to 0.205 m in the real world. Finally, the ship speed is estimated.
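With the scale of about 0.205 m per pixel and the 1 s measurement interval stated above, the speed computation reduces to simple arithmetic on bounding box centers. The sketch below is an illustration under those two stated values; the function names and the example boxes are hypothetical.

```python
import math

METRES_PER_PIXEL = 0.205          # scale reported in the text
KNOT_IN_MPS = 1852.0 / 3600.0     # one knot is 1852 m per hour


def box_center(box):
    """Center of an (x1, y1, x2, y2) bounding box in pixels."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def speed_knots(box_prev, box_curr, interval_s=1.0):
    """Speed from two boxes of the same track taken interval_s apart."""
    u0, v0 = box_center(box_prev)
    u1, v1 = box_center(box_curr)
    pixels = math.hypot(u1 - u0, v1 - v0)
    metres = pixels * METRES_PER_PIXEL
    return metres / interval_s / KNOT_IN_MPS


# Hypothetical track: the box center moves 20 px in one second,
# which is 4.1 m/s or roughly 8 knots.
knots = speed_knots((100, 100, 140, 120), (112, 116, 152, 136))
```

Because the result depends directly on the box center, jitter in the detected box of even one or two pixels per second translates into a speed fluctuation of a few tenths of a knot at this scale.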
To verify the performance of our system, a quality assessment experiment and a method comparison experiment are conducted. In these experiments, the ship speeds estimated by our method are compared with the ship AIS data, and our method achieves good accuracy.

Multiscenario Detection and Tracking.
In this section, the detection and tracking performance of our method is evaluated on 3 scenes with a resolution of 3840 × 2160, shown in Figure 9. These scenes contain 3 shooting angles and 3 brightness conditions.
Scene 1 is taken from video #1 with 45° and 60° shooting angles and is used to test detection and tracking stability; it has a grey sea-surface background, bright illumination, and ships at different positions in the image. Scene 2 is taken from video #2 with a 90° shooting angle and is used to verify detection and tracking accuracy in general conditions; its background and illumination are similar to those of scene 1, but an overexposed area appears at the bottom of the image. Some ships pass through the overexposed area, while others do not.
Scene 3, with a dark background and an overexposed area at the bottom of the image, is constructed from video #3 with a 90° shooting angle. It is used to test the accuracy of detection and tracking in dark conditions.
The detection and tracking results are shown in Figure 9. The three scenes include many ships of three sizes (small, medium, and large) and different kinds of ship distributions. In scene 1, our detection model achieves excellent accuracy because of the Mosaic data augmentation of the YOLO v4 network, which especially improves the detection of small targets in the image. Frames #1272 and #1680 are taken at two shooting angles (45° and 60°). The Kalman filter algorithm and the Hungarian algorithm in Deep SORT can predict and match detection boxes in different directions; thus, even if the angle changes while the ship is moving, the ship position can still be obtained.
In scene 2, the trajectories of different ships can be clearly distinguished, and both the shooting angle and the illumination conditions are favorable. Owing to the CSP strategy in the YOLO v4 network, ship feature maps at different levels are obtained. Besides, Kalman filtering is well suited to linear prediction scenarios, which gives our model better robustness and accuracy in good conditions. In scene 3, the illumination is dark, which affects the detection accuracy and tracking stability. The SPP and PANet modules in YOLO v4 separate the most important contextual features and make low-layer information easier to propagate to the top layers. Furthermore, the appearance information in Deep SORT decreases the uncertainty of the Kalman filter when the target is missed for a long time due to bad environmental conditions. This experiment shows that our method has excellent detection and tracking accuracy, which is crucial for the next step of calculating displacement vectors and extracting speed.

Multi-Ship Speed Quality Analysis and Method Comparison.
In this part, AIS data is taken as the true value of ship speed, and scenario 3 shown in Figure 9 is selected. AIS data of the Baosteel terminal on April 5, 2021 is downloaded from the website: http://www.shipxy.com. Because AIS data is a discrete spatiotemporal sequence, it cannot be matched directly with the continuous ship speed values estimated by our method. Therefore, based on the FPS of the drone video, linear interpolation is employed to make the AIS data continuous. Then, two factors, the sailing position and the illumination condition, are considered in the measurement of ship speed, and HUA QING and HUAI CHANG in scenario 3 are eventually selected.
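The linear interpolation of the sparse AIS speed reports onto the measurement times can be sketched as follows. This is a minimal pure-Python illustration; the function name, the clamping behavior outside the reported time range, and the sample values are assumptions, not the authors' preprocessing code.

```python
import bisect


def interpolate_ais(times, speeds, query_times):
    """Piecewise-linear interpolation of AIS speeds.

    times must be sorted in ascending order; queries outside the reported
    range are clamped to the first or last reported speed.
    """
    result = []
    for t in query_times:
        if t <= times[0]:
            result.append(speeds[0])
        elif t >= times[-1]:
            result.append(speeds[-1])
        else:
            hi = bisect.bisect_right(times, t)  # first report after t
            lo = hi - 1
            weight = (t - times[lo]) / (times[hi] - times[lo])
            result.append(speeds[lo] + weight * (speeds[hi] - speeds[lo]))
    return result


# Hypothetical AIS reports every 10 s, resampled at selected times.
ais_times = [0.0, 10.0, 20.0]
ais_speeds = [5.0, 7.0, 6.0]
resampled = interpolate_ais(ais_times, ais_speeds, [0.0, 5.0, 10.0, 15.0, 25.0])
```

Resampling at the same interval as the video-based measurements gives one AIS reference value per estimated speed, so the two series can be compared point by point.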
Finally, with AIS information set as the standard, our method is compared with the visual line method. The speed comparison results and error curves of the three ships are shown in Figure 10. Moreover, the mean absolute error (MAE) and mean squared error (MSE) are introduced to evaluate our velocimetry method comprehensively. The final evaluation results are shown in Table 2. The evaluation indicators are calculated as follows:

$$MAE = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|, \qquad MSE = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

where $n$ is the number of speed measurements, $y_i$ is the value of ship speed estimated by our system, and $\hat{y}_i$ is the value of ship speed in the AIS data. As shown in the curves of HUA QING, the ship moves at a stable speed. The ship speed measured by the visual line method is quite different from the true value: the MAE is 0.91 Kn and the MSE is 2.0 Kn. The ship speed estimated by our method is relatively close to the true value and shows some reasonable fluctuation: the MAE is 0.59 Kn and the MSE is 1.25 Kn.
For the speed curves of HUAI CHANG, the accuracy of the visual line method improves greatly: the MAE is 0.41 Kn and the MSE is 0.40 Kn. Our method still maintains better performance, with an MAE of 0.31 Kn and an MSE of 0.34 Kn.
Overall, the performance of the visual line method is unstable in ship speed measurement. The main reason is that the visual line method can only estimate an average speed over a period of time, which cannot represent real-time speed information and easily causes a global error. In contrast, our method provides real-time ship speed information with relatively good accuracy. Although our method relies on the bounding box information, which may lead to some fluctuations and affect the ship speed accuracy, ship speed data can be obtained in real time, and the average accuracy is above 93% in the complex scenario.
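The two evaluation indicators used in this comparison are straightforward to compute once the estimated speeds and the interpolated AIS speeds are aligned. The sketch below uses made-up sample values purely to show the calculation; they are not measurements from the experiments.

```python
def mean_absolute_error(estimated, reference):
    """Average absolute difference between estimates and reference values."""
    return sum(abs(y - r) for y, r in zip(estimated, reference)) / len(estimated)


def mean_squared_error(estimated, reference):
    """Average squared difference between estimates and reference values."""
    return sum((y - r) ** 2 for y, r in zip(estimated, reference)) / len(estimated)


# Hypothetical speeds in knots: system estimates versus AIS reference.
estimated = [7.0, 8.0, 9.0]
ais_reference = [7.5, 8.0, 8.0]
mae = mean_absolute_error(estimated, ais_reference)
mse = mean_squared_error(estimated, ais_reference)
```

Because the MSE squares each error, a single large deviation raises it much more than it raises the MAE, which is why the two indicators are usually reported together.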

Conclusions
In this paper, we propose a method for extracting ship speed by combining machine vision algorithms and UAV images. In the detection and tracking experiments, we take the shooting angles and lighting conditions into account to test the robustness of the model. The detection and tracking results show that our model still has good detection accuracy in dense scenes with poor lighting, and provides the basis for subsequent speed measurements. In the speed measurement experiments, where AIS is set as the true value, we compare our method with the virtual line method by selecting three ships with different sizes in the same scene. The results show that our method has obvious advantages in terms of accuracy, stability, and applicable scenarios. In particular, the speed accuracy is above

Data Availability
The data used to support the findings of this study are included within the article, and some images of the dataset are available at https://github.com/NZII.

Conflicts of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.