Visual Attention and Motion Estimation-Based Video Retargeting for Medical Data Security



Introduction
With the rapid development of high-tech medical imaging [1][2][3], blockchain technology [4], artificial intelligence [5], the Internet of Things (IoT) [6], and 5G networks [7], intelligent medical systems [8] and intelligent diagnosis [9] are becoming more and more popular. However, data security threats [10] make protecting the security of medical data an urgent problem. The volume of medical video data is greater than that of typical data, which makes medical data security methods, such as data encryption [11] and integrity detection [12], time-consuming to execute. Video retargeting [13] can greatly reduce the storage size of video data while preserving the original content as much as possible. Medical video retargeting can therefore produce a smaller volume of medical data, reduce the execution time of data encryption and threat detection algorithms, and improve the performance of medical data security methods.
Video retargeting has one more dimension, time, than image retargeting, so it must take into account the correlation between the contents of adjacent frames. By regarding video as a three-dimensional space-time pixel matrix, Rubinstein et al. proposed FSC [15], which finds and deletes pixel seams common to adjacent frames to eliminate content jitter. NCV [17], proposed by Wolf et al., combines the gradient map, face detection, and foreground motion to produce an importance map and then uses mesh deformation to realize video retargeting. Nam et al. [24] proposed a video retargeting method based on Kalman filtering and saliency fusion to reduce video content jitter and enhance the robustness of video retargeting. Wang et al. [25] proposed a multi-operator method based on improved seam carving. Cho and Kang [26] proposed an interpolation-based video retargeting method built on an image deformation vector network, which performs interpolation using displacement vectors generated by a convolutional neural network. Kaur et al. [27] proposed a spatiotemporal seam carving video retargeting method based on Kalman filtering. Existing video retargeting methods mainly focus on the pixel relationships and foreground motion between adjacent frames, aiming to preserve the shape of important content during retargeting. However, they consider neither the user's attention to the video content nor the impact of background motion on retargeting, resulting in serious deformation of important content or poor-quality retargeting results. Furthermore, the human visual system can quickly find the required information in a visual scene and direct visual attention to its focus [28].
Consequently, besides moving objects and important targets, the focus of attention also includes areas where change is about to happen, such as the place where the sun will rise before sunrise, the place where actors will appear on stage before a performance, and the direction in which a ball is moving. This paper makes full use of the user's eye tracking data and the motion information of both the background and foreground in the video and proposes a video retargeting method based on visual attention and motion estimation to reduce the deformation of important areas. Firstly, clustering is carried out on the eye tracking data to generate the visual attention energy map. Then, the motion estimation map is obtained from the corresponding feature points of the foreground and background between adjacent frames. Thirdly, the importance map is generated by composing the visual attention energy map, motion estimation energy map, and gradient map. Finally, video retargeting is performed by mesh deformation. The proposed method exploits the attention behavior of the human visual system and the motion of content in the video, so the retargeting result is more in line with viewers' visual requirements. Experimental results on public datasets show that the proposed method outperforms the compared methods in protecting important areas and reducing salient object jitter.

Proposed Method
As shown in Figure 1, the framework of the proposed VAMEVR (visual attention and motion estimation-based video retargeting) method mainly includes visual attention data clustering, saliency detection, SIFT feature detection, motion estimation, and mesh deformation.

Visual Attention.
In a video, the areas attended to by the human visual system are usually regarded as important areas. These areas should be given increased energy to reduce their deformation in the retargeting process. In this paper, eye tracking data are utilized as the basis of visual attention and abstracted into visual foci.
Then, visual attention energy is generated according to the visual foci.

Visual Attention Focus.
This paper takes the eye tracking data of the DAVSOD [29] dataset as a demonstration. As shown in Figure 2, the eye tracking data exist in the form of discrete points. Through observation, it is found that most eye tracking data points form two clusters.
In this paper, the K-means method [30] is utilized to cluster the eye tracking data points into 2 groups. The center of each group is taken as a visual focus. Firstly, we randomly select 2 data points as the initial cluster centroids. Secondly, we divide the data points into 2 mutually exclusive clusters according to the Euclidean distance from each point to the centroids. Thirdly, the average position of each cluster is computed as the new cluster centroid. Then, steps 2 and 3 are repeated until the centroid positions no longer vary. An example of the clustering result is presented in Figure 3. Figure 3(a) shows the original frame. Figure 3(b) shows the eye tracking data and the clustering result: the white points are the eye tracking data, and the two red points are the centers of the two clusters. Figure 3(c) shows the visual attention energy map.
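The clustering steps above can be sketched as follows. This is an illustrative implementation, not the paper's MATLAB code; the function name, random seed, and convergence check are our own choices.

```python
import numpy as np

def kmeans_2(points, seed=0, max_iter=100):
    """Cluster 2-D eye-tracking points into k=2 groups; return the 2 centroids."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    # Step 1: randomly select 2 data points as the initial cluster centroids.
    centroids = pts[rng.choice(len(pts), size=2, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance).
        d = np.linalg.norm(pts[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean position of its cluster.
        new_centroids = np.array([pts[labels == k].mean(axis=0) for k in (0, 1)])
        # Repeat steps 2 and 3 until the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids
```

Each returned centroid corresponds to one visual focus of the frame.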

Visual Attention Energy.
Visual attention energy indicates the attention of the human visual system to important positions in the image. The greater the energy, the higher the attention, and vice versa.
The two cluster centroids described in Section 2.1.1 are denoted as P1(x1, y1) and P2(x2, y2). The distances from each pixel of the frame to P1 and P2 are denoted r1 and r2, respectively. Then, the visual attention energy e(xi, yj) of each pixel position in the frame is defined in terms of r1 and r2, where W and H are the width and height of the video frame, respectively. The generated energy map is shown in Figure 3: Figure 3(c) shows the visual attention energy map generated from the clustering result of the eye tracking data shown in Figure 3(b).
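The text does not reproduce the closed-form expression for e(xi, yj), so the sketch below uses an assumed form in which energy decays linearly with the distance to the nearer of the two foci, normalized by the frame diagonal so values stay in [0, 1]. The function name and the exact falloff are illustrative assumptions, not the paper's equation.

```python
import numpy as np

def attention_energy(W, H, p1, p2):
    """Illustrative visual-attention energy map for a W x H frame.
    p1, p2 are the (x, y) cluster centroids (visual foci).
    ASSUMED form: e = 1 - min(r1, r2) / diagonal, so energy is 1 at a
    focus and decreases with distance to the nearer focus."""
    ys, xs = np.mgrid[0:H, 0:W]
    r1 = np.hypot(xs - p1[0], ys - p1[1])  # distance of each pixel to P1
    r2 = np.hypot(xs - p2[0], ys - p2[1])  # distance of each pixel to P2
    diag = np.hypot(W, H)                  # normalization constant
    return 1.0 - np.minimum(r1, r2) / diag
```

Whatever the exact formula, the map is highest at the two foci and falls off with distance, as in Figure 3(c).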

Motion Estimation.
In a video, the background and foreground are usually both moving, and the moving direction and speed of the background differ from those of the foreground. The human visual system pays greater attention to the direction in which an object is heading. For example, in a tennis video, the direction in which a player runs attracts more attention; in a racing video, the area in front of the car receives more attention.
Between adjacent video frames, the motion distance and direction of the background and foreground can be calculated to predict the motion trajectory of the salient object. Both the current position and the upcoming position of the foreground object are treated as important areas, which protects the visual attention areas, reduces the deformation of these important areas during retargeting, and improves the visual effect of the retargeting results.

Feature Detection.
In the background of a video frame, the mean displacement of the feature points is used as the basis of the moving speed; the same holds for the foreground. The position to be reached by the salient foreground object is estimated according to this moving speed.
Then, both the current position of the foreground and the position estimated to be reached are regarded as important areas.
SIFT (scale-invariant feature transform) [31] is a computer vision algorithm proposed by Lowe to detect regional features in images. The core idea of SIFT is to find extreme points across multiple spatial scales and compute descriptors that are invariant to position, rotation, illumination, and scale. The SIFT algorithm has good robustness, distinctiveness, extensibility, and efficiency.
In this paper, the SIFT algorithm is used to detect the background and foreground motion information between adjacent frames, and the 20 feature points with the highest reliability are selected as the basis for the motion speed calculation. An example of the detected feature points is shown in Figure 4.

Foreground Separation.
In a video frame, the salient object is generally in the foreground area. By saliency detection, the foreground area can be separated from the background. Compared with other algorithms, SSAV [29] obtains clearer and more accurate results. SSAV is mainly composed of a pyramid deconvolution module and a saliency-shift-aware module: the former robustly learns static saliency features, while the latter combines a convolutional long short-term memory network with a saliency-shift-aware attention mechanism. This paper uses the SSAV method to separate the salient foreground object from video frames.

Motion Detection and Estimation.
From the SIFT feature points, we select the n (n = 20) points with the highest reliability as the basis for motion detection and estimation. Concretely, the SIFT feature points contained in the background are recorded as P_bg(x_bg, y_bg), and their number is n_bg. Similarly, the SIFT feature points contained in the foreground are recorded as P_fg(x_fg, y_fg), and their number is n_fg. From frame i to frame i + 1, the average moving speed of the feature points in the background is recorded as V_bg(dx_bg, dy_bg).
Similarly, from frame i to frame i + 1, the average moving speed of the feature points in the foreground is denoted as V_fg(dx_fg, dy_fg).
For a video, the estimated actual motion speed V_act(dx_act, dy_act) of the foreground is defined as the difference between the motion speed of the foreground and that of the background.
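Equations (2)–(5) are not reproduced in the text; a minimal sketch of the described computation, assuming V_bg and V_fg are the per-axis means of matched point displacements, might look like:

```python
import numpy as np

def actual_motion(bg_prev, bg_next, fg_prev, fg_next):
    """Estimate the foreground's actual motion between frame i and i+1.
    Each argument is an (n, 2) array of matched feature-point coordinates.
    V_bg and V_fg are the mean displacements of the background and
    foreground points; V_act = V_fg - V_bg removes the camera
    (background) motion from the foreground's apparent motion."""
    v_bg = np.mean(np.asarray(bg_next, float) - np.asarray(bg_prev, float), axis=0)
    v_fg = np.mean(np.asarray(fg_next, float) - np.asarray(fg_prev, float), axis=0)
    return v_fg - v_bg
```

For example, if the background (camera) shifts by (2, 0) while the foreground points shift by (5, 1), the object's actual motion is (3, 1).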
As shown in Figure 5, after obtaining the saliency map of the current frame, we calculate the edge of the salient region with the Canny [32] method. Then, the edge is shifted by the actual motion speed V_act(dx_act, dy_act) to give the predicted position of the salient object. The polygon surrounding method [33] is used to obtain the polygon circumscribing both the current and predicted object contours. Finally, the area enclosed by this polygon is the important region after motion estimation. The motion estimation energy map is the binary map of this important area, as shown in Figure 5(d).
When the salient object is too small or its features are not obvious, the first n (n = 20) feature points detected by the SIFT algorithm may all lie in the background area. In this situation, the centroid displacement of the salient object detected by SSAV is used directly as the moving speed of the foreground object to predict where the foreground will go. The points in the salient object area are denoted as P*_fg(xc_fg, yc_fg), and their number is m_fg. From frame i to frame i + 1, the motion speed of the foreground's centroid is denoted as V*_fg(dxc_fg, dyc_fg). The actual motion speed V_act(dx_act, dy_act) of the foreground is again the difference between the motion speed of the foreground and that of the background.
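This fallback can be sketched as follows, assuming the saliency detector outputs a binary mask per frame (the function name and mask representation are our own):

```python
import numpy as np

def centroid_speed(mask_prev, mask_next):
    """Fallback motion estimate for when all top-n SIFT points fall in
    the background: the displacement of the saliency mask's centroid
    is used as the foreground's moving speed.
    Masks are binary (H, W) arrays; returns (dx, dy)."""
    def centroid(mask):
        ys, xs = np.nonzero(mask)          # coordinates of salient pixels
        return np.array([xs.mean(), ys.mean()])
    return centroid(mask_next) - centroid(mask_prev)
```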

Importance Map Fusion.
The importance map is the direct basis for retargeting. The visual attention energy map and the motion estimation map obtained in the above steps need to be fused into the importance map.
We denote I_eye as the normalized visual attention energy map, I_grad as the normalized gradient energy map, I_motion as the normalized motion estimation energy map, and I_imp as the importance map. The coefficient w (0 ≤ w ≤ 1) is the weight of the visual attention energy map, relative to the gradient energy map, in the importance map. Then, the importance map I_imp is defined as follows.
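The fusion equation itself is not reproduced in the text. One plausible form consistent with the description (w blends attention against gradient; motion-estimated areas keep high energy) is sketched below; the max-combination with I_motion is our assumption, not the paper's equation (9).

```python
import numpy as np

def fuse_importance(I_eye, I_grad, I_motion, w=0.5):
    """Illustrative importance-map fusion (ASSUMED form).
    w blends the normalized attention map against the normalized
    gradient map; the motion-estimation map is taken as a lower bound
    so predicted object positions keep full energy."""
    assert 0.0 <= w <= 1.0
    blended = w * np.asarray(I_eye) + (1.0 - w) * np.asarray(I_grad)
    return np.maximum(blended, np.asarray(I_motion))
```

This form reproduces the limiting behavior described below: at w = 0 only gradient and motion information remain, and at w = 1 only attention and motion information remain.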
The parameter w (0 ≤ w ≤ 1) determines the influence of the visual attention energy on the importance map. The smaller w is, the smaller the proportion of visual attention energy, and thus the smaller the impact of visual attention on the retargeting results; the larger w is, the greater the proportion of visual attention energy and its impact on the results. When w = 0, the retargeting results reflect only the gradient information and motion estimation information, not the visual attention information. Conversely, when w = 1, the retargeting results reflect only the visual attention information and motion estimation information, not the gradient information.

Mesh Deformation.
This paper uses Wang's method [18] for mesh deformation to realize video retargeting. The input frame is divided into a quadrilateral mesh (V, E, F), where V, E, and F represent the sets of vertices, edges, and quadrilaterals, respectively. Each quadrilateral has a scaling factor s_f, and the average importance energy of each quad is w_f. The quad deformation energy is defined as D_u.
The grid-line bending energy is denoted D_l.
The total energy D is the sum of D_u and D_l.
Wang's method [18] uses an iterative solver for the mesh deformation. In each iteration, the scaling factor s_f of each quad is calculated by local optimization, and then the mesh vertices are updated by global optimization under the constraint of the target image boundary conditions. The iteration terminates when the energy no longer decreases or the displacement of the mesh vertices is less than 0.5. The smoothed scaling factors s_f' are generated by minimizing the following energy.

The Algorithm of the Proposed Method.
The implementation steps of the proposed method are shown in Algorithm 1.

Experimental Environment and Parameter Settings.
To validate the performance of the proposed method, we conduct experiments on a computer with an Intel i7-5500U@2.4 GHz CPU and 16 GB RAM. The proposed method was implemented in MATLAB R2016a on Windows. The number of visual attention data clusters k is set to 2. In the importance map fusion process, the weight w of visual attention is set to 0.1, 0.5, and 0.9 separately.
To illustrate the universality of the proposed method, the public dataset DAVSOD [29] is selected as the experimental input. DAVSOD is a large-scale video salient object dataset that mainly serves the evaluation of video salient object detection and video retargeting. It contains 226 video sequences and 24,000 frames, covering a variety of scenes, object categories, and motion modes, and is annotated strictly according to human eye tracking data.

Experimental Result and Analysis.
We randomly select the "select_0115" and "select_0194" videos of DAVSOD as the experimental input. "select_0115" is a tennis video clip with 105 frames of 640 × 360 pixels each. "select_0194" is a motorcycle race video clip with 133 frames, each also 640 × 360 pixels. In both videos, the camera moves during shooting, that is, the background is moving.
The experimental results are shown in Figures 6 and 7, from which we can see that the deformation of the salient area is small; in particular, the area in the direction the object moves toward is well protected. Concretely, as shown in Figure 7(d), the region the tennis ball is moving toward in frame "0145" shows smaller deformation, and so does the area in front of the motorcycle in frame "0818". The main reason is that visual attention and motion estimation assign high energy to these important areas. In frames "0145" and "0150" of Figure 7(c), it can be seen that viewers pay more attention to the direction in which the player is about to move. Similarly, in frames "0815" and "0818" of Figure 7(c), viewers pay more attention to the area ahead of the motorcycle and less to the area behind it.
Specifically, as shown in Figures 6(c), 7(a), and 7(c), the smaller w is, the weaker the effect of visual attention; the larger w is, the more obvious the effect of visual attention.

Time Analysis.
The sizes of the video frames and the average processing time per frame are shown in Table 1.
It can be seen from Table 1 that FSC takes the longest time, at 6.03 s per frame. The average time per frame of the proposed VAMEVR is 0.53 s, which is 0.24 s longer than SNS. The increased time is mainly spent calculating the visual attention energy and the motion estimation.

For each Frame i in input video V:
    If the top-n SIFT feature points include foreground points:
        Calculate the background speed V_bg between Frame i and Frame i + 1 by (2)
        Calculate the foreground speed V_fg between Frame i and Frame i + 1 by (3)
        Calculate the actual moving speed V_act of the salient object by (4) and (5)
    Else:
        Calculate the background speed V_bg between Frame i and Frame i + 1 by (2)
        Calculate the foreground speed V*_fg between Frame i and Frame i + 1 by (6)
        Calculate the actual moving speed V_act of the salient object by (7) and (8)
    End If
    Estimate the position of the foreground: (x_est, y_est) = (x_cur, y_cur) + V_act
    Calculate the circumscribed polygon R_fg of both the estimated and current positions of the foreground
    Generate the foreground motion estimation map I_motion according to the salient areas S_r in polygon R_fg
    Compose the importance map I_imp from the visual attention energy map I_eye, the foreground motion estimation map I_motion, and the gradient map I_grad by (9)
    Use the mesh deformation method described in Section 2.4 to produce the retargeting result of Frame i
End for
Output result V_result

ALGORITHM 1: Video retargeting based on visual attention and motion estimation.

Discussion.
The human visual system is more sensitive to salient objects. The more consistent the displacement of salient objects in adjacent frames before and after retargeting, the lower the content jitter. Here, 30 frames of the motorcycle racing video are randomly selected for retargeting.
For the proposed VAMEVR, the centroid displacement of the salient object in the retargeting result is essentially the same as in the original video. When the weight coefficient of the visual attention energy map is 0.1 and 0.9, the comparative analysis of horizontal and vertical displacement is shown in Figure 8. The displacement correlation of the salient objects indicates the visual consistency between the original video and the retargeting result. The displacements of the centroid of the salient object in the input video and in the retargeting result are denoted as X and Y, respectively. With covariance cov(X, Y) and standard deviations (σ_X, σ_Y), the Pearson correlation coefficient is defined as ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y).
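The coefficient can be computed directly from its definition; a minimal sketch (function name is our own):

```python
import numpy as np

def pearson(X, Y):
    """Pearson correlation rho_{X,Y} = cov(X, Y) / (sigma_X * sigma_Y),
    used here to compare salient-object displacement in the input
    video (X) against the retargeting result (Y)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    cov = np.mean((X - X.mean()) * (Y - Y.mean()))  # population covariance
    return cov / (X.std() * Y.std())
```

A value near +1 means the salient object moves consistently in the original and retargeted videos; values near 0 or below indicate jitter.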
As shown in Table 2, for VAMEVR the displacements of the salient objects before and after retargeting are more positively correlated than for SNS and FSC, so the visual effect of our results is more consistent with the original video.

Conclusion
This paper proposes a visual attention and motion estimation-based video retargeting method for medical data security. Firstly, clustering is carried out on the eye tracking data to generate the visual attention energy map. Then, the motion estimation map is obtained from the corresponding feature points of the foreground and background between adjacent frames. Thirdly, the importance map is generated by composing the visual attention energy map, motion estimation map, and gradient map. Finally, video retargeting is performed by mesh deformation. Experiments show that the proposed method protects the important areas attended to by the human visual system, and the displacement of salient objects in the retargeting results is closer to that of the input video. Therefore, the visual effect is more in line with human visual needs. Our future work is to study multi-object separation and then video retargeting based on multi-object motion estimation for medical data security.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.