Dynamic Warping Network for Semantic Video Segmentation



Introduction
Semantic segmentation aims to assign a specific semantic label to each pixel of a given image. In recent years, models based on deep learning [1][2][3][4][5] have brought the performance of the task to a new level. However, most existing methods are designed only for parsing still images and may produce inconsistent results on video frames due to the lack of temporal information.
To address this problem, many methods incorporate temporal information from the video to improve the accuracy of video segmentation. Optical flow, which encodes the temporal consistency across frames in a video, has been widely used for semantic video segmentation. Gadde et al. [6] propose to combine the features warped from previous frames with optical flow and those from the current frame to enhance the features. The studies of [7][8][9] use feature warping for acceleration.
However, there are two main problems with existing warping-based methods. First, the optical flow obtained by traditional algorithms or optical flow estimation networks [10][11][12] cannot accurately estimate the motion of all pixels across adjacent frames. Second, the warping operation adopted by previous methods [6,7,13] is implemented with standard bilinear interpolation and does not contain any learnable parameters. Therefore, warping features based on imprecise optical flow may result in misalignment between the warped features and the expected ones. TWNet [9] introduces a correction stage after warping to refine the warped features. However, this method has some limitations, because it requires motion vectors and residuals from a video compressed according to a specific compression standard.
In this paper, we propose a novel framework named Dynamic Warping Network (DWNet) to adaptively warp the interframe features for improving the accuracy of warping-based models. First, we design a flow refinement module (FRM) to optimize the precomputed optical flow and produce more accurate pixel displacement for every pixel location. Besides, we propose a flow-guided convolution (FG-Conv) to achieve the adaptive feature alignment based on the refined optical flow instead of the original warping operation. Furthermore, we introduce the temporal consistency loss including the feature consistency loss and prediction consistency loss to explicitly supervise the warped features and guarantee the temporal consistency of video segmentation, as shown in Figure 1. Our DWNet adopts extra constraints to improve the temporal consistency instead of simple feature fusion and feature propagation [6,7], which makes the network explicitly model the temporal consistency of the video in the training phase. And, in the inference phase, the optical flow network, the flow refinement module, and the flow-guided convolution can be removed. Hence, the final network can be regarded as a semantic image segmentation network with no postprocessing during inference.
We evaluate our DWNet on two semantic video segmentation benchmarks: Cityscapes and CamVid. Extensive experiments show that our DWNet can significantly outperform existing warping-based methods and achieve state-of-the-art accuracy on the two benchmark datasets. In particular, our DWNet achieves consistent improvement over various strong baselines, which demonstrates the generalization ability of our method.
To conclude, our main contributions are five-fold: (i) We propose a novel framework named Dynamic Warping Network (DWNet) to adaptively warp the interframe features. (ii) We design a flow refinement module (FRM) to optimize the optical flow and propose a flow-guided convolution (FG-Conv) to adaptively align features across adjacent frames according to the refined optical flow. (iii) We explicitly model the temporal consistency of the video and introduce the temporal consistency loss to supervise the warped features. (iv) Our DWNet needs no additional parameters or calculation during inference, because the optical flow network, the flow refinement module, and the flow-guided convolution can be removed in the inference phase. (v) The experimental results demonstrate that our DWNet outperforms previous warping-based methods and achieves state-of-the-art accuracy on the Cityscapes and CamVid datasets.

Semantic Video Segmentation.
Semantic video segmentation aims to carry out dense labeling for all pixels in each frame of a video sequence. Compared with semantic image segmentation, semantic video segmentation needs to focus more on the temporal consistency of consecutive frames and produce more consistent interframe predictions. Therefore, many works incorporate temporal information from the video to improve video segmentation accuracy, including optical flow-based feature warping [6,8,9,13-17], propagation-based methods [18,19], LSTM-based methods [15,20], a 3D CNN-based method [21], and a weakly supervised method [22]. Optical flow, which encodes the temporal consistency across frames in the video, has been most widely used for semantic video segmentation. The optical flow-based methods first compute the optical flow between the current frame and the previous frame, and then either enhance the features of the current frame by warping the features of the previous frame or utilize the warped features from the keyframe as the features of the current frame for acceleration. Despite its relative strength, optical flow-based feature warping has two main problems, as discussed above. TWNet [9] and DMNet [23] propose to correct the warped features with postprocessing, which only brings a slight improvement. To the best of our knowledge, we are the first to directly optimize the warping operation and propose a learnable dynamic warping operation in place of the original one.

Dynamic Convolution.
The study [24] proposes dynamic filters or kernels to generate context-aware filters which are adaptive to the input and are predicted by the network. Many works [25,26] have adopted predicted dynamic filters to obtain better feature representations. Deformable convolution [27,28] utilizes the input features to generate different offsets and weights for each sample position. Motivated by deformable convolution, we observe that the optical flow can be regarded as an offset, and we can utilize this offset to adaptively align interframe features. Different from deformable convolution, whose offsets are generated from the input features, we utilize the flow refinement module to optimize the optical flow and obtain more accurate pixel displacements. Furthermore, we propose a flow-guided convolution to dynamically warp the features based on the refined optical flow and achieve better feature warping.

Methods
In this section, we first give an overview of our DWNet framework and then describe each of its components in detail. Finally, we describe how to optimize the whole network to improve semantic video segmentation.

Overview.
The overall structure of our DWNet framework is illustrated in Figure 2. The inputs of our DWNet are a pair of RGB images I_t and I_{t+k}, where I_t represents the labeled frame and I_{t+k} represents an unlabeled frame randomly selected from the nearby frames of I_t with k ∈ [−5, 5]. The two images are first sent to the shared segmentation network to extract the semantic features F_t and F_{t+k}. Meanwhile, the two images are also sent to the optical flow estimation network to predict the coarse optical flow O_{t+k→t}. Then, we utilize the flow refinement module to optimize the optical flow and produce a more accurate optical flow Ô_{t+k→t} for every pixel position. After that, we adopt the flow-guided convolution to dynamically warp F_{t+k} to F̂_t according to the refined optical flow Ô_{t+k→t}. Finally, F_t and F̂_t are sent to the shared classifier to produce the segmentation maps P_t and P̂_t, respectively, and we introduce two kinds of temporal consistency losses as extra constraints to supervise the warped feature F̂_t and the prediction P̂_t, respectively. In the following, we will introduce each key component of our DWNet in detail.

Flow Refinement Module.
We first utilize an existing optical flow estimation network to obtain the optical flow O_{t+k→t}. The optical flow network computes the pixel displacement (Δx, Δy) for every pixel location (x, y) in I_t to the spatial location (x′, y′) in I_{t+k}, which means that (x′, y′) = (x + Δx, y + Δy). Δx and Δy are floating point numbers and denote the pixel displacements in the horizontal and vertical directions, respectively [6]. However, the optical flow estimated by the optical flow network may not be accurate enough due to occlusion and new objects. Therefore, we propose the flow refinement module to optimize the coarse optical flow. We concatenate the two input images, the difference of the two images, and the coarse optical flow, resulting in an 11-channel tensor as the input to the flow refinement module. The flow refinement module consists of 4 convolution layers. The first 3 layers are made up of 3 × 3 kernels with stride 2, each followed by BatchNorm and ReLU, and the numbers of output channels are set to 64, 128, and 256, respectively. The output of the third layer is then passed to the last 3 × 3 convolution layer with 2s² output channels to attain the refined optical flow Ô_{t+k→t}, whose spatial size corresponds to that of the features F_t and F_{t+k}. Here, s represents the kernel size of the flow-guided convolution, which will be discussed in Section 3.3 and is set to 1 by default. We visualize the original optical flow and the refined optical flow in Figure 3. The refined optical flow has sharper motion boundaries for moving objects and semantics, such as humans and cars, which demonstrates the effectiveness of the flow refinement module. Next, we will introduce how to use the refined optical flow to achieve better feature warping.

Figure 2: The overall structure of our DWNet framework. FRM denotes the flow refinement module. FG-Conv denotes the flow-guided convolution. The feature consistency loss and prediction consistency loss together form the temporal consistency loss, which improves the temporal consistency of video segmentation. The dotted lines denote the components that are only used in the training phase and will be removed in the inference phase.
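To make the architecture concrete, the module described above can be sketched in PyTorch. This is a minimal illustration based only on the textual description; the class and argument names, padding, and initialization are our own assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class FlowRefinementModule(nn.Module):
    """Sketch of the FRM: an 11-channel input (two RGB frames, their
    difference, and the 2-channel coarse flow), three stride-2 conv blocks
    (so the output is downsampled 8x to match the feature resolution),
    and a final 3x3 conv producing 2*s*s refined-flow channels."""

    def __init__(self, s=1):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        self.body = nn.Sequential(block(11, 64), block(64, 128), block(128, 256))
        self.head = nn.Conv2d(256, 2 * s * s, 3, padding=1)

    def forward(self, img_t, img_tk, coarse_flow):
        # Concatenate frames, frame difference, and coarse flow: 3+3+3+2 = 11 ch.
        x = torch.cat([img_t, img_tk, img_t - img_tk, coarse_flow], dim=1)
        return self.head(self.body(x))
```

With s = 1 (the default) the output has 2 channels, i.e., one refined (Δx, Δy) pair per downsampled pixel; with s = 3 it has 18 channels, one displacement pair per sampling location of the 3 × 3 flow-guided kernel.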

Flow-Guided Convolution.
The flow refinement module utilizes the original optical flow and the images to produce a more precise optical flow estimate. Given the optical flow, previous methods utilize the warping operation to transform the feature F_{t+k} into the feature of the current frame:

F̂_t(x, y) = F_{t+k}(x + Δx, y + Δy),

where (Δx, Δy) = O_{t+k→t}(x, y) and the sampling is implemented with bilinear interpolation. However, this cannot accurately align the warped feature with the feature of the current frame, due to the imprecise optical flow and the original warping operation without any learnable parameters. Hence, we first utilize the flow refinement module to optimize the optical flow, as discussed in Section 3.2. Besides, we propose the flow-guided convolution to adaptively warp the interframe features. The standard convolution samples the input feature map at fixed locations, and DCNv1 [27] adds 2D offsets to the regular grid sampling locations to enable free-form deformation of the sampling grid. Motivated by this work, we observe that the optical flow, which encodes the pixel displacement across frames, can be regarded as a specific offset, and we can utilize the optical flow to dynamically warp the interframe features. Formally, the standard 2D convolution can be written as

y(i) = Σ_{p∈P} w(p) · x(i + p),

where y denotes the output after the convolution, i denotes the location, x denotes the input features, w denotes the convolution filters with a length of P, and p enumerates the P regular sampling locations in an s × s kernel. We propose the flow-guided convolution by adding location offsets to p as follows:

y(i) = Σ_{p∈P} w(p) · x(i + p + Δp),

where Δp ∈ Ô_{t+k→t}. The refined optical flow is regarded as the offsets for the flow-guided convolution, which adaptively samples more corresponding pixel locations between interframe features. The kernel size s is the key parameter of the flow-guided convolution, and we will discuss it in Section 4. Different from deformable convolution, we obtain the offsets from the flow refinement module instead of applying a convolution layer to the input feature. Hence, we can attain more accurate offsets and achieve better feature warping.
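The warping and flow-guided sampling described above can be illustrated with a minimal NumPy sketch. The function names, the border-clipping policy, and the reduction of the s = 1 case to warping followed by a 1 × 1 convolution are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def bilinear_warp(feat, flow):
    """Original warping operation: out(x, y) = feat(x + dx, y + dy),
    bilinearly interpolated. feat: (C, H, W); flow: (2, H, W) as (dx, dy)."""
    C, H, W = feat.shape
    grid = np.mgrid[0:H, 0:W].astype(float)       # grid[0] = y, grid[1] = x
    sx = np.clip(grid[1] + flow[0], 0, W - 1)     # sample positions, clipped
    sy = np.clip(grid[0] + flow[1], 0, H - 1)     # at the image border
    x0 = np.floor(sx).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(sy).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = sx - x0, sy - y0                     # bilinear weights
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy)
            + feat[:, y0, x1] * wx * (1 - wy)
            + feat[:, y1, x0] * (1 - wx) * wy
            + feat[:, y1, x1] * wx * wy)

def flow_guided_conv_s1(feat, refined_flow, weight):
    """s = 1 flow-guided convolution: per-pixel offset sampling with the
    refined flow, followed by a learnable 1x1 conv (weight: (C_out, C_in))."""
    warped = bilinear_warp(feat, refined_flow)
    C, H, W = warped.shape
    return (weight @ warped.reshape(C, -1)).reshape(weight.shape[0], H, W)
```

With a zero flow, `bilinear_warp` is the identity; with s > 1, the same offset sampling is applied at each of the s × s kernel locations with its own Δp channel pair.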

Temporal Consistency Loss.
The flow-guided convolution can dynamically warp the feature F_{t+k} and produce the estimated feature F̂_t of the current frame. Previous methods concatenate or compute a weighted sum of the warped feature F̂_t and the original feature F_t to achieve feature fusion and propagation. However, we argue that the warped feature F̂_t is expected to be consistent with the original feature F_t, and the two features should ideally be the same. Hence, we propose the temporal consistency loss to explicitly supervise the feature F̂_t and the segmentation map P̂_t, respectively. Compared with the previous methods using feature fusion or feature propagation, we utilize extra constraints to improve the temporal consistency of video segmentation, which is more reasonable and does not introduce additional calculation or postprocessing in the inference phase. The temporal consistency loss contains the feature consistency loss and the prediction consistency loss, which are related to the feature F̂_t and the segmentation map P̂_t, respectively.

Feature Consistency Loss.
We attempt to constrain the features F_t and F̂_t to be similar enough by designing a feature consistency loss. Instead of a per-pixel similarity calculation, we measure the similarity between the self-attention maps A_t and Â_t of the two features. Since the self-attention maps present high-order relationships among pixels, such a similarity measurement is more robust than the typical per-pixel one. Let a_{i,j} denote the similarity between the ith pixel and the jth pixel of the original feature F_t, and let â_{i,j} denote the similarity between the ith pixel and the jth pixel of the warped feature F̂_t, where a_{i,j} ∈ A_t and â_{i,j} ∈ Â_t. The similarity a_{i,j} is computed from the features F_{t,i} and F_{t,j} as

a_{i,j} = (F_{t,i} · F_{t,j}) / (‖F_{t,i}‖₂ ‖F_{t,j}‖₂),

and we adopt the squared difference to formulate the feature consistency loss:

ℓ_fc = (1/N²) Σ_i Σ_j (a_{i,j} − â_{i,j})²,

where N denotes the total number of pixels. The warped feature and the original feature should produce similar attention maps that encode the pixel correlations. Hence, this loss strengthens the feature consistency by explicitly supervising the attention maps.
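A minimal NumPy sketch of this loss follows. We assume a cosine-similarity form for a_{i,j}, which is one standard choice for self-attention-style pixel similarity; the exact normalization used in the original implementation may differ.

```python
import numpy as np

def attention_map(feat):
    """Pairwise similarity a[i, j] between flattened pixel features.
    feat: (C, H, W) -> (N, N) with N = H * W, using cosine similarity."""
    C = feat.shape[0]
    f = feat.reshape(C, -1).T                              # (N, C) pixel vectors
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def feature_consistency_loss(feat_t, warped_feat_t):
    """Mean squared difference between the two N x N attention maps."""
    a, a_hat = attention_map(feat_t), attention_map(warped_feat_t)
    return np.mean((a - a_hat) ** 2)
```

Because the loss compares N × N relation maps rather than individual pixels, small local perturbations of the warped feature change it far less than a per-pixel distance would.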

Prediction Consistency Loss.
The segmentation map P̂_t produced by the feature F̂_t should also be consistent with the segmentation map P_t of the current frame. Hence, we introduce the prediction consistency loss [17] to improve the temporal consistency of video segmentation as follows:

ℓ_pc = (1/N) Σ_i M_{t+k→t,i} ‖P_{t,i} − P̂_{t,i}‖²,

where N denotes the total number of pixels. Due to the occlusion and new objects across frames, we predict a mask M_{t+k→t} that assigns a different weight to each pixel according to the warping error E_{t+k→t}, where E_{t+k→t} = |I_t − Î_t| and Î_t denotes the input frame warped from I_{t+k}. Then, M_{t+k→t} is defined as

M_{t+k→t} = exp(−δ · E_{t+k→t}),

where δ is a hyperparameter that controls the amplitude of the difference between high and low errors. Pixels with higher warping errors are assigned lower weights and vice versa, because a higher warping error indicates that the optical flow and the warped feature are more inaccurate. M_{t+k→t} can speed up the convergence of the prediction consistency loss and improve the accuracy of video segmentation by focusing on the pixels with more precise optical flow and ignoring the noise produced by occlusion and new objects.
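The masked loss can be sketched in NumPy as follows. The exponential form of the mask is our assumption, chosen so that higher warping errors receive exponentially lower weights with δ controlling the falloff; the published formulation may use a different monotone decreasing function.

```python
import numpy as np

def occlusion_mask(img_t, warped_img_t, delta=2.0):
    """Down-weight pixels with a high warping error E = |I_t - warped I_t|.
    img_t, warped_img_t: (3, H, W). Returns a per-pixel weight map (H, W)."""
    err = np.abs(img_t - warped_img_t).mean(axis=0)        # mean over channels
    return np.exp(-delta * err)                            # error 0 -> weight 1

def prediction_consistency_loss(pred_t, warped_pred_t, mask):
    """Masked per-pixel squared difference between the two segmentation
    maps pred_t and warped_pred_t of shape (num_classes, H, W)."""
    per_pixel = ((pred_t - warped_pred_t) ** 2).sum(axis=0)  # (H, W)
    return np.mean(mask * per_pixel)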

Optimization.
The loss of our DWNet consists of the conventional cross-entropy loss ℓ_ce and the temporal consistency loss, which includes the feature consistency loss ℓ_fc and the prediction consistency loss ℓ_pc. Hence, our final objective function is

ℓ = ℓ_ce + λ₁ ℓ_fc + λ₂ ℓ_pc,

where λ₁ and λ₂ denote the weights of the respective losses. As illustrated in Figure 2, our DWNet can be trained in an end-to-end fashion. In the inference phase, the optical flow network, the flow refinement module, and the flow-guided convolution (shown with dotted lines) can be removed. Hence, the final network can be regarded as a semantic image segmentation network with no additional calculation or postprocessing during inference.

Datasets.
We evaluate our proposed DWNet on two semantic video segmentation benchmark datasets: Cityscapes [29] and CamVid [30]. Cityscapes is an urban scene dataset and contains 5000 video snippets collected from 50 cities in different seasons. Each snippet contains 30 frames, and only the 20th frame is finely annotated at the pixel level, so the dataset contains 5000 annotated images, which are divided into 2975, 500, and 1525 images for training, validation, and testing, respectively. Besides, the dataset also contains 20000 coarsely annotated images, but we do not utilize these data in any experiments unless otherwise stated.
CamVid is composed of 701 densely annotated images from five video sequences. The images are labeled every 30 frames with 11 semantic classes. Following the previous work [6], the dataset is split into 367 training, 101 validation, and 233 testing images.

Models.
To validate the effectiveness of our proposed method, we conduct extensive experiments with different network configurations. We adopt ResNet50 [31], ResNet101 [31], and MobileNetV2 [32] as the backbones to extract high-level features, and we choose PSPNet [33], DeepLabV3+ [3], and DANet [5] as the segmentation models. The segmentation network is a combination of a backbone and a segmentation model. We conduct the ablation experiments on ResNet50 with the structure of PSPNet, namely, PSPNet50. Because the optical flow network can be removed in the inference phase, we adopt the more powerful optical flow estimation network FlowNetV2 [11] to extract more accurate optical flow, even though it is slower and has more parameters during training compared with lightweight flow networks like [10,12].

Implementation Details.
We implement our method based on PyTorch. We employ an SGD optimizer and a poly learning rate policy, where the initial learning rate is multiplied by (1 − iter/max_iter)^power with power = 0.9 after each iteration. The base learning rate is set to 0.01 for both datasets. Momentum and weight decay are set to 0.9 and 0.0001, respectively. We utilize synchronized batch normalization [4] with a batch size of 8 for both datasets. For data augmentation, we apply random scaling of the input images (from 0.5 to 2.2 on Cityscapes, from 0.5 to 2.0 on CamVid), random cropping (768×768 for Cityscapes, 384×384 for CamVid), and random left-right flipping during training. Note that the optical flow network FlowNetV2 is also jointly optimized with a base learning rate of 0.00001. We employ the standard pixel-wise cross-entropy loss as the main loss and train the whole network on 8 NVIDIA TITAN RTX cards. The loss weights are set to λ₁ = 10 and λ₂ = 0.1 for all experiments. After training, we utilize the original images for inference unless otherwise stated. Following the previous works [6,8], we apply mean intersection-over-union (mIoU) as the evaluation metric to validate our method.
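As a small illustration, the poly learning rate policy described above can be written as a one-line schedule (the function name is ours):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly schedule: lr = base_lr * (1 - iter / max_iter) ** power,
    applied after each training iteration."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

The learning rate starts at `base_lr`, decays smoothly, and reaches zero exactly at `max_iter`; with power = 0.9 the decay is close to linear but slightly slower early in training.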

Ablation Study.
We build the DWNet on top of a single-frame segmentation model, and we adopt PSPNet50 as the single-frame model to conduct all the ablation experiments on the Cityscapes dataset.

Effectiveness of the Proposed Method.
In this section, we evaluate the different components of our DWNet under different settings, and the results are shown in Table 1. The baseline model is PSPNet50 with single-frame training and inference. When we utilize the original warping operation and adopt the feature consistency loss as a constraint, the performance is improved by only 0.55%. However, when we replace the original warping operation with our proposed flow-guided convolution, it brings a further improvement of 0.57%, which demonstrates that dynamic warping is better than the original warping operation. Besides, the flow refinement module and the prediction consistency loss improve the performance by 0.47% and 0.38%, respectively, and introducing the two components simultaneously further improves the accuracy to 75.62%. We also verify whether the two components are beneficial to the warping-based method, and the results show that the accuracy improves from 74.3% to 74.76%, a smaller gain than that of our proposed method (from 74.87% to 75.62%).

Flow-Guided Convolution.
The flow-guided convolution is the core operation of our DWNet, which utilizes the refined optical flow to adaptively warp the interframe features. The kernel size s is its key parameter. In the original warping operation, each pixel corresponds to a specific offset, and the offset is used to warp each pixel independently. However, we argue that more adjacent pixels can be considered when determining the warped result of each pixel. Hence, we can adjust s to achieve more precise feature warping. When s is equal to 1, the flow-guided convolution is similar to the original warping operation, which treats each pixel independently; however, our flow-guided convolution contains learnable parameters and can adaptively adjust the warped features. As shown in Table 2, setting s to 3 yields the best performance. Besides, the flow-guided convolution with every tested value of s outperforms the original warping operation, which demonstrates that our proposed method achieves better feature warping. When s is set to 5, the accuracy degrades; we believe that a larger s may introduce more noise and hinder the stable training of the whole model.

Prediction Consistency Loss.
The prediction consistency loss aims to improve segmentation stability. We calculate the occlusion mask to speed up the convergence and improve the accuracy of video segmentation by considering the pixels with more precise optical flow and ignoring the noise produced by occlusion and new objects. The δ is a hyperparameter that controls the amplitude of the difference between high and low errors. Hence, we provide a discussion of δ, and the results are shown in Table 3. We first try the prediction consistency loss without the occlusion mask and find that the performance decreases by 0.22% compared with the baseline, which demonstrates the importance of the occlusion mask. If we treat all pixels equally, the pixels with high warping errors will seriously affect the training and the final segmentation accuracy. When we introduce the mask and set δ to 2, we obtain the best performance. In fact, our initial designs for both temporal consistency losses considered the occlusion and new objects. However, the impact on the feature consistency loss is slight (from 74.87% to 74.89%). The occlusion and new objects usually reflect small and local changes across frames; the feature consistency loss models long-range and high-order relationships and is thus more robust to such small changes, while the prediction consistency loss models per-pixel similarity and is susceptible to the occlusion and new objects. Hence, we only add the occlusion mask to the prediction consistency loss.

Table 1: Warp denotes the original warping operation. ℓ_fc and ℓ_pc denote the feature consistency loss and prediction consistency loss, respectively. FG-Conv denotes the flow-guided convolution. FRM denotes the flow refinement module. The bold values denote that our method achieves the best accuracy compared with the other methods.

Feature Fusion and Propagation.
To make use of the warped features, previous methods do a weighted sum or a concatenation of the warped features and the original features for feature fusion and propagation. We compare these previous methods with our proposed method in Table 4. The results show that our proposed method is clearly better than the previous methods, which confirms our conjecture about the reuse of the warped features. The comparison with existing methods is shown in Table 6, and our DWNet outperforms them by a significant margin. In particular, with the PSPNet as the backbone, our method trained with only the fine set improves the mIoU score by 0.9%, which is superior to previous methods trained with both the fine and coarse sets, like [6,13,15]. When we also utilize both fine and coarse images for training, our method brings a further improvement of 0.7%, which demonstrates its effectiveness. Besides, when we utilize DANet as the segmentation network, the accuracy improves to 82.1%, which shows that our method generalizes well across different segmentation networks.

Qualitative Results.
The qualitative comparison is shown in Figure 4. Existing warping-based methods adopt standard bilinear interpolation without any learnable parameters to warp the interframe features based on the imprecise precomputed optical flow, which negatively affects the segmentation results.

Table 4: Sum and Concatenate denote the weighted sum and concatenation of the warped features and the original features for feature fusion, respectively. TCLoss denotes the temporal consistency loss, including the feature consistency loss and prediction consistency loss. The bold values denote that our method achieves the best accuracy using both the feature consistency loss and prediction consistency loss.

Table 6: Methods trained using both fine and coarse sets are marked with "‡." The bold values denote that our method achieves the best accuracy compared with other state-of-the-art methods.

Figure 4: Qualitative results on consecutive frames (frame k, k + 1, k + 5, and k + 7) of the Cityscapes dataset. Baseline: training and inferring on single frames. Warping-based method: adopting the original warping operation to enhance the features. Our method: utilizing the flow-guided convolution to adaptively warp the interframe features. Compared with the baseline, the warping-based method brings a slight improvement on the moving objects, and our method produces more accurate and consistent segmentation results.

The results on CamVid are shown in Table 7, and our method outperforms the current state-of-the-art methods, which demonstrates its generalization ability across different datasets.

Conclusion
In this paper, we propose a novel framework named DWNet to adaptively warp the interframe features. We design the flow refinement module to optimize the optical flow and propose the flow-guided convolution to achieve adaptive feature alignment. Besides, we introduce the temporal consistency loss to explicitly supervise the warped features and guarantee the temporal consistency of video segmentation. Extensive experiments have shown that our method outperforms existing warping-based methods and achieves state-of-the-art accuracy on the Cityscapes and CamVid benchmark datasets.

Conflicts of Interest
The authors declare that they have no conflicts of interest.