Multimodal Deep Feature Fusion (MMDFF) for RGB-D Tracking

1 Jiangsu Laboratory of Lake Environment Remote Sensing Technologies, Huaiyin Institute of Technology, Huaian, 223003, China
2 Faculty of Electronic Information Engineering, Huaiyin Institute of Technology, Huaian, 223003, China
3 School of Physics & Electronic Information Engineering, Henan Polytechnic University, Jiaozuo, 454000, China
4 School of Computer Science & Technology, Zhejiang University, 310058, China
5 Institute of VR and Intelligent System, Hangzhou Normal University, 310012, China
6 Faculty of Computer and Software Engineering, Huaiyin Institute of Technology, Huaian, 223003, China


Introduction
As one of the most fundamental problems in computer vision, visual object tracking, which aims to estimate the trajectory of an object in a video, has been successfully applied in numerous areas, including intelligent traffic control, artificial intelligence, and autonomous driving [1][2][3]. Although research on visual object tracking has made outstanding achievements, many challenges remain when seeking to track objects effectively in practice. For example, it is still quite difficult to track objects when occlusion occurs frequently, the appearance changes, the motion of the object is complex, or the illumination varies [4][5][6].
The major drawback of tracking methods that use only RGB data is that they are not robust to appearance changes. Thanks to the availability of consumer-grade RGB-D sensors, such as the Intel RealSense, Microsoft Kinect, and Asus Xtion, more accurate depth information about objects can be obtained to revisit the existing problems of tracking [7,8]. Compared with RGB data alone, RGB-D data can remarkably improve tracking performance because the depth information is complementary to RGB [9]. Depth is invariant to illumination and color variations [10,11] and provides geometrical cues and spatial structure information; it therefore offers powerful benefits for visual object tracking. However, how to effectively utilize the depth data provided by RGB-D sensors remains a challenging issue.
However, RGB and depth only encode static information from a single frame, so tracking often fails when the motion of the object is complex. Under these circumstances, deep motion features can provide high-level motion information to distinguish the target object [12][13][14]. The dynamic information captured by deep motion features is complementary to the static features extracted from the RGB and depth images.

Motivation and Contributions
Our motivation is to design an RGB-D object tracker based on multimodal deep feature fusion. Specific emphasis is placed on exploring three scientific questions: how to fuse deep motion features with the static features provided by the RGB and depth images; how to fuse the deep RGB and depth features sufficiently; and how to effectively derive geometrical cues and spatial structure information from depth data. In summary, the key technical contributions of this study are three-fold: (ii) The depth image is encoded into three channels: horizontal disparity, height above ground, and angle with gravity. The three-channel image is then fed into a depth-specific CNN to extract deep depth features, and the strong correlation between the RGB and depth modalities is learned by an RGB-Depth correlated CNN. In contrast to many existing RGB-D trackers, which use depth information only as one extra channel, encoding the depth image into three channels yields more useful information for tracking, such as geometrical features and spatial structure information.
(iii) To evaluate the performance of the proposed RGB-D tracker, we conduct extensive experiments on two recent challenging RGB-D benchmark datasets: the large-scale Princeton RGB-D Tracking Benchmark (PTB) dataset [13] and the University of Birmingham RGB-D Tracking Benchmark (BTB) [15]. The experimental results show that our approach is superior to the state-of-the-art RGB-D trackers.
The remainder of this paper is organized as follows. Related work is discussed in Section 2. Section 3 describes the overview and details of the proposed method. In Section 4, we present experimental results evaluating the proposed RGB-D tracker. We conclude our work in Section 5.

RGB-D Object Tracking.
With the emergence of RGB-D sensors, there has been great interest in visual object tracking using RGB-D data [15][16][17], since the depth modality provides useful information complementary to the RGB modality.
An RGB-D tracking method using depth-scaling kernelised correlation filters and occlusion handling was proposed in [18]. In [19], the authors used Haar-like and HOG features computed on RGB and depth to form a boosting-based tracking approach. Zheng et al. [20] presented an object tracker based on sparse representation of the depth image. Hannuna et al. proposed an RGB-D tracker built upon the KCF tracker, which exploits depth information to handle scale changes, occlusions, and shape changes.
Although the existing RGB-D trackers have contributed greatly to promoting RGB-D tracking, most of them use hand-crafted features and simply fuse RGB and depth information, ignoring the strong correlation between the RGB and depth modalities.

Deep RGB Features.
Owing to their superiority in feature extraction, CNNs have been increasingly used in RGB trackers [21,22]. A CNN consists of a number of convolutional, pooling, and fully connected layers, and the features at different layers have different properties: higher layers capture semantic features, lower layers capture spatial features, and both are important for the tracking problem [23,24].
In [25], Song et al. proposed the CREST algorithm, which treats the correlation filter as one convolutional layer and applies residual learning to capture appearance changes. C-COT was presented in [26]; it employs an implicit interpolation model to solve the learning problem in the continuous spatial domain. Zhu et al. [27] proposed a tracker named UCT, an end-to-end framework that learns convolutional features and performs tracking simultaneously.
All of these trackers consider only deep RGB appearance features in the current frame, and therefore cannot benefit from the geometrical features extracted from the depth image or from inter-frame dynamic information. We address this problem by fusing deep features from RGB-D data with deep motion features.

Deep Depth Features.
In recent years, deep depth features have received much attention in object recognition [28,29], object detection [30,31], indoor semantic segmentation [32,33], etc. Unfortunately, few existing RGB-D trackers use a CNN to extract deep depth features to improve tracking performance. In [34], Jiang et al. proposed an RGB-D tracker based on cross-modality deep learning, in which Gaussian-Bernoulli deep Boltzmann machines (DBMs) were adopted to extract features from the RGB and depth images. Among the drawbacks of DBMs, one of the most important is the high computational cost of inference.
CNNs, by contrast, have achieved great success across computer vision. In this context, we focus on how to use CNNs to effectively fuse deep depth features with deep RGB and motion features in an RGB-D tracker.

Deep Motion Features.
Deep motion features have been successfully applied to action recognition [35] and video classification [36], but they are rarely applied to visual tracking. Most existing tracking methods extract only appearance features and ignore motion information. In [37], Zhu et al. proposed an end-to-end flow correlation tracker that exploits the rich flow information in consecutive frames to improve feature representation and tracking accuracy. Danelljan et al. [38] investigated the influence of deep motion features in an RGB tracker, but did not take depth information into account.
To the best of our knowledge, deep motion features have not yet been applied to RGB-D tracking. In this paper, we discuss how to fuse deep motion features with RGB appearance features and the geometrical features provided by the depth image to improve RGB-D tracking performance.

Multimodal Deep Feature Fusion (MMDFF) Model.
In this section, a novel MMDFF model is proposed for RGB-D tracking, aiming to fuse deep motion features with the static appearance and geometrical features provided by RGB and depth data. The overall architecture of our approach is illustrated in Figure 1. Our end-to-end MMDFF model is composed of four deep CNNs: a motion-specific CNN, an RGB-specific CNN, a depth-specific CNN, and an RGB-Depth correlated CNN. A final fully connected (FC) fusion layer effectively fuses the deep RGB-D features and the deep motion features; the fused deep features are then sent to the C-COT tracker, which produces the tracking result.
As discussed above, CNNs significantly outperform traditional machine learning approaches in a wide range of computer vision tasks. It is widely acknowledged that CNN features extracted from different layers play different roles in tracking. A lower layer captures spatially detailed features, which helps to precisely localize the target object; a higher layer provides semantic features, which are robust to occlusion and deformation. In our MMDFF model, we adopt hierarchical convolutional features extracted separately from the RGB-specific CNN and the depth-specific CNN. To be more specific, two independent CNNs are adopted: the RGB-specific CNN for RGB data and the depth-specific CNN for depth data. The features from Conv3-3 and Conv5-3 of both CNNs are sent to the topmost fusion FC layer in our experiments.
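To make the resolution trade-off between lower and higher layers concrete, the following minimal NumPy sketch (our illustration, not the paper's network) stacks convolution and pooling stages and shows that a shallow "Conv3-3-like" map retains roughly twice the spatial resolution of a deeper "Conv5-3-like" map:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution of a single-channel map (no padding)."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling (trailing rows/cols dropped)."""
    H, W = x.shape
    return x[:H - H % s, :W - W % s].reshape(H // s, s, W // s, s).max(axis=(1, 3))

# A 64x64 input: one conv+pool stage keeps a 31x31 "shallow" map, a
# second stage shrinks it to 14x14 -- the shallow map localizes more
# precisely, while the deep map summarizes larger spatial context.
x = np.random.default_rng(0).normal(size=(64, 64))
k = np.ones((3, 3)) / 9.0
shallow = max_pool(conv2d(x, k))     # Conv3-3-like resolution
deep = max_pool(conv2d(shallow, k))  # Conv5-3-like resolution
```

This is why the model taps both levels: the shallow map supports precise localization and the deep map supplies semantics robust to appearance change.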
More features extracted from different modalities help to describe objects accurately and improve tracking performance. As mentioned above, most existing RGB-D trackers directly concatenate features extracted from the RGB and depth modalities, without adequately exploiting the correlation between the two. In our method, this strong correlation is learned by the RGB-Depth correlated CNN.
In the human visual system, geometrical and spatial structure information plays an important role in tracking objects. To derive this information more explicitly from depth data, we encode the depth image into three channels: horizontal disparity, height above ground, and angle with gravity, using the encoding approach proposed in [39]. The three-channel image is then fed into the depth-specific CNN to extract deep depth features.
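The paper uses the encoding of [39]; the sketch below is a simplified NumPy approximation we provide for illustration only. The focal length `f` and principal-point row `cy` are hypothetical camera intrinsics, and the height and surface-normal estimates are crude stand-ins for the full geocentric encoding:

```python
import numpy as np

def encode_depth_three_channel(depth, f=525.0, cy=None):
    """Encode a depth map into three channels: horizontal disparity,
    height above the lowest observed point, and angle of the surface
    normal with the gravity direction. Simplified stand-in for [39];
    f (focal length, pixels) and cy (principal-point row) are
    hypothetical intrinsics."""
    H, W = depth.shape
    # Channel 1: horizontal disparity (inverse depth).
    disparity = 1.0 / np.maximum(depth, 1e-6)
    # Channel 2: height -- camera-frame Y coordinate, offset so the
    # lowest observed point sits at zero (crude ground estimate).
    rows = np.arange(H, dtype=float)[:, None] - (cy if cy is not None else H / 2.0)
    y_cam = rows * depth / f
    height = y_cam.max() - y_cam
    # Channel 3: angle between the surface normal (estimated from
    # depth gradients) and an assumed gravity direction along Y.
    dzdy, dzdx = np.gradient(depth)
    n = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    angle = np.arccos(np.clip(n[..., 1], -1.0, 1.0))
    return np.stack([disparity, height, angle], axis=-1)
```

The resulting three-channel image can be fed to a depth-specific CNN just like an RGB image, which is the design choice the encoding enables.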
The optical flow image is computed for every frame and fed to the motion-specific CNN to learn deep motion features, which capture high-level information about the movement of the object. A pretrained optical flow network provided by [13] is used as our motion-specific CNN; it is pretrained on the UCF101 dataset and includes five convolutional layers.
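The paper feeds precomputed optical flow to a pretrained CNN. As a self-contained illustration of what an optical flow input encodes, here is a minimal dense Lucas-Kanade estimator in NumPy (our sketch, not the network from [13]):

```python
import numpy as np

def lucas_kanade_flow(I0, I1, win=5):
    """Dense Lucas-Kanade optical flow between two grayscale frames.
    Returns per-pixel displacement (u, v) along columns and rows."""
    Iy, Ix = np.gradient(I0)   # spatial gradients (row, col order)
    It = I1 - I0               # temporal derivative
    H, W = I0.shape
    r = win // 2
    u = np.zeros((H, W))
    v = np.zeros((H, W))
    for i in range(r, H - r):
        for j in range(r, W - r):
            ix = Ix[i - r:i + r + 1, j - r:j + r + 1].ravel()
            iy = Iy[i - r:i + r + 1, j - r:j + r + 1].ravel()
            it = It[i - r:i + r + 1, j - r:j + r + 1].ravel()
            A = np.stack([ix, iy], axis=1)
            ATA = A.T @ A
            if np.linalg.det(ATA) > 1e-6:  # skip ill-conditioned windows
                u[i, j], v[i, j] = -np.linalg.solve(ATA, A.T @ it)
    return u, v
```

On a frame pair where the scene shifts one pixel to the right, the recovered flow field is approximately (u, v) = (1, 0), which is exactly the kind of dynamic signal that complements static RGB and depth features.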
Thus far, we have obtained multimodal deep features representing the rich information of the object: RGB, horizontal disparity, height above ground, angle with gravity, and motion. Next, we explore how to fuse these multimodal deep features within the CNN. To this end, we conducted extensive experiments evaluating different fusion schemes, each fusing the multimodal deep features at a different layer. Inspired by the working mechanism of the human visual cortex, which suggests that features should be fused at a high level, we tested fusing at several relatively high layers, such as pool5, fc6, and fc7, and found that fusing the multimodal deep features at fc6 and fc7 obtains better performance.
Let x_m(p) denote the feature map of modality m at spatial position p, with m ∈ {1, 2, ..., M}. In our paper, M = 4, as we adopt the feature maps from Conv3-3 and Conv5-3 of the RGB, depth, RGB-depth correlated, and motion modalities. The fused feature map x_fusion is the weighted sum of the feature maps over the four modalities:

x_fusion(p) = \sum_{m=1}^{M} w_m(p) x_m(p),

where the weights w_m(p) are normalized so that \sum_{m=1}^{M} w_m(p) = 1 at every position p.
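The weighted sum above can be sketched in a few lines of NumPy. Since the exact weight formula is not given here, we assume for illustration a softmax normalization of per-position fusion scores (a hypothetical choice, not the paper's):

```python
import numpy as np

def fuse_feature_maps(maps, scores):
    """Position-wise weighted fusion of M modality feature maps.

    maps:   (M, H, W, C) array of per-modality deep features
    scores: (M, H, W) array of unnormalized fusion scores; a softmax
            over the modality axis yields weights w_m(p) that sum to
            one at every spatial position p.
    """
    e = np.exp(scores - scores.max(axis=0, keepdims=True))
    w = e / e.sum(axis=0, keepdims=True)       # (M, H, W) weights
    return (w[..., None] * maps).sum(axis=0)   # (H, W, C) fused map
```

With uniform scores this reduces to a plain average of the four modality maps; non-uniform scores let the fusion favor whichever modality is most informative at each position.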

C-COT Tracker.
The fused multimodal deep features are sent to the C-COT tracker, which was proposed in [26].
A brief review of the C-COT tracker is provided in the remainder of this section. For convenience, we use the same symbols as in [26]; more detailed descriptions and proofs can be found there. For each feature channel d, C-COT transfers the fused feature map x^d ∈ R^{N_d} to the continuous spatial domain t ∈ [0, T) by defining the interpolation operator J_d : R^{N_d} → L^2(T) as

J_d\{x^d\}(t) = \sum_{n=0}^{N_d - 1} x^d[n] \, b_d\!\left(t - \frac{T}{N_d} n\right),   (3)

where b_d is an interpolation kernel. The convolution operator applies a continuous filter f^d to each interpolated channel and sums over the D channels:

S_f\{x\} = \sum_{d=1}^{D} f^d * J_d\{x^d\}.   (4)

The filter is learned by minimizing the objective function

E(f) = \sum_{j=1}^{m} \alpha_j \left\| S_f\{x_j\} - y_j \right\|^2 + \sum_{d=1}^{D} \left\| w f^d \right\|^2,   (5)

where \alpha_j ≥ 0 weights training sample x_j, y_j is the desired convolution output, and w is a spatial regularization penalty. Equation (5) can be minimized in the Fourier domain to learn the filter; applying Parseval's formula to (5) gives

E(f) = \sum_{j=1}^{m} \alpha_j \left\| \hat{S}_f\{x_j\} - \hat{y}_j \right\|^2 + \sum_{d=1}^{D} \left\| \hat{w} * \hat{f}^d \right\|^2.   (6)

The desired convolution output y_j is a periodically repeated Gaussian centered at the target location u_j; its Fourier coefficients are given by

\hat{y}_j[k] = \frac{\sqrt{2\pi}\,\sigma_y}{T} \exp\!\left(-2\left(\frac{\pi \sigma_y k}{T}\right)^{\!2}\right) \exp\!\left(-i \frac{2\pi}{T} u_j k\right).   (7)
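The essence of minimizing an objective like (5) in the Fourier domain can be shown with a single-channel, fully discrete analogue (a MOSSE-style ridge regression, not the continuous multi-channel C-COT itself; the scalar `lam` stands in for the spatial penalty term):

```python
import numpy as np

def learn_filter_fourier(samples, labels, lam=1e-2):
    """Learn a single-channel correlation filter in the Fourier domain:
    a discrete, one-channel analogue of the C-COT objective that maps
    each training sample to its desired Gaussian response."""
    num = np.zeros(samples[0].shape, dtype=complex)
    den = np.zeros(samples[0].shape, dtype=complex)
    for x, y in zip(samples, labels):
        X, Y = np.fft.fft2(x), np.fft.fft2(y)
        num += np.conj(X) * Y  # correlation with the desired output
        den += np.conj(X) * X  # spectral energy of the samples
    return num / (den + lam)   # lam plays the role of the penalty term

def detect(F, x):
    """Correlation response of Fourier-domain filter F on sample x."""
    return np.real(np.fft.ifft2(F * np.fft.fft2(x)))
```

Because the filter is learned and applied per frequency bin, a circular shift of the input shifts the response peak by the same amount, which is how the tracker localizes the target in new frames.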

Experimental Results
The experiments are conducted on two challenging benchmark datasets: the BTB dataset [17] with 36 videos and the PTB dataset [7] with 100 videos. The proposed RGB-D tracker is implemented on the MATLAB R2016b platform with the Caffe toolbox [40], on a PC with an Intel(R) Core(TM) i7-4712MQ CPU @ 3.40 GHz, 16 GB of memory, and a TITAN GPU (12 GB memory). To intuitively show the contribution of fusing deep depth features and deep motion features for RGB-D tracking, we present below a selection of results on the BTB and PTB datasets in which using only deep RGB features gives unsatisfactory performance. Tracking results using only deep RGB features are shown in yellow, results fusing deep RGB and depth features in blue, and results adding deep motion features to RGB and depth in red.
In Figure 2, the major challenge of the athlete move video is that the target object moves quickly from left to right. As illustrated in Figure 3, the cup is fully occluded by the book. The zballpat no1 sequence is challenging due to the change of moving direction (Figure 4). The deep motion features improve tracking performance in these cases because they exploit the motion patterns of the target object.
Figure 5 shows the success rate (SR) comparison of different trackers on the PTB dataset. The overall SR of our tracker is 87%; the SR is 86% when the object moves fast and 84% when the motion type is active. These values are clearly higher than those of the other trackers, especially when the object moves fast.
Figure 6 shows the AUC comparison on the BTB dataset. The overall AUC of our tracker is 9.30; the AUC is 9.84 when the camera is stationary and 8.27 when the camera is moving.
From Figures 5 and 6, we can see that our tracker obtains the best performance, especially when the object moves fast or the camera is moving. These results show that deep motion features help to improve tracking performance.

Conclusion
We study the problems of existing visual object tracking algorithms and find that, because they do not fuse deep RGB, depth, and motion information, existing trackers cannot benefit from the geometrical features extracted from the depth image or from inter-frame dynamic information. We propose the MMDFF model to solve these problems. In this model, RGB, depth, and motion information are fused at multiple layers via CNNs, and the correlated relationship between the RGB and depth modalities is exploited by the RGB-Depth correlated CNN. The experimental results show that deep depth features and deep motion features provide information complementary to RGB data, and that the fusion strategy significantly improves tracking performance, especially when occlusion occurs, the movement is fast, the motion type is active, or the target size is small.

Figure 2: The comparison results on athlete move video from BTB dataset.

Figure 3: The comparison results on cup book video from PTB dataset.

Figure 4: The comparison results on zballpat no1 video from PTB dataset.

Figure 5: The comparison results of SR on the PTB dataset.

Figure 6: The comparison results of AUC on the BTB dataset.

Table 1: SR comparison using different deep feature fusions on the PTB dataset.