In recent years, visual object tracking has become a very active research field, which is mainly divided into correlation filter-based tracking and deep learning-based tracking (e.g., deep convolutional neural networks and Siamese neural networks). Target tracking algorithms based on deep learning require a large amount of computation and are usually deployed on expensive graphics cards. However, for the abundant monitoring devices in the Internet of Things, it is difficult to capture all moving targets on every device in real time, so hierarchical processing is necessary: tracking based on correlation filtering is used in insensitive areas to alleviate the local computing pressure, while in sensitive areas the video stream is uploaded to a faster cloud computing platform running an algorithm based on deep features. In this paper, we mainly focus on correlation filter-based tracking, in which the discriminative scale space tracker (DSST) is one of the most popular and typical methods and has been successfully applied in many fields. However, some aspects of DSST still need further study. One is that the algorithm does not explicitly consider target rotation. The other is the very heavy computational load of extracting histogram of oriented gradient (HOG) features from the many patches centered at the target position that are needed to ensure scale estimation accuracy. To address these two problems, we introduce an alterable patch number for target scale tracking and space searching for target rotation tracking into the standard DSST tracking method, and propose a visual object multimodality tracker based on correlation filters (MTCF) to simultaneously cope with in-plane translation, scale, and rotation of the tracked target and to obtain the target's position, scale, and attitude angle at the same time.
Finally, experiments on the Visual Tracker Benchmark data set demonstrate the effectiveness of the proposed algorithm in multimodality tracking.
This work was supported by the National Natural Science Foundation of China (61772575), the National Key R&D Program of China (2017YFB1402101), and Minzu University of China.

1. Introduction
Visual object tracking (VOT), a subfield of computer vision, is the process of continuously estimating the target state through a video image sequence. In recent years, VOT has become a very active research domain due to its extensive applications in fields such as intelligent surveillance [1], automatic driving [2], and traffic flow monitoring [3], to name a few.
In fields such as security monitoring and control, the traditional network architecture struggles with network delay and security reliability, and thus edge computing technology was born. Tasks with different attributes can be passed to different levels for processing. Zhan [4] shows that the first few feature extraction layers can run on the edge device while the others run in the cloud, and Gao [5, 6] divides tasks into different levels according to the business applications, using edge devices at one level.
As Figure 1 shows, for nonsensitive areas, video streams with lower resolutions can be processed on the local device; in medium areas, ordinary-resolution video streams can be processed on the edge device; and in high-risk areas, high-resolution video streams can be processed on the core cloud server, thereby reducing network bandwidth and improving the overall operating efficiency of the system. This article mainly explores the processing of video streams on edge clouds, and the tracking algorithm used is based on correlation filtering.
End-edge-cloud computing graphic.
A number of robust tracking algorithms have been proposed and developed to deal with the problems resulting from occlusion, illumination variation, background clutter, motion blur, and so on [7–14]. These algorithms are divided into the deep learning-based category (DLC) and the correlation filter- (CF-) based category [9].
For DLC, since the papers by Geoff Hinton et al. [15, 16] were published, deep learning has become especially popular in the context of deep neural networks and has achieved impressive success in many applications, especially feature extraction in computer vision. Inspired by such success, various deep learning-based trackers [13, 14, 17–22] have been proposed and developed to cope with the problems encountered in tracking. Although most trackers based on deep neural networks have demonstrated the potential to significantly improve tracking performance, as verified in world VOT competitions [17], there are still some obvious limits. For example, there are few or even no training data available for the tracker, because the prior information of the tracked object, or the object bounding box, is usually available only in the first frame. Even if offline pretraining is employed to learn target features and construct a feature set covering many targets, it is quite possible that a particular tracked object has features not contained in that set. Nowadays, zero-shot and one-shot learning, as well as the Siamese region proposal network, may be the most effective measures to cope with these problems [23–27].
Correlation filter-based tracking is another solution. From its beginning with the minimum output sum of squared error (MOSSE) method [10] to the discriminative scale space tracker (DSST) method [12], continuous improvements have given CF-based tracking some highlighted strengths, such as a low computational load, robustness to target appearance variations, and high tracking accuracy. However, some aspects of CF-based tracking still need further study. One is that the algorithms do not explicitly consider target rotation. The other is that the real-time property of DSST cannot always be ensured, because extracting histogram of oriented gradient (HOG) features from the many patches centered at the target position needed for accurate scale estimation imposes a very heavy computational load. This inspired an idea: on the premise of ensuring tracking accuracy, appropriately decrease the number of patches in order to save time for introducing target rotation into DSST and form a multimodality tracking. The tracker should then simultaneously cope with in-plane translation, scale, and rotation of the tracked target, which leads us to propose visual object multimodality tracking based on correlation filters (MTCF) to solve these two problems and, at the same time, obtain the target's position, scale, and attitude angle.
In this paper, we design a correlation filter-based tracker that tracks the target accurately and robustly at a speed of at least 25 frames per second while also tracking the rotation of the target.
2. Related Materials
In this section, centering on CF-based tracking, we briefly review some relevant research works that have contributed to CF-based tracking, to highlight our motivations.
The MOSSE method is regarded as the earliest real-time CF-based tracker [10]; it is an improved version of the average synthetic exact filter (ASEF) [29], which was trained offline to detect objects. The MOSSE tracker is strongly robust to target appearance and environment changes and achieves a very fast tracking speed. This is because the correlation of images in the time domain is transformed into multiplication in the frequency domain, which greatly reduces the computational complexity and load. However, the MOSSE method uses only grayscale samples to train the CF and mainly focuses on translation without considering scale and rotation.
Based on MOSSE, the circulant structure kernel (CSK) method [30] constructs a circulant matrix of training data by cyclically shifting the target window, maintaining dense sampling around the target rather than random sampling. In addition, the CSK method maps the ridge regression of linear space to a nonlinear space through a kernel function and simplifies the calculation by solving a dual problem in the nonlinear space, avoiding an inverse matrix operation, which reduces the computational complexity and improves the tracking speed. The kernelized correlation filter (KCF) method [11] is an improved version of CSK. It introduces multichannel HOG features into CSK to enhance the feature representation ability and significantly improve the tracking performance. Nevertheless, the KCF method has a major imperfection: it is not robust to the scale variation of the target. In addition, for KCF-based tracking, the authenticity of negative samples decreases as the cyclic displacement increases, so the tracker is trained on a portion of unreal samples. To address this issue, Danelljan et al. [18] introduce a spatial regularization term in the goal function of the KCF-based tracker to penalize the filter coefficients near the margins of the bounding box. Based on [18], Dai et al. [28] propose a novel adaptive spatially regularized CF that learns more reliable filter coefficients by fully exploiting the diversity information of different objects in different frames during tracking. However, just as standard KCF-based trackers do, these two trackers are still not robust to the scale variation of the target.
The DSST tracker [12, 31] addresses the scale adaptation problem using a multiscale searching strategy. It divides tracking into translation prediction and scale prediction. Firstly, translation prediction is performed by applying a standard translation filter to the current frame to get the position of the target. Secondly, the target size is estimated by employing a trained scale filter at the target location obtained from the translation filter. The translation filter and scale filter are two independent filters, both based on MOSSE. Although the DSST tracker has improved tracking performance and is robust to target scale variation, some obvious limitations remain. One is that DSST does not explicitly consider target rotation, which has a strong negative impact on the tracking performance. The other is that much of the operation time spent sampling so many patches centered on the target location is not necessary for guaranteeing the tracking accuracy, yet it limits the tracking speed.
Besides the tracking method, the features of the tracked target are also a key component of a tracker and have a strong influence on the tracking performance. Generally speaking, the richer the features, the better the performance of the tracker. The simplest feature is the intensity matrix of the search image, which is used in MOSSE [10]. SIFT features [32] and HOG features [33] were used in object tracking afterwards. In recent years, deep features [34] have been widely used in object tracking. In this paper, HOG features combined with grayscale features, rather than deep features, are adopted because our focus is on CF-based tracking. We do not adopt SIFT because SIFT is scale-invariant and we need to explicitly capture the size change of the object.
Summarizing the analysis above, we propose MTCF to alleviate the imperfections of the relevant CF-based trackers stated above. Aiming to track the target accurately and robustly at a speed of at least 25 frames per second for practical visual object tracking, MTCF consists of 4 tasks. Firstly, based on the standard CF-based translation tracker, determine the target location in the current frame. Secondly, based on DSST, sample several patches (with an alterable number of patches) at different resolutions, centered at the tracked target location determined by the translation CF; compute the feasible scales for the patches; and seek an optimal decision policy to find the final scale among the feasible scales. Thirdly, based on the standard CF-based translation tracker, design a rotation tracker using space searching. Lastly, integrate the previous 3 tasks to form MTCF.
3. Methodology of the Tracking Design

3.1. Variable Symbols Used in This Paper
In this paper, f denotes the "feature" of one image patch cropped with a specific bounding box, h denotes the correlation filter, and g denotes the response map of the correlation. In this way, ftrans,i denotes the feature of the ith frame used to correlate with the translation filter htrans, yielding the translation response map gtrans,i.
Similarly, si denotes the scale of the target obtained by the tracker after the ith frame, and ri denotes the rotation angle of the target after tracking the ith frame.
By the convolution theorem, correlation in the spatial domain can be transformed into an element-wise operation in the frequency domain, which dramatically reduces the correlation computation load. Thus, for computational efficiency, the correlation operation is carried out with the Fast Fourier Transform (FFT). Let the uppercase variables be the Fourier transforms of their lowercase counterparts, i.e., Ftrans,i, Gtrans,i, and Htrans corresponding to ftrans,i, gtrans,i, and htrans, respectively.
3.2. Standard Translation Tracker Based on Correlation Filter
As Figure 2 shows, given a video sequence, draw a rectangular bounding box (the red one, nearly the same size as the target) around the target in the first frame and extract a feature map $f_{\mathrm{trans},1}$ from the chosen region (the green rectangle, twice the size of the red one). Then train a correlation filter $h_{\mathrm{trans}}$ whose correlation with $f_{\mathrm{trans},1}$ yields an ideal response $g_{\mathrm{trans},1}$. In the following frames, correlate $h_{\mathrm{trans}}$ with the feature map extracted from the chosen region to get a response map $g_{\mathrm{trans},i}$:

$$g_{\mathrm{trans},i} = f_{\mathrm{trans},i} \ast h_{\mathrm{trans}}, \tag{1}$$

where $\ast$ denotes the correlation (convolution) operation.
Correlation filter-based tracking diagram (the two frames are from sequence “Trellis” in [35]).
In a normal tracking process, there should be one peak in the response map, and the peak position is considered the center of the target (in this sense tracking executes). The key to tracking is to find a robust feature extractor and to maintain the correlation filter htrans against a variety of adverse effects, such as target appearance transformation and occlusion, using an appropriate updating strategy.
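As an illustrative sketch (not the authors' code), the frequency-domain correlation of equation (1) and the peak search can be written in a few lines of Python with NumPy; `correlation_response` and `peak_position` are hypothetical helper names, and the filter is assumed to be stored as its conjugate spectrum.

```python
import numpy as np

def correlation_response(feature, conj_filter_freq):
    """Correlate a spatial feature map with a trained filter in the
    frequency domain: G = F ⊙ H*, then transform back; both inputs
    share the same 2-D shape."""
    F = np.fft.fft2(feature)
    return np.real(np.fft.ifft2(F * conj_filter_freq))

def peak_position(response):
    """The peak of the response map is taken as the new target centre."""
    return np.unravel_index(np.argmax(response), response.shape)
```

For a filter trained on the template itself, the response reduces to an autocorrelation and peaks at zero displacement, which is the sanity check usually used for this step.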
3.3. Scale Tracker Based on Correlation Filter
Differing from the original DSST, the number of scales $S$ (i.e., the number of image patches) in this paper is an optional positive integer determined by the trade-off between tracking speed and tracking accuracy (a smaller $S$ is selected if tracking speed takes priority over tracking accuracy, and vice versa). Let $M \times N$ be the shape of the target, and construct image patches centered on the target position $p_i$ with different scales to form an image patch set

$$B_{\mathrm{scale},i} = \left\{ \beta^{n} M \times \beta^{n} N \;\middle|\; n = 0, \pm 1, \pm 2, \ldots, \pm \mathrm{round}\!\left(\tfrac{S}{2}\right) \right\}, \tag{2}$$

where $\beta$ is the scale step. Resize each $\beta^{n} M \times \beta^{n} N$ patch from $B_{\mathrm{scale},i}$ to the same shape to form a bounding box set $\mathrm{patch}_{\mathrm{scale},i}$. As Figure 3 shows, instead of extracting one feature map from a bounding box with a fixed scale, the tracker extracts a feature map $f_{\mathrm{scale},i}^{n}$ for each patch in $\mathrm{patch}_{\mathrm{scale},i}$ (the number of feature maps is 33 in Figure 3). Each feature map $f_{\mathrm{scale},i}^{n}$ is flattened into a vector, and all these vectors are combined into a feature matrix $f_{\mathrm{scale},i}$. A scale correlation filter is then designed to correlate with $f_{\mathrm{scale},i}$, and the scale at which the maximum response occurs is the predicted scale of the target.
Scale tracking diagram.
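The scale pyramid of equation (2) can be sketched as follows; `scale_factors` is a hypothetical helper name, and the default β = 1.02 is the value used in the experiment section.

```python
import numpy as np

def scale_factors(S, beta=1.02):
    """Scale factors β^n for n = 0, ±1, ..., ±round(S/2), as in
    equation (2); the S patches of size β^n M x β^n N are afterwards
    resampled to a common shape before feature extraction."""
    half = int(round(S / 2))
    exponents = np.arange(-half, half + 1)
    return beta ** exponents

factors = scale_factors(33)  # 33 patches, as in the original DSST setting
```

The middle factor is exactly 1 (the current scale), with geometrically spaced factors on either side.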
3.4. Rotation Tracker Based on Correlation Filter
The target may rotate during tracking, so we use a rotated bounding box centered on the target to crop every frame. As shown in Figure 4, let $p_i$ be the center of the target and $r$ the current rotation angle of the target, and let $B_{\mathrm{rotate},i}^{r}$ denote the bounding box with rotation $r$ in frame $i$. By rotating around the target center, we construct a set of bounding boxes

$$B_{\mathrm{rotate},i} = \left\{ B_{\mathrm{rotate},i}^{r-\Theta}, \ldots, B_{\mathrm{rotate},i}^{r-\theta}, B_{\mathrm{rotate},i}^{r}, B_{\mathrm{rotate},i}^{r+\theta}, \ldots, B_{\mathrm{rotate},i}^{r+\Theta} \right\}, \tag{3}$$

all with the same size as $B_{\mathrm{rotate},i}^{r}$; here $\Theta$ is the given maximum rotation angular displacement and $\theta$ is the rotation step.
Bounding boxes rotated around the target center $p_i$.
For each bounding box $B_{\mathrm{rotate},i}^{rot}$ in $B_{\mathrm{rotate},i}$, extract the feature map $f_{\mathrm{rotate},i}^{rot}$ and correlate it with the rotation correlation filter to get a maximum response value, where

$$rot \in \{ r - \Theta, \ldots, r + \Theta \}. \tag{4}$$
Compare those values and take the angle with the largest one as the predicted rotation angle, which becomes the tracking result $r_i$ of frame $i$:

$$r_i = \underset{rot \in \{ r-\Theta, \ldots, r+\Theta \}}{\arg\max}\; \max g_{\mathrm{rotate},i}^{rot}. \tag{5}$$
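The angle search of equations (3)–(5) can be sketched as below; `candidate_angles` and `best_angle` are hypothetical helper names, and the per-angle response maps are assumed to come from the shared rotation filter.

```python
import numpy as np

def candidate_angles(r, theta, Theta):
    """Rotation candidates r - Θ, ..., r - θ, r, r + θ, ..., r + Θ
    (equation (3))."""
    offsets = np.arange(-Theta, Theta + theta, theta)
    return r + offsets

def best_angle(angles, responses):
    """Pick the angle whose rotated patch yields the highest peak
    correlation response (equation (5)); `responses` holds one response
    map per candidate angle."""
    peaks = [resp.max() for resp in responses]
    return angles[int(np.argmax(peaks))]
```

With θ = 20° and Θ = 40°, this enumerates the five candidates used in the Figure 9 example.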
In addition to the methods we used here, we also envisioned the “1-dimensional correlation rotation tracking” in the Supplementary Materials. However, after testing, it shows that this method requires too much calculation and is not suitable for use at edge nodes.
3.5. Multimodality Tracking Based on Correlation Filter
Integrate the translation, scale, and rotation trackers stated in the previous section to form MTCF, whose iteration at the ith frame is briefly outlined below, given the parameters obtained in the (i−1)th frame: target position pi−1, translation filter htrans, scale filter hscale, scale si−1, rotation filter hrotate, and rotation angle ri−1.
3.5.1. Translation Estimation
Construct bounding box Btrans,i with the scale si−1, centered at pi−1 in the ith frame.
Extract feature map ftrans,i from Btrans,i.
Calculate the correlation map gtrans,i using ftrans,i and htrans.
Obtain the new target position pi corresponding to the position where the largest correlation value of gtrans,i takes place.
3.5.2. Scale Estimation
Construct image patches patchscale,i of different scales centered on the target position pi in the ith frame
Extract feature maps fscale,i from the image patches patchscale,i, flatten each feature map into a vector, and then combine those vectors to form a feature matrix fscale,i
Calculate the correlation map gscale,i using fscale,i and hscale
Update the target scale with the optimal s corresponding to the position where the largest-scale correlation value is located
3.5.3. Rotation Estimation
Construct image patches patchrotate,i from the bounding box set Brotate,i centered on target position pi with rotation angle ri−1
Extract feature maps frotate,irot for every patch from patchrotate,i
For every feature map frotate,irot, make the correlation with the original rotation filter, and get a maximum response value scorerotate,i
Update the target rotation angle ri with the optimal rot corresponding to the best scorerotate,irot
3.5.4. Model Update
Construct the bounding box Btrans,i centered on target position pi = (xi, yi) with scale si and rotation angle ri
Extract ftrans,i, fscale,i, and frotate,i
Update translation model
Update scale model
Update rotation model
3.5.5. Keep Tracking
Output the tracking results of the ith frame and return to the next frame tracking.
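The per-frame procedure of Sections 3.5.1–3.5.5 can be summarized as a structural sketch; the tracker objects and their method names here are hypothetical stand-ins, not the authors' implementation.

```python
def mtcf_step(frame, state, trans, scale, rotate):
    """One MTCF iteration: translation, then scale, then rotation
    estimation, followed by an update of all three filter models."""
    p = trans.locate(frame, state)        # 3.5.1: new position p_i
    s = scale.estimate(frame, p, state)   # 3.5.2: new scale s_i
    r = rotate.estimate(frame, p, state)  # 3.5.3: new rotation angle r_i
    for model in (trans, scale, rotate):
        model.update(frame, p, s, r)      # 3.5.4: learning-rate update
    return {"p": p, "s": s, "r": r}       # 3.5.5: results for frame i
```

The ordering matters: scale estimation uses the freshly estimated position, and rotation estimation uses both the new position and the new scale, matching the sequence in the text.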
4. MTCF: The Entire Model

4.1. Translation Tracking Procedure
The simplest correlation-based tracking only focuses on translation of the target. In the first frame, we label a rectangular region Btrans,1 centered on the target. So, the tracker can extract the feature map of target appearance. The feature map must maintain a spatial mapping because the tracker uses the position where the maximum response happens to predict new target position.
The simplest feature map is the gray intensity matrix extracted from a specific region (for example, Btrans,1) of the original frame. Many researchers use a 2-dimensional Hanning window (see Figure 5) to preprocess the primitive intensity matrix. After being processed by the Hanning window, the intensity matrix emphasizes the central region of the target and weakens the background information near the bounding box edge. Because the bounding box in the first frame is drawn tightly around the target, the tracker may lose some features and behave unstably.
The 2-dimensional Hanning window (the shape is 40×30).
To address this issue, the simple way is to expand the search region. Define a parameter bb that determines how many times larger than Btrans,1 the search window is. As a rule, a greater bb contributes to extracting more features of the target and making the tracker more stable, but more time will be spent extracting features from the larger search region.
This is clearly demonstrated by "S1" presented in the Supplementary Materials. In this paper, we adopt a trade-off policy and select bb = 2.
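A minimal sketch of the windowing and the bb-expanded search region described above; `hann_window_2d` and `search_region_size` are hypothetical helper names.

```python
import numpy as np

def hann_window_2d(h, w):
    """2-D Hanning window as the outer product of two 1-D windows; it
    down-weights pixels near the bounding-box edge (Figure 5)."""
    return np.outer(np.hanning(h), np.hanning(w))

def search_region_size(target_h, target_w, bb=2):
    """Expand the search window to bb times the target size (bb = 2
    is the trade-off adopted in this paper)."""
    return bb * target_h, bb * target_w

win = hann_window_2d(30, 40)  # a 40 x 30 window in the paper's notation
```

The window is applied by element-wise multiplication with the intensity patch before the FFT, so the values fall to zero exactly at the patch border.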
From now on, we will use Btrans,1 to represent the translation search windows.
From Btrans,1 we get ftrans,1, and because we need to train the translation correlation filter htrans, an initial gtrans,1 is required. In prior papers, most researchers take a Gauss-shaped response map as the initialization:

$$g_{\exp} = e^{-d/\sigma}, \quad d = \sqrt{(x - x_{\mathrm{center}})^2 + (y - y_{\mathrm{center}})^2}, \quad \sigma = 1, \quad (x, y) \in B_{\mathrm{trans},1}, \tag{6}$$

and Figure 6 shows an example of gtrans,1.
The Gauss-shaped response map (σ=1, shape of B is 40×30).
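A sketch of the Gauss-shaped ideal response of equation (6), with the peak at the patch centre; `gaussian_response` is a hypothetical helper name.

```python
import numpy as np

def gaussian_response(h, w, sigma=1.0):
    """Ideal response g = exp(-d / sigma), where d is the Euclidean
    distance to the patch centre (equation (6))."""
    y, x = np.mgrid[0:h, 0:w]
    d = np.sqrt((x - w // 2) ** 2 + (y - h // 2) ** 2)
    return np.exp(-d / sigma)
```

With σ = 1 the peak is very sharp, which forces the trained filter to localize the target tightly.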
Though the intensity feature is computationally cheap, it is unstable because it exploits only a small amount of the information in the frame. Recently, many deep features (for example, convolutional neural network features) have been introduced into object tracking and perform well in accuracy and robustness. However, they are too computationally expensive, so in this paper we focus on the FHOG [36] feature.
We use an FHOG feature extractor to get the feature map ftrans,i. In the translation step, 27 dimensions of FHOG and 1 dimension of the intensity feature are taken into account. Following DSST [12], discriminative correlation filters for multidimensional features are applied as follows.
Minimize the cost function

$$\varepsilon = \left\| \sum_{l=1}^{d} f^{l} \ast h^{l} - g \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}. \tag{7}$$
Here, g is the ideal response of the correlation between the feature map and the filter, and the parameter λ ≥ 0 is the regularization weight. In the FFT domain, the solution [12] can be written as

$$H^{l} = \frac{F^{l} \odot G^{*}}{\sum_{k=1}^{d} F^{k} \odot F^{k*} + \lambda}, \tag{8}$$

where * indicates complex conjugation, ⊙ denotes element-wise multiplication, and l ∈ {1, …, d} is the dimension index.
The translation filter Htrans can be solved as

$$H_{\mathrm{trans}}^{*} = \frac{G_{\mathrm{trans},i} \odot F_{\mathrm{trans},i}^{*}}{\sum_{l} F_{\mathrm{trans},i}^{l} \odot F_{\mathrm{trans},i}^{l*} + \lambda}. \tag{9}$$
Equation (9) is employed in offline learning to obtain the correlation filter. In practical tracking, the tracker (for example, MOSSE, KCF, or DSST) takes the target position in the (i−1)th frame as the center of the bounding box Bi in the ith frame, extracts the feature map ftrans,i from Bi, and then calculates the correlation map

$$g_{\mathrm{trans},i} = \mathcal{F}^{-1}\{ G_{\mathrm{trans},i} \} = \mathcal{F}^{-1}\{ F_{\mathrm{trans},i} \odot H_{\mathrm{trans},i-1}^{*} \} \tag{10}$$

to determine the new target position pi corresponding to the element with the maximum value in gtrans,i; here, $\mathcal{F}^{-1}$ indicates the inverse Fourier transform:

$$p_i = \underset{(k,l)}{\arg\max}\; g_{\mathrm{trans},i}(k, l). \tag{11}$$
Afterwards, reconstruct the bounding box Bi centered on the new target position pi, extract the feature map ftrans,i from it, and then update the translation CF to get H*trans,i. An iterative update of equation (9) is given by the following equations (12)–(15), according to [10, 12]:

$$G_{\mathrm{trans},i} = F_{\mathrm{trans},i} \odot H_{\mathrm{trans},i-1}^{*}, \tag{12}$$
$$A_{\mathrm{trans},i} = \eta\, G_{\mathrm{trans},i} \odot F_{\mathrm{trans},i}^{*} + (1 - \eta) A_{\mathrm{trans},i-1}, \tag{13}$$
$$D_{\mathrm{trans},i} = \eta\, F_{\mathrm{trans},i} \odot F_{\mathrm{trans},i}^{*} + (1 - \eta) D_{\mathrm{trans},i-1}, \tag{14}$$
$$H_{\mathrm{trans},i}^{*} = \frac{A_{\mathrm{trans},i}}{D_{\mathrm{trans},i}}, \tag{15}$$

where η is the learning rate.
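A sketch of the closed-form filter of equation (9) and the running-average update of equations (13)–(15); `train_conj_filter` and `update_filter` are hypothetical helper names, F has one leading channel axis, and the single-frame special case (η = 1) is assumed to reproduce the closed-form solution with λ → 0.

```python
import numpy as np

def train_conj_filter(F, G, lam=1e-2):
    """Conjugate filter of equation (9): H* = (G ⊙ F*) /
    (Σ_l F^l ⊙ F^l* + λ).  F has shape (d, h, w) and G has shape
    (h, w), both already in the frequency domain."""
    denom = np.sum(F * np.conj(F), axis=0).real + lam
    return G[None] * np.conj(F) / denom

def update_filter(A_prev, D_prev, F, G, eta=0.015):
    """Numerator/denominator running averages of equations (13)-(14);
    the updated filter is H* = A / D (equation (15))."""
    A = eta * (G[None] * np.conj(F)) + (1 - eta) * A_prev
    D = eta * np.sum(F * np.conj(F), axis=0).real + (1 - eta) * D_prev
    return A, D, A / D
```

Detection then multiplies F by the stored H* channel-wise, sums over channels, and inverse-transforms, as in equations (10)–(12); keeping A and D separately (instead of H*) is what makes the cheap per-frame update possible.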
4.2. Scale Tracking Procedure
As for scale tracking procedure, two methods are commonly used. One is called “exhaustive scale tracking” and the other is “1-dimensional correlation filter scale tracking.” In this paper, we use “1-dimensional correlation” method.
In the previous frame, we got the target position pi−1 and scale si−1.
Let M × N be the shape of the target, construct image patches centered on the target position pi according to the method presented in Section 3, and resize them to form a bounding box set patchscale,i. An FHOG extractor is applied to extract a feature map for each patch from the bounding box set patchscale,i. Each feature map is flattened into a vector, and all of these vectors are combined into an integrated feature matrix fscale,i. Estimating the target scale can then be solved by learning a separate 1-dimensional correlation filter hscale that correlates with fscale,i. The initial ideal response gscale,1 is a Gauss-shaped peak, as Figure 7 shows.
The 1-dimensional Gauss-shaped peak as the initial correlation response.
The scale with the largest correlation response value is taken as the optimal scale si.
Afterwards, extract the feature map fscale,i from patchscale,i centered on the new target position pi with the final scale si, and then update the scale correlation filter to get H*scale,i using equations (17)–(20).
In this process, set the parameter "spatial bin size" to 4 to save time in the next step and use all FHOG dimensions. So, the length of each of the S feature vectors is (M/4) × (N/4) × 31.
Estimating the target scale can be solved by learning a separate 1-dimensional correlation filter. Treat each flattened vector as one multidimensional feature sample, so that the S vectors form the 1-dimensional feature $f_{\mathrm{scale},i} = \left( f_{\mathrm{scale},i}^{\mathrm{sca}_1}, \ldots, f_{\mathrm{scale},i}^{\mathrm{sca}_S} \right)$:

$$g_{\mathrm{scale},i} = f_{\mathrm{scale},i} \ast h_{\mathrm{scale}}, \tag{16}$$
$$G_{\mathrm{scale},i} = F_{\mathrm{scale},i} \odot H_{\mathrm{scale},i-1}^{*}, \tag{17}$$
$$A_{\mathrm{scale},i} = \eta\, G_{\mathrm{scale},i} \odot F_{\mathrm{scale},i}^{*} + (1 - \eta) A_{\mathrm{scale},i-1}, \tag{18}$$
$$D_{\mathrm{scale},i} = \eta\, F_{\mathrm{scale},i} \odot F_{\mathrm{scale},i}^{*} + (1 - \eta) D_{\mathrm{scale},i-1}, \tag{19}$$
$$H_{\mathrm{scale},i}^{*} = \frac{A_{\mathrm{scale},i}}{D_{\mathrm{scale},i}}. \tag{20}$$
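The layout of the 1-dimensional scale sample described above can be sketched as follows; `scale_feature_matrix` and `best_scale_index` are hypothetical helper names.

```python
import numpy as np

def scale_feature_matrix(patch_features):
    """Flatten each per-scale FHOG map (shape M/4 x N/4 x 31) into a
    column and stack the S columns, giving the (M/4 * N/4 * 31) x S
    sample the 1-dimensional scale filter correlates with."""
    return np.stack([f.ravel() for f in patch_features], axis=1)

def best_scale_index(response_1d):
    """The scale where the largest 1-D correlation response occurs
    is taken as the optimal scale s_i."""
    return int(np.argmax(response_1d))
```

Each column plays the role of one "pixel" along the scale axis, so the scale search is a 1-D correlation over S positions rather than a 2-D one.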
Construct different groups containing different numbers of patches. The number of patches varies from small to large (e.g., from 10 to 55), and all patches are centered at the tracked target location determined by the translation CF in the current frame. Run the basic DSST [12] on the visual track data set [35], and for each group calculate the tracking speed and the tracking accuracy, characterized by the Euclidean distance between the tracking window center and the ground truth center. The experiment results are shown in Figure 8.
The tracking accuracy and speed results for groups with different scales.
From Figure 8, it can be seen that the tracking accuracy and speed differ with different numbers of patches: a larger number of patches corresponds to a lower tracking speed, and vice versa. Thus, on the premise of ensuring the tracking accuracy, the number of patches can be appropriately reduced to save time for introducing target rotation into DSST and forming a multimodality tracking. Experiments show that most of the time is spent in the feature extraction module.
4.3. Rotation Space Tracking Procedure
Set the target attitude to r = 0 in the first frame, and construct a set of bounding boxes Brotate,i as described in the previous section in each successive frame. An FHOG extractor is applied to extract a feature map for each patch from the bounding box set Brotate,i. Estimating the target rotation can be solved by learning a separate 1-dimensional correlation filter. Train a single rotation filter hrotate as the similarity function to compute the maximum correlation response for each feature map. The best tracking angle ri is then

$$r_i = \underset{rot}{\arg\max}\; \max g_{\mathrm{rotate},i}^{rot}. \tag{21}$$
Afterwards, extract the feature map frotate,i from the bounding box centered on the new target position pi with the final rotation angle ri, and then update the rotation correlation filter to get H*rotate,i using the following equations:

$$g_{\mathrm{rotate},i} = f_{\mathrm{rotate},i} \ast h_{\mathrm{rotate}}, \tag{22}$$
$$G_{\mathrm{rotate},i} = F_{\mathrm{rotate},i} \odot H_{\mathrm{rotate},i-1}^{*}, \tag{23}$$
$$A_{\mathrm{rotate},i} = \eta\, G_{\mathrm{rotate},i} \odot F_{\mathrm{rotate},i}^{*} + (1 - \eta) A_{\mathrm{rotate},i-1}, \tag{24}$$
$$D_{\mathrm{rotate},i} = \eta\, F_{\mathrm{rotate},i} \odot F_{\mathrm{rotate},i}^{*} + (1 - \eta) D_{\mathrm{rotate},i-1}, \tag{25}$$
$$H_{\mathrm{rotate},i}^{*} = \frac{A_{\mathrm{rotate},i}}{D_{\mathrm{rotate},i}}. \tag{26}$$
We take Figure 9 as an example to demonstrate the rotation angle search. As Figure 9 shows, r = 0°, θ = 20°, Θ = 40°, and rot ∈ {−40°, −20°, 0°, 20°, 40°}. Construct a set of bounding boxes Brotate,i with 5 patches, and train the rotation correlation filter hrotate using the samples in the first frame. Figure 10 shows the correlation response for each patch, where r + 20° corresponds to the highest response. Thus, r + 20° is the best predicted rotation angle in Figure 9, which demonstrates the effectiveness of the proposed rotation search method.
Exhaustive method for tracking rotation.
The correlation response with image patches cropped by different-rotation bounding box. (a) r + 40°, (b) r + 20°, (c) r, (d) r − 20°, and (e) r − 40°.
In this process, how to set the parameters θ and Θ is very important for good tracking performance, including both tracking speed and tracking accuracy. A greater Θ and a smaller θ contribute to good tracking accuracy, but much more time will be spent extracting features of the tracked target, which has a negative influence on tracking speed. This is clearly demonstrated by "S2" presented in the Supplementary Materials. As a rule, the parameters θ and Θ are fixed by experiments according to the requirements of the tracking tasks.
In this paper, we also adopt such a policy.
5. Experiment

5.1. Experiment Setup
In this paper, our method is implemented in MATLAB R2019a on a Windows 10 system. The experiments are conducted on a PC with an Intel Xeon 2.4 GHz CPU and 63.9 GB RAM. The data set is selected from the visual track data set [35]. Our experiment is divided into 3 groups with different parameters, which together verify our proposal: that, on the premise of ensuring tracking accuracy, appropriately decreasing the number of patches saves enough time to introduce target rotation into DSST and form a multimodality tracking; that our proposed rotation tracking algorithm is effective; and that our proposed visual object multimodality tracking algorithm based on correlation filters performs well overall.
In each group's experiment, sequences from the visual track data set are selected to contain target translation, scale, and rotation simultaneously. The number of scales S, the scale factor β, and the learning rate η are kept unchanged in each group and are fixed as (33, 1.02, 0.015) and (27, 1.0247, 0.015), respectively, which means that the maximum and minimum scale fields of the two groups are the same, as Figure 11 shows. We test the influence of different search window sizes bb in the Supplementary Materials.
β is a function of S to guarantee fixed maximum and minimum scale fields.
5.2. Experiment of the First Group
In this group's experiment, Θ is selected to be 10° and θ to be 5°. As a result, the tracking speed is 31 fps, and the experiment results are shown in Figures 12 and 13, consisting of some typical tracking frames.
The tracking results with tracking speed 31 fps. The red window represents that the target is tracked completely. (a) represents the original figure, (b) represents object translation, (c) represents object rotation, and (d) represents object scale.
The tracking results with tracking speed 29 fps. The red window represents that the target is tracked completely. (a) Rotation, (b) scale, (c) scale, and (d) translation.
From Figure 12, it can be seen that, on the premise of ensuring the tracking accuracy, appropriately decreasing the number of patches can save enough time to introduce target rotation into DSST to form a multimodality tracking, and that our proposed rotation tracking algorithm works well.
5.3. Rotation Tracking Performance Test
In this group's experiment, Θ is selected to be 12° and θ to be 4°. As a result, the tracking speed is 29 fps, and the experiment results are shown in Figure 13, consisting of some typical tracking frames.
In this group's experiment, the tracking speed is 29 fps, lower than in the first group, because the rotation step is selected to be 4°, which increases the number of bounding boxes in Brotate,i and thus the time spent extracting feature maps from them. But our proposed visual tracker still works well in tracking the target under translation, scale, and rotation, as shown in Figure 13. From this perspective, the rotation step can be appropriately decreased if tracking accuracy is preferred, and vice versa.
5.4. Multimodality Tracking Performance Test
In both groups of experiments, our proposed MTCF algorithm is performed on the Visual Tracker Benchmark data set [35] to demonstrate its multimodality tracking performance; the tracking results are shown in Figure 14.
Some tracking results using MTCF. The first five rows show that our tracker can track the target under scale variation, partial occlusion, and illumination variation, and the last row shows that the rotation tracking performance depends on the scale tracking.
From Figure 14, it can be seen that our proposed MTCF achieves good multimodality tracking performance, enabling us to obtain the position, scale, and attitude angle of the tracked target simultaneously.
The generalization ability of the algorithm remains at the same level as DSST and depends heavily on the HOG feature extraction.
6. Conclusion and Future Work
In this paper, on the premise of ensuring tracking accuracy, we introduce an alterable patch number for target scale tracking and a space search for target rotation tracking into the standard DSST tracking method, and propose a multimodality tracker, MTCF, which simultaneously copes with in-plane translation, scale, and rotation of the tracked target and obtains its position, scale, and attitude angle at the same time. Experimental results demonstrate that the proposed multimodal target tracking algorithm MTCF (1) reaches a tracking speed exceeding 25 fps, sufficient for practical visual object tracking, by appropriately decreasing the number of patches for target scale tracking and (2) obtains good tracking performance for translation, scale, and rotation simultaneously. In the future, our work will focus on the distributed hardware and software implementation of the proposed multimodal comprehensive tracking algorithm.
For terminal devices not equipped with GPUs, low-resolution video is used to reduce the computational cost of extracting target features. Edge devices with moderate computing capability are responsible for the main target tracking tasks. Finally, for the few critical, high-risk areas, the network bandwidth saved by the first two tiers is used to upload the video streams to the central cloud processor for computation, achieving coordinated hierarchical processing.
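The three-tier dispatch described above can be sketched as a simple routing policy. The tier names, sensitivity score, and thresholds below are illustrative assumptions rather than the paper's deployed system:

```python
def choose_tier(sensitivity, edge_capacity_free):
    """Route a camera stream to a processing tier (hypothetical policy).
    sensitivity: 0..1 risk score of the monitored area.
    edge_capacity_free: 0..1 fraction of free compute on the edge node."""
    if sensitivity >= 0.8:
        return "cloud"      # deep-feature tracker on the central cloud
    if edge_capacity_free > 0.2:
        return "edge"       # correlation-filter tracker (e.g., MTCF) on the edge
    return "terminal"       # low-resolution correlation filtering on-device

print(choose_tier(0.9, 0.5))  # cloud
print(choose_tier(0.3, 0.5))  # edge
print(choose_tier(0.3, 0.1))  # terminal
```

The design choice is that only streams from sensitive areas consume uplink bandwidth, while insensitive areas are handled locally by the lightweight correlation-filter tracker.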
Data Availability
All the source codes and related pictures will be uploaded to GitHub and will be available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61772575, National Key R&D Program of China under Grant 2017YFB1402101, and Independent Research Projects of Minzu University of China.
Supplementary Materials
S1: influence of different sizes of the searching window and analysis of the results. S2: 1-dimensional correlation rotation tracking, θ=30°.
References
[1] Liu G., Liu S., Muhammad K., Sangaiah A. K., Doctor F., "Object tracking in vary lighting conditions for fog based intelligent surveillance of public spaces," 2018, vol. 6, doi: 10.1109/access.2018.2834916.
[2] Nishida Y., Sonoda T., Yasukawa S., "Underwater platform for intelligent robotics and its application in two visual tracking systems," 2018, vol. 30, no. 2, pp. 238–247.
[3] De Bruin A., Booysen M. J., "Drone-based traffic flow estimation and tracking using computer vision," 2015.
[4] Zhang Z., Zhang X. Q., Zuo D. C., Fu G. D., "Research on target tracking application deployment strategy for edge computing," 2020, vol. 3, no. 1, p. 9.
[5] Gao H., Kuang L., Yin Y., Guo B., Dou K., "Mining consuming behaviors with temporal evolution for personalized recommendation in mobile marketing apps," 2020, vol. 25, no. 4, pp. 1233–1248, doi: 10.1007/s11036-020-01535-1.
[6] Ma X., Gao H., Xu H., Bian M., "An IoT-based task scheduling optimization scheme considering the deadline and cost-aware scientific workflow for cloud computing," 2019, vol. 2019, doi: 10.1186/s13638-019-1557-3.
[7] Sheng X., Liu Y., Liang H., Li F., Man Y., "Robust visual tracking via an improved background aware correlation filter," 2019, vol. 7.
[8] Dong X., Shen J., Yu D., Wang W., Liu J., Huang H., "Occlusion-aware real-time object tracking," 2016, vol. 19, no. 4, pp. 763–771.
[9] Fiaz M., Mahmood A., Javed S., Jung S. K., "Handcrafted and deep trackers," 2019, vol. 52, no. 2, pp. 1–44, doi: 10.1145/3309665.
[10] Bolme D., Beveridge J. R., Draper B. A., Lui Y. M., "Visual object tracking using adaptive correlation filters," Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2544–2550.
[11] Henriques J. F., Caseiro R., Martins P., Batista J., "High-speed tracking with kernelized correlation filters," 2014, vol. 37, no. 3, pp. 583–596.
[12] Danelljan M., Häger G., Khan F. S., Felsberg M., "Discriminative scale space tracking," 2016, vol. 39, no. 8, pp. 1561–1575.
[13] Tao R., Gavves E., Smeulders A. W. M., "Siamese instance search for tracking," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 1420–1429.
[14] Li B., Wu W., Wang Q., Zhang F., Xing J., Yan J., "SiamRPN++: evolution of siamese visual tracking with very deep networks," 2018.
[15] Hinton G. E., Salakhutdinov R. R., "Reducing the dimensionality of data with neural networks," 2006, vol. 313, no. 5786, pp. 504–507, doi: 10.1126/science.1127647.
[16] Hinton G. E., Osindero S., Teh Y.-W., "A fast learning algorithm for deep belief nets," 2006, vol. 18, no. 7, pp. 1527–1554, doi: 10.1162/neco.2006.18.7.1527.
[17] Li P., Wang D., Wang L., Lu H., "Deep visual tracking: review and experimental comparison," 2017, vol. 13, no. 2, pp. 117–126.
[18] Danelljan M., Häger G., Khan F. S., Felsberg M., "Learning spatially regularized correlation filters for visual tracking," Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 4310–4318.
[19] Yun S., Choi J., Yoo Y., Yun K., Choi J. Y., "Action-decision networks for visual tracking with deep reinforcement learning," Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 1349–1358.
[20] Bertinetto L., Valmadre J., Henriques J. F., Vedaldi A., Torr P. H. S., "Fully-convolutional siamese networks for object tracking," Proceedings of the Computer Vision–ECCV 2016 Workshops, Springer, 2016, pp. 850–865.
[21] Nam H., Han B., "Learning multi-domain convolutional neural networks for visual tracking," Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 4293–4302.
[22] Fan H., Ling H., "SANet: structure-aware network for visual tracking," Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE, May 2017, pp. 2217–2224.
[23] Al-Halah Z., Stiefelhagen R., "How to transfer? Zero-shot object recognition via hierarchical transfer of semantic attributes," 2015.
[24] Vinyals O., Blundell C., Lillicrap T., Wierstra D., "Matching networks for one shot learning," 2016, pp. 3630–3638.
[25] Dong X., Shen J., Wu D., Guo K., Jin X., Porikli F., "Quadruplet network with one-shot learning for fast visual object tracking," 2019, vol. 28, no. 7, pp. 3516–3527, doi: 10.1109/tip.2019.2898567.
[26] Zhang H., Ni W., Yan W., Wu J., Bian H., Xiang D., "Visual tracking using siamese convolutional neural network with region proposal and domain specific updating," 2018, vol. 275, pp. 2645–2655, doi: 10.1016/j.neucom.2017.11.050.
[27] Li B., Yan J., Wu W., Zhu Z., Hu X., "High performance visual tracking with siamese region proposal network," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8971–8980.
[28] Dai K., Wang D., Lu H., Sun C., Li J., "Visual tracking via adaptive spatially-regularized correlation filters," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4670–4679.
[29] Bolme D. S., Draper B. A., Beveridge J. R., "Average of synthetic exact filters," Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 2105–2112.
[30] Henriques J. F., Caseiro R., Martins P., Batista J., "Exploiting the circulant structure of tracking-by-detection with kernels," Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2012, pp. 702–715.
[31] Danelljan M., Häger G., Khan F., Felsberg M., "Accurate scale estimation for robust visual tracking," Proceedings of the British Machine Vision Conference (BMVC), BMVA Press, 2014.
[32] Zhou H., Yuan Y., Shi C., "Object tracking using SIFT features and mean shift," 2009, vol. 113, no. 3, pp. 345–352, doi: 10.1016/j.cviu.2008.08.006.
[33] Gárate C., Bilinsky P., Bremond F., "Crowd event recognition using HOG tracker," Proceedings of the 2009 Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, 2009, pp. 1–6.
[34] Danelljan M., Hager G., Shahbaz Khan F., Felsberg M., "Convolutional features for correlation filter based visual tracking," Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015.
[35] Wu Y., Lim J., Yang M.-H., "Online object tracking: a benchmark," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[36] Girshick R. B., Felzenszwalb P. F., McAllester D., "Discriminatively trained deformable part models," 2012.