Visual Object Multimodality Tracking Based on Correlation Filters for Edge Computing

tracker based on correlation filters (MTCF) to simultaneously cope with translation, scale, and rotation in plane for the tracked target and to obtain the target information of position, scale, and attitude angle at the same time. Finally, in Visual Tracker Benchmark data set, the experiments are performed on the proposed algorithms to show their effectiveness in multimodality tracking.


Introduction
Visual object tracking (VOT), a subfield of computer vision, is the process of continuously estimating the state of a target through a video image sequence. In recent years, VOT has become a very active research domain because of its extensive applications in fields such as intelligent surveillance [1], automatic driving [2], and traffic flow monitoring [3], to name a few.
In fields such as security monitoring and control, the traditional network architecture struggles to meet requirements on network delay, security, and reliability, which motivated edge computing. Tasks with different attributes can be passed to different levels for processing.
Zhan [4] shows that the first few feature extraction layers can run on the edge device while the remaining layers run on the cloud. Gao [5,6] divides tasks into different levels according to the business application and uses edge devices in one level.
As Figure 1 shows, for nonsensitive areas, video streams with lower resolutions can be processed on the local device; in medium-risk areas, ordinary-resolution video streams can be processed on the edge device; and in high-risk areas, high-resolution video streams can be processed on the core cloud server, thereby reducing network bandwidth and improving the overall operating efficiency of the system. This article mainly explores the processing of video streams on edge clouds, and the tracking algorithm used is based on filtering.
For DLC, since the papers by Geoff Hinton et al. [15,16] were published, deep learning has become especially popular in the context of deep neural networks and has achieved impressive success in many applications, especially feature extraction in computer vision. Inspired by this success, various deep-learning-based trackers [13,14,17-22] have been proposed to cope with the problems encountered in tracking. Although most trackers based on deep neural networks have shown potential for significantly improving tracking performance, as demonstrated in the world VOT competitions [17], some obvious limits remain. For example, little or no training data are available to the tracker because prior information about the tracked object, such as its bounding box, is usually available only in the first frame. Even if offline pretraining is employed to learn a feature set covering many targets, the tracker may well have to track a particular object whose features are not contained in that set. Nowadays, zero-shot and one-shot learning, as well as the Siamese region proposal network, may be the most effective measures to cope with these problems [23-27].
Correlation filter (CF)-based tracking is another solution. From the minimum output sum of squared error (MOSSE) method [10] to the discriminative scale space tracker (DSST) method [12], continuous improvements have given CF-based tracking notable strengths, such as low computational load, robustness to appearance variations of the target, and high tracking accuracy. However, CF-based tracking still needs further study in two respects. One is that the algorithms do not explicitly consider target rotation. The other is that the real-time property of DSST cannot always be ensured, because extracting histogram of oriented gradient (HOG) features from the many patches centered at the target position, which is done to ensure scale estimation accuracy, imposes a very heavy computational load. This inspired the following idea: on the premise of ensuring the tracking accuracy, appropriately decrease the number of patches, and spend the time saved on introducing target rotation into DSST to form a multimodality tracker. The tracker should then simultaneously cope with translation, scale, and in-plane rotation of the tracked target. This leads us to propose visual object multimodality tracking based on correlation filters (MTCF), which addresses these two problems and obtains the position, scale, and attitude angle of the target simultaneously.
In this paper, we design a correlation filter-based tracker that aims to track the target accurately and robustly at a speed of 25 frames per second while also tracking the rotation of the target.

Related Materials
In this section, centering on CF-based tracking, we briefly list the relevant research works that have contributed to it in order to highlight our motivations.
The MOSSE method is regarded as the earliest real-time CF-based tracker [28]; it is an improved version of the average synthetic exact filter (ASEF) [29], which is trained offline to detect objects. The MOSSE tracker is strongly robust to changes in target appearance and environment and achieves very fast tracking speed. This is because correlation of images in the spatial domain is transformed into element-wise multiplication in the frequency domain, which greatly reduces the computational complexity and load. However, the MOSSE method uses only grayscale samples to train the CF and mainly focuses on translation without considering scale and rotation.
Based on MOSSE, the circulant structure kernel (CSK) method [30] constructs a circulant matrix of training data by cyclically shifting the target window, which provides dense sampling around the target rather than random sampling. In addition, the CSK method maps ridge regression from linear space to nonlinear space through a kernel function and simplifies the calculation by solving a dual problem in the nonlinear space, which avoids matrix inversion, reduces the computational complexity, and improves the tracking speed. The kernelized correlation filter (KCF) method [11] is an improved version of CSK. It introduces multichannel HOG features into CSK to enhance the feature representation ability and to improve the tracking performance significantly. Nevertheless, the KCF method has a major imperfection: it is not robust to scale variation of the target. In addition, for KCF-based tracking, the authenticity of negative samples decreases as the cyclic displacement increases, so the tracker is trained partly on unrealistic samples. To address this issue, Danelljan et al. [18] introduce a spatial regularization term into the objective function of the KCF-based tracker to penalize the filter coefficients near the margins of the bounding box. Based on [18], Dai et al. [28] propose a novel adaptive spatially regularized CF that lets the tracker learn more reliable filter coefficients by fully exploiting the diversity of different objects in different frames during tracking. However, like the standard KCF-based trackers, these two trackers are still not robust to scale variation of the target.
DSST trackers [12,31] address the scale adaptation problem using multiscale searching strategies. DSST divides tracking into translation prediction and scale prediction. Firstly, translation prediction is performed by applying a standard translation filter to the current frame to get the position of the target. Secondly, the target size is estimated by applying a trained scale filter at the target location obtained from the translation filter. The translation filter and the scale filter are two independent filters, and both are based on MOSSE. Although the DSST tracker has improved the tracking performance and is robust to target scale variation, it has some obvious limitations. One is that DSST does not explicitly consider target rotation, which has a strong negative impact on the tracking performance. The other is that sampling many patches centered on the target location consumes a lot of computation time, which is unnecessary and works against the tracking speed.
Figure 1: End node: low-resolution video, low risk. Edge node: normal-resolution video, normal risk. The cloud: high-resolution video, high risk, high sensitivity.
Besides the tracking method, the features of the tracked target are also key components of a tracker and strongly influence the tracking performance. Generally speaking, the richer the features are, the better the tracker performs. The simplest feature is the intensity matrix of the search image, which is used in MOSSE [10]. SIFT features [32] and HOG features [33] were used in object tracking afterwards. In recent years, deep features [34] have been widely used in object tracking. In this paper, HOG features combined with grayscale features, rather than deep features, are adopted because our focus is on CF-based tracking. We do not adopt SIFT because SIFT is scale-invariant and we need to explicitly capture the size change of the object.
Summarizing the analysis above, we propose MTCF to alleviate the imperfections of the relevant CF-based trackers. Aiming at tracking the target accurately and robustly at a speed of at least 25 frames per second for practical visual object tracking, MTCF consists of 4 tasks. Firstly, based on the standard CF-based translation tracker, determine the target location in the current frame. Secondly, based on DSST, sample several patches (the number of patches is alterable) with different resolutions, centered at the tracked target location determined by the translation CF, figure out the feasible scales for the patches, and seek an optimal decision policy to find the final scale among the feasible scales. Thirdly, based on the standard CF-based translation tracker, design a rotation tracker using space searching. Lastly, integrate the previous 3 tasks to form MTCF.

Variable Symbols Used in This Paper.
In this paper, $f$ denotes the "feature" of one image patch cropped with a specific bounding box, $h$ denotes the correlation filter, and $g$ denotes the response map of the correlation. Accordingly, $f_{\mathrm{trans},i}$ denotes the feature of the $i$th frame that is correlated with the translation filter $h_{\mathrm{trans}}$ to give the translation response map $g_{\mathrm{trans},i}$.
Likewise, $s_i$ denotes the scale of the target obtained by the tracker after the $i$th frame, and $r_i$ denotes the rotation angle of the target after tracking the $i$th frame.
By the convolution theorem, correlation in the spatial domain can be transformed into element-wise multiplication in the frequency domain, which dramatically reduces the computational load of correlation.
Thus, for computational efficiency, correlation is computed in the frequency domain using the Fast Fourier Transform (FFT). Let the uppercase variables be the Fourier transforms of their lowercase counterparts, i.e., $F_{\mathrm{trans},i}$, $G_{\mathrm{trans},i}$, and $H_{\mathrm{trans}}$ correspond to $f_{\mathrm{trans},i}$, $g_{\mathrm{trans},i}$, and $h_{\mathrm{trans}}$, respectively.
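To illustrate the convolution theorem used here, the following minimal NumPy sketch compares circular cross-correlation computed directly in the spatial domain with the same quantity computed by element-wise multiplication in the frequency domain. The function names and the random test data are assumptions made for illustration and are not part of the original MATLAB implementation; which factor is conjugated is a convention chosen for this sketch.

```python
import numpy as np


def correlate_spatial(f, h):
    """Circular cross-correlation computed directly in the spatial domain."""
    rows, cols = f.shape
    g = np.zeros((rows, cols))
    for dy in range(rows):
        for dx in range(cols):
            # A negative roll gives shifted[y, x] = f[y + dy, x + dx] (with wrap-around).
            shifted = np.roll(f, shift=(-dy, -dx), axis=(0, 1))
            g[dy, dx] = np.sum(shifted * h)
    return g


def correlate_fft(f, h):
    """Same correlation via element-wise multiplication in the frequency domain."""
    F = np.fft.fft2(f)
    H = np.fft.fft2(h)
    return np.real(np.fft.ifft2(F * np.conj(H)))


rng = np.random.default_rng(0)
f = rng.standard_normal((8, 8))  # stand-in for a feature patch (assumed data)
h = rng.standard_normal((8, 8))  # stand-in for a filter (assumed data)
same = np.allclose(correlate_spatial(f, h), correlate_fft(f, h))
```

The direct version evaluates one inner product per shift, whereas the FFT version needs only two forward transforms, one element-wise product, and one inverse transform, which is the source of the computational saving.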

Standard Translation Tracker Based on Correlation Filter.
As Figure 2 shows, given a video sequence, draw a rectangular bounding box (the red one, very close in size to the target) around the target in the first frame and extract a feature map $f_{\mathrm{trans},1}$ from the chosen region (the green rectangle, two times the size of the red one). Then train a correlation filter $h_{\mathrm{trans}}$ that, correlated with $f_{\mathrm{trans},1}$, gives an ideal response $g_{\mathrm{trans},1}$. In the following frames, correlate the filter $h_{\mathrm{trans}}$ with the feature map extracted from the chosen region to get a response map $g_{\mathrm{trans},i}$ as follows:
$$g_{\mathrm{trans},i} = f_{\mathrm{trans},i} * h_{\mathrm{trans}},$$
where $*$ represents the convolution operation.
In normal tracking, the response map should contain a single peak, and the peak position is taken as the center of the target (in this sense tracking is performed). The key to tracking is to find a robust feature extractor and to maintain the correlation filter $h_{\mathrm{trans}}$ with an appropriate updating strategy so that it counters adverse effects such as target appearance change and occlusion.
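As a hedged illustration of how the peak of the response map yields the target displacement, the sketch below correlates a synthetic frame with a template and converts the peak index into a signed shift. The random template, the simulated motion, and the helper name `peak_displacement` are assumptions for this example only; a real tracker would use extracted features and a trained filter instead.

```python
import numpy as np


def peak_displacement(response):
    """Convert the argmax of a circular response map into a signed (dy, dx) shift."""
    rows, cols = response.shape
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    # Indices past the midpoint correspond to negative shifts (wrap-around).
    if dy > rows // 2:
        dy -= rows
    if dx > cols // 2:
        dx -= cols
    return int(dy), int(dx)


rng = np.random.default_rng(0)
template = rng.standard_normal((32, 32))                 # assumed target appearance
frame = np.roll(template, shift=(3, -5), axis=(0, 1))    # target moved by (3, -5)

# Correlation in the frequency domain: FFT(frame) times conj(FFT(template)).
response = np.real(
    np.fft.ifft2(np.fft.fft2(frame) * np.conj(np.fft.fft2(template)))
)
shift = peak_displacement(response)
```

Adding `shift` to the previous target center gives the new center estimate under these assumptions.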

Scale Tracker Based on Correlation Filter.
Unlike the original DSST, the number of scales $S$ (i.e., the number of image patches) in this paper is an optional positive integer determined by the trade-off between tracking speed and tracking accuracy (a smaller $S$ is selected if tracking speed takes priority over tracking accuracy, and vice versa). Let $M \times N$ be the shape of the target, and construct image patches centered on the target position $p_i$ with different scales to form an image patch set $B_{\mathrm{scale},i}$, where $\beta$ is the scale step. Resize each patch of size $\beta^{n} M \times \beta^{n} N$ from $B_{\mathrm{scale},i}$ to the same shape to form a bounding box set $\mathrm{patch}_{\mathrm{scale},i}$. As Figure 3 shows, instead of extracting one feature map from a bounding box with a fixed scale, the tracker extracts a feature map $f^{n}_{\mathrm{scale},i}$ for each patch in the bounding box set $\mathrm{patch}_{\mathrm{scale},i}$ (the number of feature maps is 33 in Figure 3).
Each feature map $f^{n}_{\mathrm{scale},i}$ is flattened into a vector, and all these vectors are combined into a feature map $f_{\mathrm{scale},i}$. We design a scale correlation filter to correlate with the feature map $f_{\mathrm{scale},i}$, and the scale at which the maximum response occurs is taken as the predicted scale of the target.
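The construction of the multiscale patch set can be sketched as follows. The range of the exponent $n$ is not stated in the text, so this sketch assumes the convention of the original DSST [12], in which $n$ runs over $S$ consecutive integers centered on zero; the function names and the example target size are also assumptions.

```python
import math


def scale_exponents(num_scales):
    """Exponents n, assumed (as in DSST) to be S consecutive integers around 0."""
    half = (num_scales - 1) / 2
    return list(range(math.floor(-half), math.floor(half) + 1))


def patch_sizes(M, N, num_scales, beta):
    """Sizes beta^n * M by beta^n * N of the patches centered on the target."""
    return [(beta ** n * M, beta ** n * N) for n in scale_exponents(num_scales)]


# Example with S = 33 and beta = 1.02; the 40 x 60 target size is illustrative.
sizes = patch_sizes(M=40, N=60, num_scales=33, beta=1.02)
```

Each of these patches would then be resized to a common shape before feature extraction, so a smaller $S$ directly reduces the number of feature extractions per frame.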

Rotation Tracker Based on Correlation Filter.
The target may rotate during tracking, so we use a rotated bounding box centered on the target to crop every frame. As shown in Figure 4, let $p_i$ be the center of the target and $r$ the current rotation angle of the target, and let $B^{r}_{\mathrm{rotate},i}$ denote the bounding box with this rotation in frame $i$. By rotating around the target center, we construct a set of bounding boxes $B_{\mathrm{rotate},i}$ with the same size as $B^{r}_{\mathrm{rotate},i}$; here $\Theta$ is the given maximum rotation angular displacement and $\theta$ is the rotation step.
For each bounding box $B^{rot}_{\mathrm{rotate},i}$ from $B_{\mathrm{rotate},i}$, extract the feature map $f^{rot}_{\mathrm{rotate},i}$ and correlate it with the rotation correlation filter to get a maximum response value. Compare those values and find the largest one to get the predicted rotation angle $r$, which is taken as the tracking result of frame $i$. In addition to the method used here, we also envisioned "1-dimensional correlation rotation tracking," described in the Supplementary Materials. However, testing shows that this method requires too much computation and is not suitable for use at edge nodes.
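A rotated bounding box of fixed size around the target center can be described by its four corners. The sketch below is one possible way to compute them; the function name, the corner ordering, and the sign convention of the angle are assumptions, since the text does not specify how the rotated crop is implemented.

```python
import math


def rotated_box_corners(center, width, height, angle_deg):
    """Corners of a width x height box rotated by angle_deg about its center.

    The rotation uses the standard 2-D rotation matrix; whether a positive
    angle appears clockwise depends on the image axis convention (assumed).
    """
    cx, cy = center
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    half_w, half_h = width / 2, height / 2
    offsets = [(-half_w, -half_h), (half_w, -half_h),
               (half_w, half_h), (-half_w, half_h)]
    corners = []
    for dx, dy in offsets:
        x = cx + dx * cos_a - dy * sin_a
        y = cy + dx * sin_a + dy * cos_a
        corners.append((x, y))
    return corners


# Example: candidate boxes of the same size at three rotations (values illustrative).
boxes = {rot: rotated_box_corners((100.0, 80.0), 40.0, 20.0, rot)
         for rot in (-20, 0, 20)}
```

Because only the angle changes, every candidate box has the same size and center, matching the description of the bounding box set above.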

Multimodality Tracking Based on Correlation Filter.
Integrate the translation, scale, and rotation trackers described in the previous sections to form MTCF, whose iteration procedure at the $i$th

Rotation Estimation
(1) Construct image patches $\mathrm{patch}_{\mathrm{rotate},i}$ from the bounding box set $B_{\mathrm{rotate},i}$ centered on the target position $p_i$ with rotation angle $r_{i-1}$
(2) Extract a feature map $f^{rot}_{\mathrm{rotate},i}$ for every patch from $\mathrm{patch}_{\mathrm{rotate},i}$
(3) For every feature map $f^{rot}_{\mathrm{rotate},i}$, correlate it with the original rotation filter and get a maximum response value $\mathrm{score}^{rot}_{\mathrm{rotate},i}$
(4) Update the target rotation angle $r_i$ with the optimal $rot$ corresponding to the best $\mathrm{score}^{rot}_{\mathrm{rotate},i}$
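The four steps above can be sketched as a search over candidate angles. In this sketch, `score_fn` is a hypothetical stand-in for steps (1) to (3), i.e., cropping the rotated patch, extracting features, and taking the maximum correlation response; the assumption that $\Theta$ is a whole multiple of $\theta$ is also ours.

```python
def candidate_angles(r_prev, theta_step, theta_max):
    """Angles r_prev - theta_max, ..., r_prev, ..., r_prev + theta_max in steps of theta_step."""
    k = int(theta_max // theta_step)  # assumes theta_max is a multiple of theta_step
    return [r_prev + j * theta_step for j in range(-k, k + 1)]


def estimate_rotation(r_prev, theta_step, theta_max, score_fn):
    """Return the candidate angle with the highest score, and that score."""
    angles = candidate_angles(r_prev, theta_step, theta_max)
    scores = [score_fn(a) for a in angles]
    best = max(range(len(angles)), key=scores.__getitem__)
    return angles[best], scores[best]


# Toy score peaking near 18 degrees (purely illustrative, not a real response).
r_new, best_score = estimate_rotation(0, 20, 40, lambda a: -abs(a - 18))
```

With step 20 and maximum displacement 40 the search visits five candidates, so the cost of rotation estimation grows linearly with the ratio of the maximum displacement to the step.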

Translation Tracking Procedure.
The simplest correlation-based tracking focuses only on translation of the target. In the first frame, we label a rectangular region $B_{\mathrm{trans},1}$ centered on the target so that the tracker can extract the feature map of the target appearance. The feature map must preserve the spatial layout because the tracker uses the position of the maximum response to predict the new target position.
The simplest feature map is the gray intensity matrix obtained from a specific region (for example, $B_{\mathrm{trans},1}$) of the original frame. Many researchers use a 2-dimensional Hanning window (see Figure 5) to preprocess the raw intensity matrix. After windowing, the intensity matrix emphasizes the central region of the target and weakens the background information near the bounding box edge. However, because the bounding box in the first frame is drawn tightly around the target, the tracker may lose some features and behave unstably.
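A 2-dimensional Hanning window of the kind mentioned above can be built as the outer product of two 1-dimensional windows. The following NumPy sketch is illustrative only; the function names and the constant test patch are assumptions.

```python
import numpy as np


def hanning_2d(rows, cols):
    """2-D Hanning window as the outer product of two 1-D Hanning windows."""
    return np.outer(np.hanning(rows), np.hanning(cols))


def apply_window(patch):
    """Weight a 2-D patch so that its center dominates and its border fades to zero."""
    return patch * hanning_2d(*patch.shape)


patch = np.ones((5, 7))          # constant patch, used only to expose the weights
windowed = apply_window(patch)
```

The weights are zero on the border and one at the center, which is why features near the edge of a tight bounding box contribute little after windowing.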
To address this issue, a simple approach is to expand the search region. Define a parameter $bb$ as the ratio of the search window size to the size of $B_{\mathrm{trans},1}$. As a rule, a greater $bb$ helps extract more features of the target and makes the tracker more stable, but much more time is spent on extracting features from the larger search region. This is clearly demonstrated by "S1" in the Supplementary Materials. In this paper, we adopt a trade-off and select $bb = 2$.
From now on, we use $B_{\mathrm{trans},1}$ to represent the translation search window.
From $B_{\mathrm{trans},1}$ we get $f_{\mathrm{trans},1}$, and because we need to train the translation correlation filter $h_{\mathrm{trans}}$, an initial $g_{\mathrm{trans},1}$ is required. In prior papers, most researchers take a Gauss-shaped response map as the initialization; Figure 6 shows an example of $g_{\mathrm{trans},1}$. Though the intensity feature is cheap to compute, it is unstable because it exploits only a little of the information in the frame. Recently, many deep features (for example, convolutional neural network features) have been introduced to object tracking and perform well in accuracy and robustness. However, they are too computationally expensive, and in this paper we focus on the FHOG [36] feature.
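The Gauss-shaped initial response mentioned above can be sketched as a 2-dimensional Gaussian centered in the search window. The bandwidth `sigma` and the window size used here are assumed values for illustration; the text does not state how the width of the peak is chosen.

```python
import numpy as np


def gaussian_response(rows, cols, sigma):
    """Gauss-shaped ideal response with its peak at the window center."""
    ys = np.arange(rows) - rows // 2
    xs = np.arange(cols) - cols // 2
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.exp(-(yy ** 2 + xx ** 2) / (2.0 * sigma ** 2))


g = gaussian_response(9, 11, sigma=2.0)  # sigma = 2.0 is an assumed bandwidth
peak = tuple(int(v) for v in np.unravel_index(np.argmax(g), g.shape))
```

A narrower peak asks the filter for a sharper response, while a wider one tolerates small localization errors; either way the peak location marks the target center.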
We use an FHOG feature extractor to get the feature map $f_{\mathrm{trans},i}$. In the translation step, 27 dimensions of FHOG and 1 dimension of intensity are used. According to DSST [12], discriminative correlation filters for multidimensional features are applied as follows.
Minimize the cost function,

Security and Communication Networks
Here, $g$ is the ideal response of the correlation between the feature map and the filter, and the parameter $\lambda \geq 0$ weights the regularization term. In the FFT domain, the solution [12] can be written as where $^{*}$ indicates complex conjugation, $\odot$ denotes element-wise multiplication, and $l \in \{1, \ldots, d\}$ is the dimension index. The translation filter $H_{\mathrm{trans}}$ can be solved as below: Equation (9) is employed in offline learning to obtain the correlation filter. In practical tracking, the tracker (for example, MOSSE, KCF, and DSST) takes the target position in the $(i-1)$th frame as the center of the bounding box $B_i$ in the $i$th frame, extracts the feature map $f_{\mathrm{trans},i}$ from $B_i$, and then calculates the correlation map to determine the new target position $p_i$ corresponding to the element with the maximum value in $s_{\mathrm{trans},i}$; here, $\mathcal{F}^{-1}$ indicates the inverse Fourier transform.
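For reference, the multichannel formulation of DSST [12] that this section follows can be written in the notation defined here roughly as below. This is a hedged sketch of the standard expressions from the cited method, not a reproduction of this paper's numbered equations. The symbol $\star$ for correlation and the names $A^{l}$ and $B$ for the filter numerator and denominator are assumptions introduced for this sketch.

$$\varepsilon = \left\| \sum_{l=1}^{d} h^{l} \star f^{l} - g \right\|^{2} + \lambda \sum_{l=1}^{d} \left\| h^{l} \right\|^{2}$$

$$H^{l} = \frac{G^{*} \odot F^{l}}{\sum_{k=1}^{d} F^{k*} \odot F^{k} + \lambda} = \frac{A^{l}}{B + \lambda}, \qquad l \in \{1, \ldots, d\}$$

$$s_{\mathrm{trans},i} = \mathcal{F}^{-1}\left\{ \frac{\sum_{l=1}^{d} A_{i-1}^{l*} \odot F_{\mathrm{trans},i}^{l}}{B_{i-1} + \lambda} \right\}$$

In this form the division is element-wise, so training and detection each cost only a few FFTs per feature channel.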
Afterwards, reconstruct the bounding box $B_i$ centered on the new target position $p_i$, extract the feature map $f_{\mathrm{trans},i}$ from it, and then update the translation CF to get $H^{*}_{\mathrm{trans},i}$. Lastly, an iterative formula for equation (9) is presented as the following equations from equations (9)-(12) according to [10,12]: where $\eta$ is the learning rate.
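The train, detect, and update cycle described in this subsection can be sketched in NumPy as follows, using the numerator and denominator form of DSST [12] with a running average governed by the learning rate $\eta$. This is a minimal sketch under stated assumptions: random arrays stand in for FHOG features, a one-hot peak replaces the Gauss-shaped response for brevity, and the function names and the value of `lam` are illustrative rather than taken from the authors' MATLAB code.

```python
import numpy as np


def train(features, g):
    """Numerator A^l = conj(G) * F^l and denominator B = sum_k conj(F^k) * F^k."""
    F = np.fft.fft2(features, axes=(-2, -1))
    G = np.fft.fft2(g)
    A = np.conj(G)[None, :, :] * F
    B = np.sum(np.conj(F) * F, axis=0).real
    return A, B


def detect(A, B, features, lam):
    """Response map: inverse FFT of sum_l conj(A^l) * Z^l / (B + lam)."""
    Z = np.fft.fft2(features, axes=(-2, -1))
    return np.fft.ifft2(np.sum(np.conj(A) * Z, axis=0) / (B + lam)).real


def update(A_prev, B_prev, A_new, B_new, eta):
    """Running average of numerator and denominator with learning rate eta."""
    return ((1 - eta) * A_prev + eta * A_new,
            (1 - eta) * B_prev + eta * B_new)


rng = np.random.default_rng(0)
feats = rng.standard_normal((2, 16, 16))   # 2 assumed feature channels
g = np.zeros((16, 16))
g[8, 8] = 1.0                              # one-hot ideal response (assumption)
A, B = train(feats, g)

# Next frame: the same content moved by (2, 3) pixels, with wrap-around.
moved = np.roll(feats, shift=(2, 3), axis=(1, 2))
response = detect(A, B, moved, lam=0.01)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))

# Re-train on the new frame and blend with the previous model.
A_new, B_new = train(moved, g)
A, B = update(A, B, A_new, B_new, eta=0.015)
```

The response peak moves from the center by the same offset as the target, and the small learning rate keeps most of the previous model so that a single noisy frame does not overwrite it.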

Scale Tracking Procedure.
Two methods are commonly used for scale tracking. One is called "exhaustive scale tracking" and the other is "1-dimensional correlation filter scale tracking." In this paper, we use the "1-dimensional correlation" method.
In the previous frame, we got the target position $p_{i-1}$ and scale $s_{i-1}$.
Let $M \times N$ be the shape of the target, construct image patches centered on the target position $p_i$ using the method presented in Section 3, and resize them to form a bounding box set $\mathrm{patch}_{\mathrm{scale},i}$. The FHOG extractor is applied to extract a feature map $f^{n}_{\mathrm{scale},i}$ for each patch in the set. Each feature map is flattened into a vector, and all of these vectors are combined into an integrated vector $f_{\mathrm{scale},i}$. The target scale can then be estimated by learning a separate 1-dimensional correlation filter: design a 1-dimensional filter $h_{\mathrm{scale}}$ to correlate with $f_{\mathrm{scale},i}$. The initial ideal response $g_{\mathrm{scale},i}$ is a Gauss-shaped peak, as Figure 7 shows. The scale with the largest correlation response value is taken as the optimal scale $s_i$.
Afterwards  target final scale $s_i$, and then update the scale correlation filter to get $H^{*}_{\mathrm{scale},i}$ using equations (13)-(17). In this process, the parameter "spatial bin size" is set to 4 to save time in the subsequent processing, and all FHOG dimensions are used. So, the length of each of the $S$ feature vectors is $(M/4) \times (N/4) \times 31$.
The target scale can be estimated by learning a separate 1-dimensional correlation filter. Treat the feature vector as multidimensional features and $S$ vectors turn into Construct different groups containing different numbers of patches. The number of patches varies from small to large (e.g., from 10 to 55), and all patches are centered at the tracked target location determined by the translation CF in the current frame. Run the basic DSST [12] on the visual track data set [35], and for each group calculate the tracking speed and the tracking accuracy, which is characterized by the Euclidean distance between the tracking window center and the ground truth center. The experimental results are shown in Figure 8.
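The accuracy measure used in this comparison, the Euclidean distance between the tracking window center and the ground truth center, can be computed as in the sketch below. The box layout `(x, y, width, height)` with `(x, y)` as the top-left corner and the function names are assumptions, and the sample boxes are made up for illustration.

```python
import math


def box_center(box):
    """Center of a box given as (x, y, width, height); layout is an assumption."""
    x, y, w, h = box
    return x + w / 2, y + h / 2


def center_error(pred_box, gt_box):
    """Euclidean distance between predicted and ground-truth box centers."""
    (px, py), (gx, gy) = box_center(pred_box), box_center(gt_box)
    return math.hypot(px - gx, py - gy)


def mean_center_error(pred_boxes, gt_boxes):
    """Average center error over a sequence of frames."""
    errors = [center_error(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    return sum(errors) / len(errors)


# Illustrative boxes only, not data from the paper.
preds = [(0, 0, 10, 10), (3, 4, 10, 10)]
truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
avg_err = mean_center_error(preds, truth)
```

Reporting this average together with frames per second for each patch count gives the two quantities that the comparison trades off.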
From Figure 8, it can be seen that the tracking accuracy and speed vary with the number of patches. A larger number of patches corresponds to a lower tracking speed, and vice versa.
Thus, on the premise of ensuring the tracking accuracy, we appropriately select the number of patches in order to save time for introducing target rotation into DSST to form a multimodality tracker. Experiments show that most of the time is spent in the feature extraction module.

Rotation Space Tracking Procedure.
Set the target attitude to $r = 0$ in the first frame; in each successive frame, construct a set of bounding boxes $B_{\mathrm{rotate},i}$ as described in the previous section. The FHOG extractor is applied to extract a feature map $f^{rot}_{\mathrm{rotate},i}$ for each patch $B^{rot}_{\mathrm{rotate},i}$ from the bounding box set $B_{\mathrm{rotate},i}$. The target rotation can be estimated by learning a separate 1-dimensional correlation filter: train a single 1-dimensional rotation filter $h_{\mathrm{rotate}}$ as the similarity function to compute the maximum correlation response $\max g^{rot}_{\mathrm{rotate},i}$ for each feature map $f^{rot}_{\mathrm{rotate},i}$. Therefore, the best tracking angle $r_i$ is calculated by using the following equation:   rotation angle $r_i$, and then update the rotation correlation filter to get $H^{*}_{\mathrm{rotate},i}$ using the following equations: We take Figure 9 as an example to demonstrate our rotation angle search. As Figure 9 shows, $r = 0^{\circ}$, $\theta = 20^{\circ}$, $\Theta = 40^{\circ}$, and $rot \in \{-40, -20, 0, 20, 40\}$. Construct a set of bounding boxes $B_{\mathrm{rotate},i}$ with 5 patches $B^{rot}_{\mathrm{rotate},i}$, and train the rotation correlation filter $h_{\mathrm{rotate}}$ using the samples in the first frame. Figure 10 shows the correlation response for each patch, where $r + 20^{\circ}$ corresponds to the highest response. Thus, we conclude that $r + 20^{\circ}$ is the best predicted rotation angle in Figure 9, which demonstrates the effectiveness of our proposed rotation search method.
In this process, the setting of $\theta$ and $\Theta$ is very important for good tracking performance, including tracking speed and tracking accuracy. A greater $\Theta$ and a smaller $\theta$ improve the tracking accuracy, but much more time is spent on extracting features of the tracked target, which reduces the tracking speed. This is clearly demonstrated by "S2" in the Supplementary Materials. As a rule, $\theta$ and $\Theta$ are fixed by experiments according to the requirements of the tracking task.
In this paper, we also adopt such a policy.

Experiment Setup.
In this paper, our method is implemented in MATLAB R2019a on a Windows 10 system. The experiments are conducted on a PC with an Intel Xeon® 2.4 GHz processor and 63.9 GB RAM. The data set is selected from the visual track data set [35]. Our experiment is divided into 3 groups with different parameters. All of them are used to test our proposed method: to check that, on the premise of ensuring the tracking accuracy, appropriately decreasing the number of patches saves time for introducing target rotation into DSST to form a multimodality tracker; to verify the effectiveness of our proposed rotation tracking algorithm; and to demonstrate the overall tracking performance of our proposed visual object multimodality tracking algorithm based on correlation filters. In the experiment of each group, sequences from the visual track data set are selected so that the target undergoes translation, scale change, and rotation simultaneously. The number of scales $S$, the scale factor $\beta$, and the learning rate $\eta$ are kept unchanged within each group and are fixed as (33, 1.02, 0.015) and (27, 1.0247, 0.015), respectively, which means that the maximum and minimum of the scale range are the same for the two groups, as Figure 11 shows. We also test the influence of different sizes ($bb$) of the search window in the Supplementary Materials.
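The statement that the two parameter settings cover the same scale range can be checked with a short calculation. The sketch assumes that the scale exponents are symmetric about zero, as in the original DSST [12], so that the extreme scales are $\beta^{\pm (S-1)/2}$; the function name is illustrative.

```python
def scale_range(num_scales, beta):
    """Smallest and largest scale factor, assuming exponents symmetric about 0."""
    half = (num_scales - 1) // 2
    return beta ** (-half), beta ** half


lo_a, hi_a = scale_range(33, 1.02)     # exponents -16 ... 16
lo_b, hi_b = scale_range(27, 1.0247)   # exponents -13 ... 13
gap_hi = abs(hi_a - hi_b)
gap_lo = abs(lo_a - lo_b)
```

Under this assumption both settings span roughly 0.73 to 1.37 times the current target size, so the second setting covers the same range with six fewer patches per frame.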

Experiment of the First Group.
In this group experiment, $\Theta$ is selected to be $10^{\circ}$, and $\theta$ is selected to be $5^{\circ}$. The resulting tracking speed is 31 fps, and the experimental results are shown in Figures 12 and 13, which consist of some typical tracking frames.
From Figure 11, it can be seen that appropriately decreasing the number of patches saves enough time to introduce target rotation into DSST and form a multimodality tracker while maintaining the tracking accuracy, and that our proposed rotation tracking algorithm works well.

Rotation Tracking Performance Test.
In this group experiment, $\theta$ is selected to be 12, and $\tau$ is selected to be 4. The resulting tracking speed is 29 fps, and the experimental results are shown in Figure 13, which consists of some typical tracking frames. The speed is lower than in the first group experiment because $\tau$ is selected to be 4, which increases the number of $B^{\rho}_{\mathrm{rotate},i}$ and therefore the time spent on extracting the feature maps $f^{\rho}_{\mathrm{rotate},i}$ from them. Nevertheless, our proposed visual tracker still works well in tracking the target with translation, scale, and rotation, as Figure 13 shows. From this perspective, the rotation step can be appropriately decreased if tracking accuracy is preferred and increased if tracking speed is preferred.

Multimodality Tracking Performance Test.
In both group experiments, our proposed MTCF algorithm is run on the visual track data set [35] to demonstrate the multimodality tracking performance; the tracking results are shown in Figure 14.
From Figure 14, it can be seen that our proposed MTCF has good multimodality tracking performance and obtains the position, scale, and attitude angle of the tracked target simultaneously. The generalization ability of this algorithm remains at the same level as DSST and depends heavily on the HOG extraction algorithm.

Conclusion and Future Work
In this paper, on the premise of ensuring the tracking accuracy, we introduce an alterable patch number for target scale tracking and space searching for target rotation tracking into the standard DSST tracking method, and we propose a multimodality tracker, MTCF, that simultaneously copes with translation, scale, and in-plane rotation of the tracked target and obtains its position, scale, and attitude angle at the same time. Experimental results demonstrate that the proposed multimodal target tracking algorithm MTCF (1) reaches a tracking speed above the 25 fps required for practical visual object tracking by appropriately decreasing the number of patches for target scale tracking and (2) obtains good tracking performance for translation, scale, and rotation simultaneously. In the future, our work will focus on the distributed hardware and software implementation of the proposed multimodal tracking algorithm. Terminal devices not equipped with GPUs use low-resolution video to reduce the computational pressure of extracting target features. Edge devices with certain computing capabilities are responsible for the main target tracking tasks. Finally, for a few critical and high-risk areas, the network bandwidth saved by the above two measures is used to upload video to the central cloud processor for computation, achieving hierarchical, coordinated governance.

Data Availability
All the source codes and related pictures will be uploaded to GitHub and will be available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.