Robust Visual Tracking Based On Convolutional Sparse Coding

This paper proposes a target tracking algorithm based on convolutional sparse coding (CSC). The algorithm first uses CSC to divide the region of interest into a smooth image and four detail images with different fitting degrees. The smooth image is then tracked with the kernelized correlation filter (KCF) to produce an initial result. Based on the initial estimate of the target area and a linear combination of the four detail images, an appearance model of the details is constructed, and samples are matched against this model using the overlap rate and the Euclidean distance to determine the tracking result of the detail images. Finally, the two tracking results, one from the smooth image and one from the detail images, are linearly combined to determine the final position of the target in the new frame. Extensive experiments on video sequences from Tracking Benchmark 2015 show that our method outperforms most existing visual tracking methods.


I. RELATED WORK
Current tracking methods are mainly divided into two categories: generative and discriminative. Generative methods directly describe the target and take the candidate sample with the maximum likelihood or posterior probability as the current estimate of the real target. Discriminative methods make full use of both target and background information: they train a classifier on the target and the background to identify the target.
Generative tracking methods find the candidate closest to the target model, in the high-dimensional space describing the target, as the current estimate. For example, Black et al. [1] proposed a subspace-based method that computes the affine transformation between the current image containing the target and the image reconstructed from feature vectors. Building on this, Ross et al. [2] updated the basis of the feature space online with the sequential Karhunen-Loeve (SKL) algorithm. Mei et al. [3] solved for a sparse representation of the target over a dictionary composed of target templates and positive and negative trivial templates via regularized least squares, which effectively handles problems such as illumination variations and occlusions during tracking. Subsequently, many algorithms [4], [5] based on the L1 tracker emerged with continual improvements. Although generative methods have made great breakthroughs, they are limited by the difficulty of accurately separating the target from the background.
Discriminative methods learn to distinguish the target from the background. With the Support Vector Machine (SVM) [6], the target and background are separated by dividing positive and negative samples. Based on SVM, Hare et al. [7] proposed the Structured Support Vector Machine (SSVM) to further enhance its discriminating ability. And based on SSVM, Ning et al. [8] proposed the Dual Linear Structured Support Vector Machine (DLSSVM) to improve the ability to process high-dimensional features and the sampling capability.
Among discriminative tracking methods, the most famous are those based on correlation filtering [9]-[15]. Bolme et al. [9] proposed the MOSSE tracker, which first applied correlation filtering to target tracking. It performs the correlation between the target template and the sample in the Fourier domain; the larger a sample's response value, the more similar it is to the target template. Based on MOSSE, Henriques et al. [10] introduced the circulant matrix and the kernel trick to target tracking, convolving the densely sampled candidates with the circulant matrix formed by the target template in the Fourier domain, and proposed the CSK tracker. Based on CSK, and while maintaining tracking speed, Henriques et al. [11] introduced HOG features to extend a single channel to multiple channels. Bertinetto et al. [12] then integrated HOG and color features to further improve tracking accuracy. Based on Spatially Regularized Discriminative Correlation Filters (SRDCF), Danelljan et al. [13] replaced the HOG features in SRDCF with deep features from a single convolutional layer of a CNN, improving accuracy. After that, Danelljan et al. [14] improved the speed and stability of C-COT by reducing model parameters, reducing the number of samples, and adopting a sparse update strategy. Valmadre et al. [15] used an end-to-end approach that treats the correlation filter as a layer in a CNN, which guaranteed tracking efficiency.
Recently, tracking methods with deep features [16]-[19] have become popular. Li et al. [16] proposed a method for learning target-aware features and integrated them with a Siamese matching network; this method can identify targets with significant appearance changes. Wang et al. [17] proposed a SiamFC-based tracker whose structure has two parts, coarse matching and fine matching: robustness is enhanced through generalized training in the coarse-matching stage, and discriminative power is enhanced through a distance-learning network in the fine-matching stage. Du et al. [18] proposed a tracker that detects target corners. It first uses a Siamese network to roughly separate foreground and background to obtain the ROI, then uses the relationship between the target template and the ROI to highlight corner regions and enhance the ROI features for corner detection, finally achieving more accurate bounding-box estimation. Guo et al. [19] proposed a fully convolutional Siamese network in which a Siamese subnetwork is first used for feature extraction and a classification-regression subnetwork is then used to predict the target box, obtaining fast and more accurate segmentation results.
All the above tracking methods establish a single appearance model of the target. So, when model drift occurs, it is very difficult for them to correct themselves in subsequent tracking. In particular, when updating the target appearance model, if the online update ability of the model is too strong, background information around the target is easily absorbed into the model, causing over-fitting; if it is too weak, the target is lost when it deforms or is partially occluded, which amounts to under-fitting. This paper divides the target into a smooth image and a detail part based on CSC, establishes appearance models for the two parts, and achieves visual tracking by combining the results of the two appearance models to improve performance.
II. OVERALL FRAMEWORK

This paper first extracts the region of interest based on the target area; that is, the target area is expanded 2.5 times in each direction. The region of interest is then divided into a smooth image and four detail images with different fitting degrees by CSC. For the smooth image, the model is initialized and tracked based on the KCF. For the detail images, the four detail images with different fitting degrees are first linearly combined to build an appearance model of the detail images; the tracking result of the detail images is then determined by matching the overlap rate and Euclidean distance of the sampled candidates against the model. Lastly, the tracking results of the detail and smooth images are combined linearly to determine the position of the target in the new frame. The flow of the algorithm is shown in Fig. 1. It includes the target model initialization, target tracking, and model update phases.

A. Model initialization phase
First, we extract the region of interest based on the target area. It is then divided into a smooth image and four detail images with different fitting degrees based on CSC. For the smooth image, an initial appearance model is established according to the KCF. For the detail images, the four detail images with different fitting degrees are linearly combined to establish the detail appearance model according to the size of the standard target area.

B. Target tracking phase
Based on the target position and size in the previous frame, a 2.5-times-expanded area is extracted in the new frame and divided into a smooth image and detail images based on CSC. The two parts are then tracked separately by template matching. The smooth image is tracked with the KCF: the target position is determined by the response value at each position of the region of interest. For the detail images, N samples of the same size as the detail appearance model are randomly selected in the region of interest, and the target position in the new frame is determined by the overlap rate and Euclidean distance between each sample and the appearance model. Lastly, the results of the smooth image and the detail images are linearly combined to determine the final position of the tracking target in the new frame.
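The final fusion step described above, combining the two intermediate results into one position, can be sketched as follows. The fusion weight `gamma` is a hypothetical parameter; the paper does not state its value here.

```python
def fuse_positions(pos_smooth, pos_detail, gamma=0.5):
    """Linearly combine the tracking result of the smooth image and the
    detail images to get the final target position (x, y).
    gamma is a hypothetical fusion weight, not a value from the paper."""
    xs, ys = pos_smooth
    xd, yd = pos_detail
    return (gamma * xs + (1.0 - gamma) * xd,
            gamma * ys + (1.0 - gamma) * yd)

# Usage: two candidate positions from the two trackers
final = fuse_positions((100.0, 60.0), (104.0, 64.0))
```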

C. Model update phase
The model update includes updating the appearance models of the smooth image and the detail images. For the smooth image, an appearance model is first established from the tracking result and then linearly combined with the old appearance model. For the detail images, four new detail models with different fitting degrees are extracted according to the tracking result in the new frame and linearly combined to update the detail appearance model.
III. TARGET TRACKING BASED ON CSC

In this section, we first introduce how the region of interest is divided into a smooth image and four detail images with different fitting degrees based on CSC, and how the appearance models of the smooth image and the detail images are established, as shown in Fig. 2. The smooth image is tracked based on the KCF. For the detail images, N samples are first randomly selected in the region of interest; the overlap rate and Euclidean distance between each sample block and the detail appearance model are then calculated to match the template and obtain the tracking result of the detail images. Lastly, the tracking results of the smooth image and the detail images are linearly combined to obtain the position of the tracking target in the new frame.

A. Initialize the target model based on CSC
In this paper, based on the method proposed by Gu et al. [21], the image is divided into a smooth part and detail parts using N filters, as shown in Fig. 3. The smooth part contains the image color and shape features, and the detail parts represent the image edge and texture features, as shown in

I = b ⊗ z + d, (1)

where I represents the original image, b ⊗ z represents the smooth part, and d represents the detail part; b is a low-pass filter and z is a feature map. As shown in Fig. 2, the green frame is the target standard area, and the red frame is the region of interest. The region of interest is then divided into a smooth image and four detail images with different fitting degrees based on CSC.
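Real CSC solves a sparse coding problem over a learned filter bank; the sketch below only illustrates the smooth/detail split of (1), with a single fixed 5x5 averaging kernel standing in for the low-pass filter b (a hypothetical choice, not the paper's learned filters).

```python
import numpy as np
from scipy.ndimage import convolve

def split_smooth_detail(image, lowpass=None):
    """Split an image into a smooth part (low-pass response) and a
    detail residual, so that image = smooth + detail, as in eq. (1)."""
    if lowpass is None:
        # Hypothetical stand-in for the low-pass filter b: 5x5 averaging.
        lowpass = np.full((5, 5), 1.0 / 25.0)
    smooth = convolve(image, lowpass, mode="nearest")  # plays the b (x) z role
    detail = image - smooth                            # residual detail part d
    return smooth, detail

# Usage: a synthetic 2-D "frame"
frame = np.outer(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
smooth, detail = split_smooth_detail(frame)
```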

1) Initialize the target appearance model of the smooth image based on the KCF
The appearance model of the smooth image is established based on the KCF. So, the first step in initializing the tracking target is to construct a cyclic matrix C(x) from the extracted target features. The cyclic matrix is then diagonalized with the discrete Fourier transform to obtain the diagonal cyclic feature matrix, as shown in

X = C(x) = F diag(x̂) F^H, (2)

where F is the constant Fourier matrix, F F^H = I, and x̂ is the generating vector after the Fourier transform. Then, the least-squares problem is solved in the Fourier domain based on X to train the target detector α̂, as shown in

α̂ = ŷ / (k̂^{xx} + λ), (3)

where k̂^{xx} is the first row of the kernel matrix in the Fourier domain, ŷ is the regression target generated from a Gaussian function, and λ is the regularization parameter.

2) Initialize the target appearance model of the detail images based on linear combination

As shown in Fig. 3, this paper constructs the detail model by extracting the target area from the detail images. This paper uses 400 filters in the CSC process, so it obtains 400 detail feature maps of the tracking target (d_1 ~ d_400). To prevent under-fitting or over-fitting, every 100 detail maps are superimposed:

D_1 = Σ_{i=1}^{100} d_i, D_2 = Σ_{i=101}^{200} d_i, D_3 = Σ_{i=201}^{300} d_i, D_4 = Σ_{i=301}^{400} d_i,

giving four detail models (D_1 ~ D_4) of the tracking target with different fitting degrees. Lastly, D_1, D_2, D_3, and D_4 are linearly combined to obtain the final appearance model of the detail images, denoted D, as shown in

D = Σ_{i=1}^{4} w_i D_i, (4)

where the w_i are the combination coefficients.
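The Fourier-domain ridge regression of (3) can be sketched as below. The Gaussian kernel correlation follows the standard KCF formulation; the patch size, σ, and λ values are illustrative assumptions, and a random patch stands in for real target features.

```python
import numpy as np

def gaussian_kernel_correlation(x1, x2, sigma=0.5):
    """Gaussian kernel correlation of two patches, computed with the FFT
    as in the KCF formulation (circular cross-correlation term)."""
    c = np.fft.ifft2(np.conj(np.fft.fft2(x1)) * np.fft.fft2(x2)).real
    d2 = (np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c) / x1.size
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2))

def train_kcf(x, y, lam=1e-4, sigma=0.5):
    """Ridge regression in the Fourier domain:
    alpha_hat = y_hat / (k_hat_xx + lambda), as in eq. (3)."""
    k_xx = gaussian_kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

# Usage: train on one patch with a Gaussian-shaped regression target
rng = np.random.default_rng(0)
x = rng.random((32, 32))                      # stand-in for target features
rows, cols = np.mgrid[0:32, 0:32]
y = np.exp(-((rows - 16) ** 2 + (cols - 16) ** 2) / (2 * 2.0 ** 2))
alpha_hat = train_kcf(x, y)
```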

B. Target tracking

1) Smooth image tracking based on the KCF
Using the target position in the previous frame as the center of the current frame, we expand the target region 2.5 times to form the sampling area z. We then extract the features of the sampling area to form a feature matrix Z. A kernel function is used to calculate the kernel correlation of the cyclic feature matrix X and Z; that is, K^{xz} = C(k^{xz}), where k^{xz} is a row vector obtained from x and z through the kernel function operation, and C(·) is the function that constructs a cyclic matrix, so that K^{xz} is a circulant matrix. After that, the target detector α̂ and k̂^{xz} are multiplied element-wise to obtain the response matrix f̂(z) over the positions of the tracking target in the sampling area, as shown in

f̂(z) = k̂^{xz} ⊙ α̂. (5)

The larger a value in the response matrix, the greater the possibility that the corresponding position contains the target. So, the position corresponding to the maximum value in the response matrix is taken as the target position in the smooth image.
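The detection step of (5) can be sketched as follows, reusing the same Gaussian kernel correlation as in training; the patch contents, sizes, and parameters are illustrative assumptions rather than the paper's actual features.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma=0.5):
    """Gaussian kernel correlation via the FFT (same form used for training)."""
    c = np.fft.ifft2(np.conj(np.fft.fft2(x1)) * np.fft.fft2(x2)).real
    d2 = (np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c) / x1.size
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2))

def response_map(x_model, z, alpha_hat, sigma=0.5):
    """Eq. (5): f(z) = IFFT( k_hat_xz ⊙ alpha_hat ); the argmax of the
    real response gives the target position in the smooth image."""
    k_xz = kernel_correlation(x_model, z, sigma)
    return np.fft.ifft2(np.fft.fft2(k_xz) * alpha_hat).real

# Usage: detect on a new patch z with a detector trained as in eq. (3)
rng = np.random.default_rng(1)
x = rng.random((32, 32))
rows, cols = np.mgrid[0:32, 0:32]
y = np.exp(-((rows - 16) ** 2 + (cols - 16) ** 2) / 8.0)
alpha_hat = np.fft.fft2(y) / (np.fft.fft2(kernel_correlation(x, x)) + 1e-4)
z = rng.random((32, 32))  # stand-in for the new sampling-area features
resp = response_map(x, z, alpha_hat)
row, col = np.unravel_index(np.argmax(resp), resp.shape)
```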

2) Tracking based on the detail images
For tracking the target in the detail images, we randomly select 400 samples (S_1 ~ S_400) of the same size as the target model from the detail images of the region of interest. First, we calculate the overlap rate OR between each sample and the model. The width and height of the overlapping part are

Width = (w_t + w_s) − (max(x_t + w_t, x_s + w_s) − min(x_t, x_s)), (6)
High = (h_t + h_s) − (max(y_t + h_t, y_s + h_s) − min(y_t, y_s)), (7)

where Width is the width of the overlapping part and High is its height, (w_t, h_t) is the size of the target model, and (w_s, h_s) is the size of the sample; OR is then the ratio of the overlapping area to the union of the two areas. For the remaining samples, we calculate the Euclidean distance between the sample and the model. As with the model, we first extract the four sample detail maps with different fitting degrees, each superimposed from 100 detail maps, and linearly combine them to obtain the final sample S. Then, the Euclidean distance EU between S and the detail model D is calculated by

EU = sqrt( Σ_i (S_i − D_i)² ). (8)
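The two matching measures, overlap rate and Euclidean distance, can be sketched as below; boxes are assumed to be (x, y, w, h) tuples, which is an illustrative convention, not one stated in the paper.

```python
import numpy as np

def overlap_rate(box_a, box_b):
    """Overlap rate OR of two boxes (x, y, w, h): the overlapping area
    (Width * High, clamped at zero) over the union of the two areas."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Width/High of the overlapping part, following the paper's form:
    # (w_a + w_b) - (max right edge - min left edge), clamped at zero.
    width = max(0.0, (wa + wb) - (max(xa + wa, xb + wb) - min(xa, xb)))
    high = max(0.0, (ha + hb) - (max(ya + ha, yb + hb) - min(ya, yb)))
    inter = width * high
    return inter / (wa * ha + wb * hb - inter)

def euclidean_distance(sample, model):
    """EU = sqrt(sum_i (S_i - D_i)^2) between a sample detail map and
    the detail appearance model."""
    return np.sqrt(np.sum((np.asarray(sample) - np.asarray(model)) ** 2))
```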

C. Update of the target appearance model
After completing the tracking of the current frame, we update the target appearance model according to the tracking result. The update process is divided into updating the appearance model of the smooth image and that of the detail images.

1) Update the appearance model of the smooth image
The update of the smooth-image model mainly updates the cyclic feature matrix x̂ that represents the target and the target detector α̂. After obtaining the position of the tracking target in the new frame, we determine the cyclic feature matrix x̂_new and target detector α̂_new of the target in the new frame. We then combine them with those of the previous frame by linear interpolation:

x̂′ = (1 − η) x̂ + η x̂_new,
α̂′ = (1 − η) α̂ + η α̂_new, (9)

where x̂′ and α̂′ are the updated cyclic feature matrix and target detector, and η is the learning parameter.
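The interpolation update of (9) is a one-liner applied to both model components; the learning rate η = 0.02 below is a hypothetical value, since the paper does not state it here.

```python
import numpy as np

def interp_update(old, new, eta=0.02):
    """Eq. (9): linear-interpolation model update,
    updated = (1 - eta) * old + eta * new.
    eta = 0.02 is a hypothetical learning rate, not the paper's value."""
    return (1.0 - eta) * old + eta * new

# Usage on both updated components: cyclic feature matrix and detector
x_hat = interp_update(np.zeros((4, 4)), np.ones((4, 4)))
alpha_hat = interp_update(np.zeros(4, dtype=complex), np.ones(4, dtype=complex))
```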

2) Update the appearance model of the detail images
The update of the appearance model of the detail images means updating the four detail models with different fitting degrees. Using the standard position of the target in the new frame, we extract the corresponding areas from the four detail images of the region of interest with different fitting degrees and linearly combine them to obtain a new appearance model of the detail images.
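The group-and-combine step used to rebuild the detail model can be sketched as follows; the equal combination weights are a hypothetical choice, since the paper leaves the coefficients unspecified.

```python
import numpy as np

def build_detail_model(detail_maps, weights=(0.25, 0.25, 0.25, 0.25)):
    """Superimpose 400 detail feature maps in groups of 100 to get the
    four detail models D1..D4 with different fitting degrees, then
    linearly combine them into the appearance model D.
    Equal weights are an illustrative assumption."""
    maps = np.asarray(detail_maps)                       # shape (400, H, W)
    d1_to_d4 = maps.reshape(4, 100, *maps.shape[1:]).sum(axis=1)
    w = np.asarray(weights).reshape(4, 1, 1)
    return (w * d1_to_d4).sum(axis=0)                    # final model D

# Usage: rebuild the model from the detail maps of the new frame
rng = np.random.default_rng(2)
maps = rng.random((400, 8, 8))
D = build_detail_model(maps)
```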

IV. EXPERIMENTAL RESULTS

A. Quantitative analysis
We analyze the trackers based on the center-position-error precision and the coverage success rate. The precision is obtained from the center distance between the tracking result and the ground-truth target area, and the success rate is calculated from the overlap rate of the two. Compared with the other six trackers, our algorithm performs better in both precision and success rate, as shown in Fig. 4. For complex backgrounds, the precision and success rate rank first with 0.781 and 0.605. For in-plane rotation, the precision is 0.745 and the success rate ranks first with 0.555. For illumination variations, the precision is 0.712, ranking third, only 0.014 and 0.002 lower than the first-ranked KCF and the second-ranked DLSSVM; the success rate ranks first with 0.523. See Table 1 and Table 2 for further details.

B. Qualitative analysis
This section analyzes the effectiveness of our method under the challenges of occlusion, illumination variation, and out-of-plane rotation, and compares the algorithm with nine other algorithms.
1) Occlusion

Fig. 5 shows the comparisons for occlusion, taking the Coke and Jogging.1 videos as examples. The target is blocked during its movement by environmental factors, such as the leaves in frame 39 of Coke and the telephone pole in frame 75 of Jogging.1. Many trackers, such as Struck and DLSSVM, easily lose the target; if the target is blocked completely, the occluder is treated as the target, leading to failure. We construct appearance models of both the smooth image and the detail images to track the target, avoiding the model drift caused by a single model.

2) Illumination variation
As shown in Fig. 6, which compares the tracking results of the various algorithms under illumination variation, we use the Human8 and Man videos as examples. In the Human8 video, a person walks from a bright place into a dark place from frame 2 to frame 20, and from a dark place back into a bright place from frame 86 to frame 113. When the person passes through the shadow, CFNet and Struck fail because they extract gray-level target features. In the Man video, the light gradually increases from dark to bright; the figure shows that our algorithm tracks well.

3) Out-of-plane rotation

Fig. 7 shows the tracking results of the various algorithms under out-of-plane rotation, taking the Liquor and SUV video sequences as examples. In the Liquor video, as shown in frames 386 and 401, part of the target leaves the image sensor while it moves around the brown bottle, causing out-of-plane rotation. In the SUV video, as shown in frames 28 and 47, the fast movement of the car means the camera cannot keep up, so part of the target exceeds the image range. As shown in frame 432 of Liquor and frame 63 of SUV, once the target appears completely in the image again, our algorithm can recover it in time. This is because, when matching the detail appearance model, our algorithm uses the overlap rate between the samples and the model, while in the tracking of the smooth image it can distinguish the background from the target; combining the advantages of the two parts guarantees the robustness of the algorithm when the target exceeds the image range.

V. CONCLUSION

This paper designs a target tracking algorithm based on CSC. First, it extracts a region of interest based on the target area.
Then, based on CSC, we divide it into a smooth image and four detail images with different fitting degrees. The smooth image is initially modeled and tracked based on the KCF. For the detail images, we first extract the four detail images and combine them to initialize the detail appearance model. During tracking, we randomly collect 400 samples in the region of interest and calculate the overlap rate and Euclidean distance between each sample and the appearance model to determine the tracking result of the detail images. By combining the tracking results of the smooth image and the detail images, the final position of the tracking target in the new frame is obtained. Establishing appearance models of both the smooth image and the detail images avoids the model drift easily caused by a single model, improving accuracy and robustness. We performed quantitative and qualitative comparisons with current trackers on Tracking Benchmark 2015; extensive experiments show that our method performs better under challenges such as illumination variations and complex backgrounds.