Advances in Multimedia

1687-5699 1687-5680

Hindawi

10.1155/2018/7481645

7481645

Research Article

Visual Tracking Based on Discriminative Compressed Features

http://orcid.org/0000-0002-0051-7355

Liu

Wei

¹ Wang

Hui

² Zhang

Lei

Department of Modern Education Technology

Ludong University

Yantai

China

ytnc.edu.cn

Lab

CNCERT/CC

Yumin Road No. 3A

Beijing 100029

China

2018

182018

2018 03 04 2018 13 06 2018 11 07 2018 182018

2018

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Visual tracking is a challenging research topic in computer vision with many potential applications. A large number of tracking methods have been proposed and have achieved promising tracking performance. However, current state-of-the-art tracking methods still cannot meet the requirements of real-world applications. One of the main challenges is to design a good appearance model that describes the target's appearance. In this paper, we propose a novel visual tracking method, which uses compressed features to model the target's appearance and then uses an SVM to distinguish the target from its background. The compressed features are obtained by zero-tree coding of multiscale wavelet coefficients extracted from an image. They are both low-dimensional and discriminative, which helps to achieve better tracking results. Experimental comparisons with several state-of-the-art methods demonstrate the superiority of the proposed method.

Shandong Province Higher Educational Science and Technology Program

J14LN64

1. Introduction

Visual tracking aims at locating a target of interest in an image sequence. It is one of the most active research topics in computer vision, with many potential applications such as video surveillance, human-computer interaction, navigation, and automatic driving, and it has attracted increasing interest in the past few decades [1–16]. However, due to a variety of challenging factors such as illumination changes, pose deformation, and occlusion, the performance of visual tracking is still far from meeting the requirements of practical applications. The main difficulty is that it is not easy to design an appearance model that is both good at distinguishing the target from its background and robust to the above-mentioned appearance changes. Finding a good appearance model is a challenging problem in many visual applications, such as image classification [17–19] and video recognition [20–22].

The literature contains a variety of visual tracking methods that focus on developing effective appearance models. Most of these methods can be classified into two groups: generative methods and discriminative methods. The former learn generative features from samples that contain only the target, with the purpose of representing the target as accurately as possible. The latter learn discriminative features from samples that include both the target and its background, which usually involves solving an optimization problem. Because they tend to achieve better tracking performance, discriminative methods have attracted more attention.

In this paper, to overcome the challenges caused by low contrast, illumination changes, and scale changes, we propose a novel tracking method using discriminative compressed features, which runs in real time and is able to process multiple scales of the target. The main idea of the proposed method is to combine compressive sensing and a multiscale texture transformation to extract compressed texture features and then to use an SVM to separate the target from its background. The compressed features are both low-dimensional and discriminative, which helps to achieve better tracking results. Experimental comparisons with several state-of-the-art methods demonstrate the superiority of the proposed method.

The rest of this paper is organized as follows. In Section 2, we review the work closely related to our proposed approach. Section 3 describes the discriminative compressed features, and Section 4 presents the SVM-based tracking method. Experimental results are reported and analyzed in Section 5. We conclude this paper in Section 6.

2. Related Work

Many tracking methods have been proposed in the past decades, and they can be roughly divided into generative methods and discriminative methods. Generative methods focus on modeling the appearance of the tracked target and then take the candidate that is most similar to the target template as the tracking result. Representative methods include trackers based on sparse representation [23–29]. In [29], sparse coding is used to extract features from sampled patches, and the local sparse features are then pooled into a global representation. In [28], an online learning sparse representation is proposed for visual tracking to handle occlusion. In [25], a joint sparse representation framework is used to combine multi-cue features for visual tracking. Since features from different cues describe the tracked target from different aspects, more robust tracking results can be obtained when multi-cue features are used. In [23], a biologically inspired appearance model is proposed to model target appearance, which is also based on features extracted using sparse coding.

Discriminative methods learn a binary classifier, which is then used to classify a candidate as target or background [5, 8, 14, 16, 30–34]. In [30], Yakut and Kehtarnavaz proposed to track ice-hockey pucks by combining three pieces of information in ice-hockey video frames using an adaptive gray-level thresholding method. In [31], Topkaya et al. proposed a multiple object tracking method using tracklet clustering, which first obtains short yet reliable tracklets and then clusters the tracklets over time based on color, spatial, and temporal attributes. In [32], Wang and Zhao proposed an adaptive appearance model called Principal Component-Canonical Correlation Analysis (P3CA) to extract discriminative features for object tracking. In [14], Qi et al. proposed a CNN-based tracking method, which uses correlation filters to construct six weak trackers on the outputs of six CNN layers. These weak trackers are then adaptively combined by a Normal Hedge algorithm. In [34], an improved method is proposed that uses an SNT to compute the loss of each weak tracker and thereby achieves better tracking performance.

3. Discriminative Compressed Features

3.1. Multiscale Wavelet Transformation

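A multiscale wavelet (multiwavelet) is a wavelet system built from two or more scaling functions. It preserves the local time-frequency properties of a single wavelet while overcoming its drawbacks, and therefore captures more properties at different frequencies. In this paper, we choose the GHM multiscale wavelet [35], whose coefficients are computed recursively as

$$v_{j,k} = \sum_{m} G_{m-2k}\, v_{j-1,m} \tag{1}$$

$$w_{j,k} = \sum_{m} H_{m-2k}\, v_{j-1,m} \tag{2}$$

where $v_{j,k}$ and $w_{j,k}$ are the low-frequency and high-frequency coefficients of the $j$th scale of the input signal, respectively; $v_{j-1,m}$ denotes the low-frequency coefficients of the $(j-1)$th scale; and $k$ and $m$ index the coefficients within a scale, with ranges that depend on the input image. The multiwavelet filters are defined as

$$G_0 = \begin{bmatrix} \frac{3}{5\sqrt{2}} & \frac{4}{5} \\ -\frac{1}{20} & -\frac{3}{10\sqrt{2}} \end{bmatrix}, \qquad G_1 = \begin{bmatrix} \frac{3}{5\sqrt{2}} & 0 \\ \frac{9}{20} & \frac{1}{\sqrt{2}} \end{bmatrix} \tag{3}$$

$$G_2 = \begin{bmatrix} 0 & 0 \\ \frac{9}{20} & -\frac{3}{10\sqrt{2}} \end{bmatrix}, \qquad G_3 = \begin{bmatrix} 0 & 0 \\ -\frac{1}{20} & 0 \end{bmatrix} \tag{4}$$

$$H_0 = \frac{1}{10}\begin{bmatrix} -\frac{1}{2} & -\frac{3}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} & 3 \end{bmatrix}, \qquad H_1 = \frac{1}{10}\begin{bmatrix} \frac{9}{2} & -\frac{10}{\sqrt{2}} \\ -\frac{9}{\sqrt{2}} & 0 \end{bmatrix} \tag{5}$$

$$H_2 = \frac{1}{10}\begin{bmatrix} \frac{9}{2} & -\frac{3}{\sqrt{2}} \\ \frac{9}{\sqrt{2}} & -3 \end{bmatrix}, \qquad H_3 = \frac{1}{10}\begin{bmatrix} -\frac{1}{2} & 0 \\ -\frac{1}{\sqrt{2}} & 0 \end{bmatrix} \tag{6}$$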
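As an illustration of the recursion in (1) and (2), the following is a minimal sketch of a single GHM analysis step in Python with NumPy. It rests on two assumptions that the text does not specify: the input is already a sequence of two-component vectors (the prefiltering that turns a scalar signal into such vectors is omitted), and the signal is extended periodically at its boundary. The function name `ghm_analysis_step` is hypothetical.

```python
import numpy as np

S2 = np.sqrt(2.0)

# GHM low-pass filters G_0..G_3 from (3)-(4).
G = np.array([
    [[3 / (5 * S2), 4 / 5], [-1 / 20, -3 / (10 * S2)]],
    [[3 / (5 * S2), 0.0], [9 / 20, 1 / S2]],
    [[0.0, 0.0], [9 / 20, -3 / (10 * S2)]],
    [[0.0, 0.0], [-1 / 20, 0.0]],
])

# GHM high-pass filters H_0..H_3 from (5)-(6).
H = np.array([
    [[-1 / 2, -3 / S2], [1 / S2, 3.0]],
    [[9 / 2, -10 / S2], [-9 / S2, 0.0]],
    [[9 / 2, -3 / S2], [9 / S2, -3.0]],
    [[-1 / 2, 0.0], [-1 / S2, 0.0]],
]) / 10.0


def ghm_analysis_step(v_prev):
    """One step of (1)-(2): map scale j-1 coefficients to scale j.

    v_prev has shape (N, 2) with N even; returns (v, w), each (N // 2, 2).
    Periodic boundary extension is an assumption of this sketch.
    """
    v_prev = np.asarray(v_prev, dtype=float)
    if v_prev.ndim != 2 or v_prev.shape[1] != 2 or v_prev.shape[0] % 2:
        raise ValueError("expected an (N, 2) array with N even")
    n = v_prev.shape[0]
    v = np.zeros((n // 2, 2))
    w = np.zeros((n // 2, 2))
    for k in range(n // 2):
        for i in range(4):                 # i = m - 2k, the filter index
            x = v_prev[(2 * k + i) % n]    # periodic wrap-around
            v[k] += G[i] @ x
            w[k] += H[i] @ x
    return v, w
```

Applying the function again to the returned `v` gives the next coarser scale. Because the GHM filters are orthogonal, the total energy of `v` and `w` equals that of the input under periodic extension, which is a convenient sanity check on the filter values.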

3.2. Compressed Multiscale Features

After a signal is filtered by a wavelet transformation, its low-frequency and high-frequency components are readily available. In general, most of the signal's energy lies in the low-frequency components, whereas the high-frequency components reflect the details of the input image. Therefore, the simplest way to compress the input image is to set the high-frequency coefficients to zero when reconstructing it from the wavelet transformation. Other options are to zero the high-frequency coefficients of some local regions or to zero the high-frequency coefficients that fall below a threshold. Such simple schemes can cause severe loss of image details, blurred images after compression, or loss of image information.
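The two simple schemes described above can be sketched on a one-dimensional signal. The example below uses the Haar wavelet purely as the simplest stand-in (the paper itself uses the GHM multiwavelet), and the function names are hypothetical.

```python
import numpy as np


def haar_forward(x):
    """Split an even-length signal into low- and high-frequency halves."""
    x = np.asarray(x, dtype=float)
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high


def haar_inverse(low, high):
    """Rebuild the signal from its low- and high-frequency halves."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x


def compress(x, threshold=None):
    """Reconstruct x after discarding high-frequency coefficients.

    threshold=None drops every high-frequency coefficient; otherwise only
    coefficients with magnitude below `threshold` are set to zero.
    """
    low, high = haar_forward(x)
    if threshold is None:
        high = np.zeros_like(high)
    else:
        high = np.where(np.abs(high) >= threshold, high, 0.0)
    return haar_inverse(low, high)
```

With all high-frequency coefficients dropped, each pair of neighbouring samples is replaced by its average, which is exactly the blurring effect mentioned in the text.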

The wavelet transformation is able to decompose the input image at different scales. More importantly, the subimage at each resolution has different frequency properties and a different orientation selectivity. Therefore, it can be used to encode different information of the input image at different scales.

It is widely accepted that the targets in a video sequence are redundant in both the spatial and the frequency domain. The former means that adjacent pixels are spatially correlated. The latter means that adjacent frequencies of a pixel are correlated. In addition, the statistics of image signals indicate that large coefficients are concentrated in the low-frequency regions, so the remaining small coefficients can be assigned only a few bits or not be transmitted at all. This yields a high compression rate with very little information loss.

The compression method based on the multiscale wavelet transformation applies zero-tree coding to the compression of high-spectral images. The principle behind this method is that it exploits the structural correlation of high-spectral images to construct a single effective (shared) image and then determines the positions of the nonzero multiscale wavelet coefficients. The shared image is obtained by combining multiscale frequency coefficients and therefore removes spatial and frequency redundancy, with the purpose of improving compression efficiency.

The one-dimensional (1D) wavelet transformation filters the input signal with a low-pass filter and a high-pass filter and then obtains the low-frequency and high-frequency components by downsampling. According to the Mallat algorithm, the two-dimensional (2D) wavelet transformation can be implemented by several 1D wavelet transformations. Given an input image with m rows and n columns, the 2D wavelet transformation first decomposes the image along each row using the 1D wavelet transformation, which yields two parts, L and H. The second step decomposes the L and H parts along each column using the 1D wavelet transformation. After these two steps, the input image is split into four parts (LL, HL, LH, and HH). The second-level, third-level, or higher-level wavelet transformation is obtained by applying the same process to the result of the previous level. Therefore, the wavelet transformation is an iterative process.
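The row-then-column procedure can be sketched as follows. This is a minimal illustration that uses Haar filters as a stand-in for the GHM filters, assumes that each further level is applied to the LL part, and uses hypothetical function names; the labelling of the mixed subbands (HL versus LH) follows one common convention and is also an assumption.

```python
import numpy as np


def haar_1d(x, axis):
    """1D Haar step along `axis`: filter, then downsample by two."""
    x = np.moveaxis(np.asarray(x, dtype=float), axis, -1)
    if x.shape[-1] % 2:
        raise ValueError("length along the transformed axis must be even")
    even, odd = x[..., 0::2], x[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return np.moveaxis(low, -1, axis), np.moveaxis(high, -1, axis)


def wavelet_level_2d(image):
    """One 2D level: rows first (L, H), then columns (LL, HL, LH, HH)."""
    L, H = haar_1d(image, axis=1)    # step 1: decompose along each row
    LL, LH = haar_1d(L, axis=0)      # step 2: decompose L along each column
    HL, HH = haar_1d(H, axis=0)      # step 2: decompose H along each column
    return LL, HL, LH, HH


def wavelet_decompose_2d(image, levels):
    """Iterate the 2D step on the LL part; return final LL and all details."""
    ll = np.asarray(image, dtype=float)
    details = []
    for _ in range(levels):
        ll, hl, lh, hh = wavelet_level_2d(ll)
        details.append((hl, lh, hh))
    return ll, details
```

For an 8 x 8 image and two levels, the sketch returns a 2 x 2 LL part plus detail subbands of size 4 x 4 and 2 x 2, which makes the iterative halving of resolution explicit.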

To meet the real-time requirements, the dimensionality of the appearance features should not be too high. In this paper, we therefore adopt compressive sensing to reduce the dimensionality of the high-dimensional appearance features. Let $u \in \mathbb{R}^{D}$ be the wavelet features and $\Gamma \in \mathbb{R}^{d \times D}$ be a random matrix computed using the same method as in [26]. The compressed features $v \in \mathbb{R}^{d}$ can be computed as $v = \Gamma u$.
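The projection $v = \Gamma u$ can be sketched in a few lines. The paper builds $\Gamma$ with the method of [26]; the dense Gaussian matrix used below is only an assumed stand-in for illustration, and the function names and the dimensions in the usage note are hypothetical.

```python
import numpy as np


def make_projection(d, D, seed=0):
    """Return a d x D random matrix (Gaussian entries; an assumption here)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d, D)) / np.sqrt(d)


def compress_features(gamma, u):
    """Compute v = gamma @ u for a flattened wavelet feature vector u."""
    u = np.asarray(u, dtype=float).ravel()
    if gamma.shape[1] != u.size:
        raise ValueError("gamma must have as many columns as u has entries")
    return gamma @ u
```

For example, `compress_features(make_projection(50, 1024), u)` maps a 1024-dimensional feature vector to 50 dimensions. The same matrix has to be reused for every sample so that the compressed features remain comparable.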

4. Discriminative SVM Tracking

The support vector machine (SVM) has been a classic method for binary pattern classification since it was proposed by Vapnik in 1995. In this paper, we use an SVM as our tracking model.

4.1. SVM Tracking

To separate the target from its background, our tracking method tries to find a hyperplane in the $d$-dimensional compressed feature space that distinguishes the features of the target from those of its background.

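To achieve this aim, the optimization objective is to maximize the classifier's margin in the feature space. In other words, we need to meet the following conditions:

$$x_i \cdot w + b \ge 0 \quad \text{if } y_i = +1 \tag{7}$$

$$x_i \cdot w + b \le 0 \quad \text{if } y_i = -1 \tag{8}$$

where $x_i$ is the compressed feature vector of the $i$th sample, $w$ and $b$ are the normal vector and the offset of the hyperplane, and $y_i$ is the class label of the $i$th sample: $y_i = +1$ if the sample is the target and $y_i = -1$ if it is background.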

Given training samples and their corresponding labels, we first extract compressed features from each sample using the method introduced in Section 3. The features and their labels are then fed to the SVM to train its parameters. In the tracking stage, we extract the compressed features of each target candidate in the same way as in the training stage and feed them to the SVM to predict the candidate's label. If the features are classified as +1, the candidate is considered a potential target; otherwise, it is discarded. The final target is the potential target candidate with the largest probability.
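The train-then-select procedure described above can be sketched as follows. To stay self-contained, the sketch fits a linear SVM by plain subgradient descent on the regularised hinge loss, which is an assumed stand-in for whatever SVM solver the authors used; it also uses the signed decision score in place of the probability mentioned in the text. All function names and hyperparameter values are hypothetical.

```python
import numpy as np


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit (w, b) by subgradient descent on the L2-regularised hinge loss.

    X has shape (n, d); y holds labels +1 (target) or -1 (background).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1.0            # samples violating the margin
        grad_w = lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def select_target(candidates, w, b):
    """Return the index of the best positively classified candidate.

    Candidates with a non-positive score are rejected; None is returned
    when no candidate is classified as the target.
    """
    scores = np.asarray(candidates, dtype=float) @ w + b
    positive = np.flatnonzero(scores > 0)
    if positive.size == 0:
        return None, scores
    best = positive[np.argmax(scores[positive])]
    return int(best), scores
```

In a tracker, `X` would hold the compressed features of the positive and negative samples, and `candidates` the compressed features of the candidates in the current frame.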

4.2. Model Update

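To make the proposed tracker adapt to changes in target appearance over time, the tracker needs to be updated online. To this end, we update the model using newly collected positive and negative samples. In particular, we collect a set of positive and negative samples at time t and extract the compressed features of all of them using the proposed appearance model. The model is then updated as

$$u_l^{1} \leftarrow \lambda u_l^{1} + (1-\lambda)\, u^{1} \tag{9}$$

$$u_l^{0} \leftarrow \lambda u_l^{0} + (1-\lambda)\, u^{0} \tag{10}$$

$$\delta_l^{1} \leftarrow \sqrt{\lambda (\delta_l^{1})^{2} + (1-\lambda)(\delta^{1})^{2} + \lambda(1-\lambda)(u_l^{1} - u^{1})^{2}} \tag{11}$$

$$\delta_l^{0} \leftarrow \sqrt{\lambda (\delta_l^{0})^{2} + (1-\lambda)(\delta^{0})^{2} + \lambda(1-\lambda)(u_l^{0} - u^{0})^{2}} \tag{12}$$

where $\lambda$ denotes the learning rate, which controls the speed of model updating, and the superscripts 1 and 0 refer to the positive and negative samples, respectively. The statistics of the newly collected samples are

$$u^{1} = \frac{1}{\alpha}\sum_{k=1}^{\alpha} v_{1,l}^{k} \tag{13}$$

$$u^{0} = \frac{1}{\alpha}\sum_{k=1}^{\alpha} v_{0,l}^{k} \tag{14}$$

$$\delta^{1} = \sqrt{\frac{1}{\alpha}\sum_{k=1}^{\alpha} \left(v_{1,l}^{k} - u^{1}\right)^{2}} \tag{15}$$

$$\delta^{0} = \sqrt{\frac{1}{\alpha}\sum_{k=1}^{\alpha} \left(v_{0,l}^{k} - u^{0}\right)^{2}} \tag{16}$$

where $\alpha$ is the number of samples and $v_{1,l}^{k}$ and $v_{0,l}^{k}$ are the $l$th compressed features of the $k$th positive and negative sample, respectively.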
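The update rules can be written compactly for all feature dimensions at once. The sketch below handles one class (positive or negative); it assumes that the mean and deviation are stored as one value per feature dimension and that the sample deviation uses the 1/α normalisation. The function name is hypothetical.

```python
import numpy as np


def update_model(mu_old, sigma_old, features, lam):
    """Blend stored statistics with those of newly collected samples.

    mu_old, sigma_old: arrays of shape (d,), the stored mean and deviation.
    features: array of shape (alpha, d), compressed features of new samples.
    lam: learning rate in [0, 1]; larger values keep more of the old model.
    """
    features = np.asarray(features, dtype=float)
    mu_new = features.mean(axis=0)       # sample mean over the alpha samples
    sigma_new = features.std(axis=0)     # sample deviation (1/alpha form)
    mu = lam * mu_old + (1.0 - lam) * mu_new
    var = (lam * sigma_old ** 2
           + (1.0 - lam) * sigma_new ** 2
           + lam * (1.0 - lam) * (mu_old - mu_new) ** 2)
    return mu, np.sqrt(var)
```

With `lam = 1` the stored model is left unchanged, and with `lam = 0` it is replaced by the statistics of the new samples. The cross term accounts for the shift between the old and new means.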

5. Experimental Results

Target tracking is implemented in a particle filter framework. Several sequences from the OTB100 dataset were chosen to evaluate the proposed tracking method. In the first frame, the target is initialized manually; in a real system, it can instead be initialized by a detector. After the target is initialized, a set of particles is sampled around it. Each particle is accepted or rejected as the target according to its SVM score. In the next frame, the particles are sampled with the tracking result of the previous frame as the mean and a predefined covariance. The process is repeated frame by frame. The flowchart of the proposed tracking method is shown in Figure 1.

Figure 1

The flowchart of the proposed tracking method.
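The per-frame sampling and scoring step described above can be sketched as follows. The state layout, the number of particles, and the function names are assumptions made for illustration; `score_fn` stands for the SVM score of the image patch at a given state.

```python
import numpy as np


def track_frame(prev_state, score_fn, cov, n_particles=200, rng=None):
    """Sample particles around the previous result and keep the best one.

    prev_state: previous tracking result, used as the Gaussian mean.
    score_fn: maps a particle state to a scalar score (higher is better).
    cov: predefined covariance matrix of the sampling distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(prev_state, dtype=float)
    particles = rng.multivariate_normal(mean, cov, size=n_particles)
    scores = np.array([score_fn(p) for p in particles])
    best = int(np.argmax(scores))
    return particles[best], particles, scores
```

Running the tracker over a sequence then amounts to calling `track_frame` once per frame and passing the returned state as `prev_state` for the next frame.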

To test the performance of the proposed method, we compare it with several state-of-the-art trackers, including TLD [36], CXT [1], Struck [37], L1APG [38], and MTT [39]. By quantitatively and qualitatively analyzing the experimental results, we demonstrate the performance of the proposed method.

Two frame-based metrics are widely used in tracking performance evaluation: (1) the center location error, which is defined as the Euclidean distance between the center of the tracked target and the manually labeled ground-truth position; and (2) the bounding box overlap, which is the ratio of the area of the intersection to the area of the union of the tracked bounding box and the ground-truth bounding box. To measure the overall performance of a tracker on a test sequence, the success rate and the precision score are adopted. The former is the percentage of frames whose bounding box overlap is larger than a given threshold. The latter is the percentage of frames whose center location error is less than a given threshold. In each case, when multiple thresholds are used, a curve is drawn to show how the success rate or the precision score varies with the threshold. These curves are called the success plot and the precision plot, respectively. In practical evaluations, we average the curves of a tracker over all the sequences that share the same challenge and show one curve per challenge item rather than per test sequence. In addition, we use the area under the curve (AUC) of the success plot to quantitatively measure the overall performance of a tracker on a challenge item.
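The metrics defined above can be computed as in the following sketch. It assumes that bounding boxes are given as (x, y, width, height) with positive area, and it approximates the AUC by averaging the success rate over evenly spaced thresholds; both choices, as well as the function names, are assumptions of this example.

```python
import numpy as np


def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    c_pred = pred[..., :2] + pred[..., 2:] / 2.0
    c_gt = gt[..., :2] + gt[..., 2:] / 2.0
    return np.linalg.norm(c_pred - c_gt, axis=-1)


def overlap(pred, gt):
    """Intersection area divided by union area of two sets of boxes."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    x1 = np.maximum(pred[..., 0], gt[..., 0])
    y1 = np.maximum(pred[..., 1], gt[..., 1])
    x2 = np.minimum(pred[..., 0] + pred[..., 2], gt[..., 0] + gt[..., 2])
    y2 = np.minimum(pred[..., 1] + pred[..., 3], gt[..., 1] + gt[..., 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    union = pred[..., 2] * pred[..., 3] + gt[..., 2] * gt[..., 3] - inter
    return inter / union


def success_rate(overlaps, threshold):
    """Fraction of frames whose overlap is larger than the threshold."""
    return float(np.mean(np.asarray(overlaps) > threshold))


def precision_score(errors, threshold):
    """Fraction of frames whose center error is less than the threshold."""
    return float(np.mean(np.asarray(errors) < threshold))


def success_auc(overlaps, n_thresholds=21):
    """Approximate the area under the success plot on thresholds in [0, 1]."""
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    return float(np.mean([success_rate(overlaps, t) for t in thresholds]))
```

Evaluating `success_rate` or `precision_score` over a range of thresholds gives the points of the success plot and the precision plot for one sequence.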

5.1. Quantitative Comparison

The overall precision plots and success plots are shown in Figure 2, from which we can see that the proposed method outperforms other methods in terms of the overall precision plots and success plots.

Figure 2

Overall precision plots and success plots on the test sequences.

5.2. Qualitative Comparison

To further show the superiority of the proposed method, we show several examples of tracking results in Figures 3 and 4. As Figure 3 shows, the proposed tracker outperforms the other trackers on several representative frames of two sequences. More tracking results are shown in Figure 4, from which we can see that the proposed tracker also achieves the best tracking performance.

Figure 3

Examples of tracking results on representative frames of two sequences.

Figure 4

Examples of tracking results on representative frames of four other sequences.

6. Conclusion

In this paper, we propose to use compressed features to model the tracked target's appearance and then use an SVM to perform tracking. The experimental results indicate that the proposed method outperforms several state-of-the-art methods. The advantages of the proposed method are twofold: (1) it handles scale changes of the target over time well because its features are obtained by a multiscale wavelet transformation; (2) it can run in real time because the dimensionality of its features is reduced by compressed sensing techniques.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The research is supported by Project of Shandong Province Higher Educational Science and Technology Program (no. J14LN64).

References

[1] Dinh T. B.; Medioni. Context tracker: exploring supporters and distracters in unconstrained environments. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), pp. 1177–1184, June 2011. doi:10.1109/cvpr.2011.5995733

[2] Henriques J. F.; Caseiro; Martins; Batista. Exploiting the circulant structure of tracking-by-detection with kernels. Proceedings of the European Conference on Computer Vision, pp. 702–715, 2012.

[3] Kwon; Lee K. M. Tracking by sampling trackers. Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV '11), pp. 1195–1202, November 2011. doi:10.1109/ICCV.2011.6126369

[4] Han; De With P. H. N. Real-time multiple people tracking for automatic group-behavior evaluation in delivery simulation training. Multimedia Tools and Applications, vol. 51, no. 3, pp. 913–933, 2011. doi:10.1007/s11042-009-0423-4

[5] Han; Jiao. Combined feature evaluation for adaptive visual object tracking. Computer Vision and Image Understanding, vol. 115, no. 1, pp. 69–80, 2011. doi:10.1016/j.cviu.2010.09.004

[6] Han; Jiao; Zhang; Liu. Visual object tracking via sample-based Adaptive Sparse Representation (AdaSR). Pattern Recognition, vol. 44, no. 9, pp. 2170–2183, 2011. doi:10.1016/j.patcog.2011.03.002

[7] Han; Pauwels E. J.; De Zeeuw P. M.; De With P. H. N. Employing a RGB-D sensor for real-time tracking of humans across multiple re-entries in a smart environment. IEEE Transactions on Consumer Electronics, vol. 58, no. 2, pp. 255–263, 2012. doi:10.1109/TCE.2012.6227420

[8] Gao; Han; Jiao. Real-Time Multipedestrian Tracking in Traffic Scenes via an RGB-D-Based Layered Graph Model. IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2814–2825, 2015. doi:10.1109/TITS.2015.2423709

[9] Zhang; Chen; Strobel; Comaniciu. Robust object tracking using semi-supervised appearance dictionary learning. Pattern Recognition Letters, vol. 62, pp. 17–23, 2015. doi:10.1016/j.patrec.2015.04.010

[10] Zhang; Zhou; Yao; Zhang; Wang; Zhang. Adaptive NormalHedge for robust visual tracking. Signal Processing, vol. 110, pp. 132–142, 2015. doi:10.1016/j.sigpro.2014.08.027

[11] Zhang; Kasiviswanathan; Yuen P. C.; Harandi. Online dictionary learning on symmetric positive definite manifolds with vision applications. Proceedings of the AAAI Conference on Artificial Intelligence, pp. 3165–3173, January 2015.

[12] You; Tao; Tang Y. Y. Connected component model for multi-object tracking. IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3698–3711, 2016. doi:10.1109/TIP.2016.2570553

[13] Liu; Wang; Zhang; Chen W.-S. A multi-view model for visual tracking via correlation filters. Knowledge-Based Systems, vol. 113, pp. 88–99, 2016. doi:10.1016/j.knosys.2016.09.014

[14] Zhang; Qin; Yao; Huang; Lim; Yang M.-H. Hedged deep tracking. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 4303–4311, July 2016.

[15] Cheung Y.-M.; You; Tang Y. Y. Robust Object Tracking via Key Patch Sparse Representation. IEEE Transactions on Cybernetics, vol. 47, no. 2, pp. 354–364, 2017.

[16] Shi; Zhang; Xie; Gao; Zheng. Robust tracking with per-exemplar support vector machine. IET Computer Vision, vol. 9, no. 5, pp. 699–710, 2015. doi:10.1049/iet-cvi.2014.0234

[17] Wilf; Zhang; Chikkerur; Little S. A.; Wing S. L.; Serre. Computer vision cracks the leaf code. Proceedings of the National Academy of Sciences of the United States of America, vol. 113, no. 12, pp. 3305–3310, 2016. doi:10.1073/pnas.1524473113

[18] Liu; Lin; Shao; Shen; Ding; Han. Sequential discrete hashing for scalable cross-modality similarity retrieval. IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 107–118, 2017. doi:10.1109/TIP.2016.2619262

[19] Guo; Ding; Liu; Han; Shao. Learning to hash with optimized anchor embedding for scalable retrieval. IEEE Transactions on Image Processing, vol. 26, no. 3, pp. 1344–1354, 2017. doi:10.1109/TIP.2017.2652730

[20] Zhang; Yao; Sun; Wang; Zhang. Action recognition based on overcomplete independent components analysis. Information Sciences, vol. 281, pp. 635–647, 2014. doi:10.1016/j.ins.2013.12.052

[21] Jiang; Zhang; Gao; Zhao. Multi-layered gesture recognition with Kinect. Journal of Machine Learning Research (JMLR), vol. 16, pp. 227–254, 2015.

[22] Chen; Ding; Han. Attribute-based supervised deep learning model for action recognition. Frontiers of Computer Science, vol. 11, no. 2, pp. 219–229, 2017. doi:10.1007/s11704-016-6066-5

[23] Zhang; Lan; Yao; Zhou; Tao. A biologically inspired appearance model for robust visual tracking. IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 10, pp. 2357–2370, 2017. doi:10.1109/TNNLS.2016.2586194

[24] Zhang; Lan; Yuen P. C. Robust Visual Tracking via Basis Matching. IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 3, pp. 421–430, 2017. doi:10.1109/TCSVT.2016.2539860

[25] Lan; Zhang; Yuen P. C. Robust joint discriminative feature learning for visual tracking. Proceedings of the 25th International Joint Conference on Artificial Intelligence, pp. 3403–3410, July 2016.

[26] Zhang; Zhou; Jiang. Robust visual tracking using structurally random projection and weighted least squares. IEEE Transactions on Circuits and Systems for Video Technology, vol. 25, no. 11, pp. 1749–1760, 2015. doi:10.1109/TCSVT.2015.2406194

[27] Zhang; Yao; Sun. Sparse coding based visual tracking: review and experimental comparison. Pattern Recognition, vol. 46, no. 7, pp. 1772–1788, 2013. doi:10.1016/j.patcog.2012.10.006

[28] Zhang S. H.; Yao; Zhou; Sun; Liu S. H. Robust visual tracking based on online learning sparse representation. Neurocomputing, vol. 100, pp. 31–40, 2013. doi:10.1016/j.neucom.2011.11.031

[29] Zhang; Yao; Sun; Liu. Robust visual tracking using an effective appearance model based on sparse coding. ACM Transactions on Intelligent Systems and Technology, vol. 3, no. 3, pp. 43:1–43:18, 2012.

[30] Yakut; Kehtarnavaz. Ice-hockey puck detection and tracking for video highlighting. Signal, Image and Video Processing, vol. 10, no. 3, pp. 527–533, 2016. doi:10.1007/s11760-015-0764-6

[31] Topkaya I. S.; Erdogan; Porikli. Tracklet clustering for robust multiple object tracking using distance dependent Chinese restaurant processes. Signal, Image and Video Processing, vol. 10, no. 5, pp. 795–802, 2016. doi:10.1007/s11760-015-0817-x

[32] Wang; Zhao. Robust object tracking via online Principal Component–Canonical Correlation Analysis (P3CA). Signal, Image and Video Processing, vol. 9, no. 1, pp. 159–174, 2015. doi:10.1007/s11760-013-0430-9

[33] Shan; Zhang. Visual tracking using IPCA and sparse representation. Signal, Image and Video Processing, vol. 9, no. 4, pp. 913–921, 2015. doi:10.1007/s11760-013-0525-3

[34] Zhang; Lei Qin. Hedging Deep Features for Visual Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE T-PAMI), 2018. doi:10.1109/TPAMI.2018.2828817

[35] Sembiring; Sabzevary A. S.; Akizuki. Stochastic process on multiwavelet. IFAC Proceedings Volumes, vol. 35, no. 1, pp. 211–215, 2002.

[36] Kalal; Matas; Mikolajczyk. P-N learning: bootstrapping binary classifiers by structural constraints. Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 49–56, June 2010. doi:10.1109/CVPR.2010.5540231

[37] Hare; Saffari; Torr P. H. S. Struck: structured output tracking with kernels. Proceedings of the IEEE International Conference on Computer Vision (ICCV '11), Barcelona, Spain, IEEE, pp. 263–270, November 2011. doi:10.1109/iccv.2011.6126251

[38] Bao; Ling. Real Time Robust L1 Tracker Using Accelerated Proximal Gradient Approach. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1830–1837, June 2012.

[39] Zhang; Ghanem; Liu; Ahuja. Robust visual tracking via multi-task sparse learning. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.