Co-Metric Learning for Person Re-Identification

Person re-identification, which aims to identify the same pedestrian across disjoint camera views, is a key technique in intelligent video surveillance. Although existing methods have advanced both theory and experimental results, most of the effective ones rely on fully supervised training, which suffers badly from the small sample size (SSS) problem, especially in label-insufficient practical applications. To bridge the SSS problem and model learning with few labels, a novel semisupervised co-metric learning framework is proposed to learn a discriminative Mahalanobis-like distance matrix for label-insufficient person re-identification. Unlike a typical co-training task, whose data are multiview from the outset, single-view person images are first decomposed into pseudo two views, and then metric learning models are produced and jointly updated, iteratively, based on both pseudo-labels and references. Experiments carried out on three representative person re-identification datasets show that the proposed method performs better than the state of the art and has low label sensitivity.


Introduction
Person re-identification (re-id), namely, seeking occurrences of a query person (probe) among person candidates (gallery), is a popular and challenging topic in intelligent video surveillance [1,2]. It also underpins many crucial multimedia applications, such as person retrieval [3,4], long-term pedestrian tracking [5,6], and cross-view action analysis [7]. The main challenge of re-id is that intrapersonal visual variations across camera views can be even larger than interpersonal ones, owing to significant changes in viewpoint, illumination, body pose, and background clutter (see Figure 1). Moreover, traditional biometrics such as gait and face are unreliable, especially in uncontrolled practical environments; researchers therefore usually carry out person re-identification based on body appearance. Current person re-identification methods fall primarily into two categories: feature construction and learning, and subspace/metric learning. Owing to growing attention from the computer vision and machine learning communities in recent years, both the theory and the experimental results of person re-identification have improved greatly.
(ii) Subspace and metric learning aims at seeking a proper subspace or distance measure by Mahalanobis-like metric learning [21-30]. Given a set of person image pairs, metric learning based methods learn an optimal positive semidefinite matrix (positive semidefiniteness ensures a valid metric) that maximizes the probability that a true match pair has a smaller distance than a wrong match pair.
Whether based on feature learning or metric learning, state-of-the-art methods usually exploit the labelled training data as far as possible and are thus typically fully supervised. However, labels are always insufficient in practical applications, so the number of labelled training samples can be even smaller than the feature dimension. This is the small sample size (SSS) problem [31], a core challenge of learning based person re-identification. To address the SSS issue, many training styles have been designed for noisy learning and inadequate supervision [32], and co-training remains one of the most important paradigms, still vibrant for multiview learning [33]. Therefore, motivated by semisupervised co-training [34], we propose a novel co-metric learning framework for person re-identification to bridge inadequate labelled data and the metric learning model.
In a typical co-training work, training data are used to learn classification models in two views separately, while the update of each model benefits from the other view. However, unlike applications where data are collected from multimodal sources, person re-identification datasets are commonly presented as single-view pedestrian images. The core difficulty of applying the co-training paradigm to person re-identification therefore lies in learning and updating models from a single view. Higher-dimensional features carry more useful information but also more noise, so dimension reduction is usually necessary in feature extraction. If we decompose the high-dimensional features into two views before dimension reduction, we are likely to obtain different but effective descriptions in pseudo two views for our co-metric learning framework. Therefore, we first present a binary-weight learning method that automatically splits the single-view representation into pseudo two views; then a metric learning model is learned in each view for matching the unlabelled training samples; finally, the metrics benefit each other and are jointly and iteratively updated based on the ranking lists of unlabelled samples.
The main contributions of this paper can be summarized as follows: (1) An effective co-metric learning framework is presented for semisupervised person re-identification; it can learn a discriminative Mahalanobis-like distance matrix even without adequate labelled data. (2) Pseudo two views of person data, obtained by self-adaptive feature decomposition, are used to generate the metrics. (3) Both pseudo-labels and references on the unlabelled dataset are adopted to obtain discriminative metric updates. The rest of the paper is organized as follows. Section 2 gives a brief review of related work on person re-identification. Section 3 explains our method in detail. Section 4 presents experimental results compared with the state of the art on three datasets. Section 5 concludes the paper.

Related Work
In this section, we give a brief review of the studies most related to the person re-identification task. Typically, current person re-identification research can be categorized into two classes: feature representation based methods and distance measure based methods.
Feature representation based methods focus on constructing discriminative visual descriptions by feature selection or learning. Gheissari et al. [8] generated salient edges based on a spatial-temporal segmentation algorithm and then obtained an invariant identity signature by combining normalized color and salient edge histograms. Wang et al. [9] designed a co-occurrence matrix based appearance model to capture the spatial distribution of the appearance relative to each of the object parts. Farenzena et al. [10] combined multiple features from five body regions that are located by symmetry and asymmetry perceptual principles. Kviatkovsky et al. [11] found that color structure descriptors derived from different body parts are invariant under different lighting conditions. To improve the discriminative power of visual descriptions, feature selection is used to pick out more robust feature weightings, dimensions, or patch salience. Gray et al. [12] transformed person re-identification into a classification problem and employed an ensemble of localized features through the AdaBoost algorithm. Zhao et al. [13] applied adjacency constrained patch matching to build dense correspondence between image pairs and assigned salience to each patch in an unsupervised manner. Some recent works introduce deep learning frameworks to acquire robust local feature representations and then encode them. Li et al. [14] learned a unified deep filter by introducing a patch matching layer and a max-out grouping layer for person re-identification. Ahmed et al. [15] presented a deep convolutional architecture that captures local relationships between person images based on mid-level features. Generally, deep learning is used in person re-identification to learn feature representations from deep convolutional features [14-17] or from fully connected features [18-20].
Distance measure based methods aim at finding a uniform distance measure by subspace learning or metric learning. Most successful metric learning algorithms demonstrate a clear superiority when trained with supervision. Hirzer et al. [21] and Dikmen et al. [22] used a classical metric learning method called LMNN to learn an optimal metric for person re-identification. Zheng et al. [23] learned a Mahalanobis distance metric with a probabilistic relative distance comparison method. Kostinger et al. [24] introduced a simpler metric function (KISSME) to fit pairwise samples based on a Gaussian distribution hypothesis, and Tao et al. [25] obtained a better estimate of the covariance matrices of KISS metric learning by seamlessly integrating smoothing and regularization. Mignon et al. [26] learned a distance metric from sparse pairwise similarity/dissimilarity constraints in a high-dimensional space, a method called pairwise constrained component analysis. Pedagadi et al. [27] conducted a metric-like work that combined unsupervised PCA dimensionality reduction and Local Fisher Discriminant Analysis. Li et al. [28] proposed to learn a decision function that joins a distance metric with a locally adaptive thresholding rule. Deep learning, currently the most popular machine learning paradigm, has also been adopted to learn the distance metric (Wang et al. [29]). Wang et al. [30] put forward a data-driven distance metric method that re-exploits the training data to adjust the metric for each query-gallery pair.

Methodology
This section presents the main procedures of our co-metric learning framework (see Figure 2), which mainly include self-adaptive feature decomposition for pseudo two-view metric learning and semisupervised metric update based on pseudo-labels and references.
In (1), M is a positive semidefinite matrix, which guarantees a valid metric. By decomposing M as $M = L^{\top}L$, (1) can be rewritten for a pair of samples $x_a^i$ and $x_b^j$ as $d_M(x_a^i, x_b^j) = \|L(x_a^i - x_b^j)\|_2^2$ (2). It is easy to see from this derivation that the essence of metric learning is to seek an optimal projection matrix M (or L) under supervised information that generally contains two kinds of pairwise constraints, i.e., similar constraints and dissimilar constraints. However, labelled data are usually difficult or expensive to obtain, whereas unlabelled data are abundant and easily acquired. Therefore, learning from both labelled and unlabelled samples is not only meaningful but also pressing for practical intelligent video surveillance.
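The equivalence between the Mahalanobis-like form and the projected Euclidean form can be checked numerically. The following is a minimal sketch, assuming NumPy and randomly generated data; the function names and the 5-to-3 dimensional projection are illustrative choices, not part of the paper.

```python
import numpy as np


def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis-like distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = x_i - x_j
    return float(diff @ M @ diff)


def projected_sq(x_i, x_j, L):
    """Squared Euclidean distance after projecting the difference with L."""
    diff = L @ (x_i - x_j)
    return float(diff @ diff)


rng = np.random.default_rng(0)
L = rng.normal(size=(3, 5))   # hypothetical projection from 5-D to 3-D
M = L.T @ L                   # positive semidefinite by construction
x_i = rng.normal(size=5)
x_j = rng.normal(size=5)

d_metric = mahalanobis_sq(x_i, x_j, M)
d_proj = projected_sq(x_i, x_j, L)
```

Both values agree up to floating-point error, which is why learning M and learning the projection L are interchangeable views of the same problem.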

Self-Adaptive Feature Decomposition
The loss $\ell(w^1)$ is constructed so that, after applying the binary-weight vector $w^1$, $x_a^i$ is as similar as possible to $x_b^i$ and as dissimilar as possible to $x_b^j$ ($j \neq i$). $d(\cdot)$ denotes the normalized distance between objects; here the Euclidean distance is adopted. Similarly, $\ell(w^2)$ is constructed for $w^2$. Then $\ell(w^1)$ and $\ell(w^2)$ are trained jointly under the constraints of (3) by minimizing the maximum of the two, as in (5). The zero-mean Gaussian structure [24] of the difference space gives the decision function (6).
By the log-likelihood ratio test, the decision function (6) can be simplified as (7), and the distance between $x_i$ and $x_j$ can then be written as (8). The matrix of the quadratic form in (8) plays the role of the original semidefinite matrix M in the Mahalanobis-like metric function. Since the ranking lists of unlabelled training samples are calculated based on (8), the core issue becomes how to use these ranking lists for metric update. Three observations help answer this question. First, co-training lets the models in the two views teach each other; thus the ranking list in one view should benefit the model in the other. Second, the top-n samples in a ranking list are likely to be visually more similar to the probe, whereas the bottom-m samples are likely to be more dissimilar; thus the top-n and bottom-m samples can be treated as positive and negative pseudo-labels for the iterative metric update of the other view. Third, the aim of co-training is to reach agreement between the two views, that is, to increase the number of consensual pseudo-labels from both views. The top-k neighbours of consensual pseudo-labels on the unlabelled sample set $\mathcal{P}_U$ may therefore also be useful for metric update, and they can be regarded as special references. Therefore, we attempt to learn a generic model $f(M_1)$ that updates the metric learning model by discovering both pseudo-labels and references.
Assume that a metric model $M_1$ is learned in view $x^1$. Then $f_2(M_1)$ both pulls the pseudo-positives $P_i^{+}$ close to the referential-positives $R_i^{+}$ and pulls the pseudo-negatives $P_i^{-}$ close to the referential-negatives $R_i^{-}$, which is expressed as an $\arg\min$ problem. Finally, metric update becomes an optimization problem with the objective function (11). A gradient descent algorithm is adopted to optimize (11), and the learning procedure of metric model $M_2$ is similar to that of $M_1$. After the iterative update, the final $M_1$, $M_2$, or a combination of the two can be used on the test dataset.
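The selection of pseudo-labels from ranking lists can be illustrated with a short sketch. It assumes that each ranking list is a Python list of unlabelled-sample indices sorted from closest to farthest for one probe; the helper names `pseudo_labels` and `consensual` and the example indices are hypothetical.

```python
def pseudo_labels(ranking, n, m):
    """Return (top-n pseudo-positives, bottom-m pseudo-negatives) of one ranking list."""
    if n + m > len(ranking):
        raise ValueError("n + m must not exceed the ranking length")
    positives = set(ranking[:n])
    negatives = set(ranking[len(ranking) - m:])
    return positives, negatives


def consensual(ranking_view1, ranking_view2, n, m):
    """Keep only the pseudo-labels on which both views agree."""
    pos1, neg1 = pseudo_labels(ranking_view1, n, m)
    pos2, neg2 = pseudo_labels(ranking_view2, n, m)
    return pos1 & pos2, neg1 & neg2


# Hypothetical ranking lists of seven unlabelled samples for the same probe.
rank_v1 = [4, 7, 1, 3, 9, 0, 5]
rank_v2 = [7, 4, 3, 1, 0, 5, 9]

pos_v1, neg_v1 = pseudo_labels(rank_v1, n=2, m=2)
agree_pos, agree_neg = consensual(rank_v1, rank_v2, n=2, m=2)
```

In a co-training loop, the pseudo-labels from view 1 would feed the update of the view-2 metric and vice versa, while the consensual sets would seed the search for top-k neighbour references.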

Experimental Results
In this section, the proposed method is validated by comparison with state-of-the-art person re-identification approaches on three publicly available datasets: the VIPeR dataset [35], the PRID2011 dataset [43], and the PRID450s dataset [44]. The widely used VIPeR dataset contains 632 person image pairs obtained from two different cameras. Some example images are shown in Figure 3(a).

4.1. Implementation Details. Both hand-crafted and deeply learned features are adopted as the original single-view representations in this paper. The hand-crafted feature employs salient color names [42], and the deeply learned feature is produced by a typical Siamese convolutional neural network [45]. All quantitative results are reported as standard Cumulated Matching Characteristics (CMC) curves [9], which plot recognition performance against rank and represent the expectation of finding the correct match within the top $r$ matches. Following the evaluation protocol of [23], each dataset is randomly divided into two halves, one for training and the other for testing. However, unlike fully supervised methods, for which all training data are labelled, only one-third of the training data are labelled in this semisupervised evaluation, while the remaining training data are unlabelled, similarly to [37]. All images from camera view A are treated as probes and those from camera view B as the gallery set. For each probe image, there is one matching person image in the gallery set. When comparing two methods, we use the same configuration in each trial to obtain the ranking lists. To achieve stable statistics, we repeat the evaluation procedure 10 times.
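The CMC evaluation described above can be sketched as follows. This is a minimal single-shot version, assuming NumPy, a precomputed probe-by-gallery distance matrix, and exactly one true match per probe in the gallery; the function name `cmc_curve` and the toy distances are illustrative only.

```python
import numpy as np


def cmc_curve(dist, probe_ids, gallery_ids):
    """Return the CMC curve; entry r-1 is the fraction of probes matched within rank r.

    dist[i, j] is the distance between probe i and gallery item j.
    Assumes every probe identity appears exactly once in the gallery.
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)             # gallery indices, closest first
    ranked_ids = gallery_ids[order]              # identities in ranked order
    hits = ranked_ids == probe_ids[:, None]
    first_hit = hits.argmax(axis=1)              # 0-based rank of the true match
    counts = np.bincount(first_hit, minlength=dist.shape[1])
    return np.cumsum(counts) / len(probe_ids)


# Toy example with three probes and three gallery images.
dist = [[0.1, 0.9, 0.5],
        [0.8, 0.7, 0.2],
        [0.6, 0.3, 0.4]]
cmc = cmc_curve(dist, probe_ids=[0, 1, 2], gallery_ids=[0, 1, 2])
```

Here only the first probe is matched at rank 1, while all three are matched by rank 2, so the curve rises from one-third to one. Averaging such curves over the 10 random splits gives the reported rank@1, rank@5, and rank@10 values.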

Experiments on VIPeR.
We compare our co-metric learning (CML) based person re-identification method with ten published unsupervised, semisupervised, and fully supervised results on the VIPeR dataset. The unsupervised/semisupervised approaches include SDALF [10], eSDC [13], TSR [36], SSCDL [37], and Null-semi [38]; the fully supervised baselines include KISSME [24], kLFDA [39], DeepNN [15], Null Space [38], and XQDA [40]. Semisupervised person re-identification usually assumes that labels are available for one-third of the training set, whereas fully supervised approaches use a fully labelled training set. The quantitative comparison is summarized in Table 1.
We make the following observations: (1) Our method achieves a 32.9% matching rate at rank@1, improving on the previous best result by over 1.1%, and its matching rates at rank@5 and rank@10 are also the highest among all unsupervised/semisupervised results. (2) Compared with the fully supervised baselines, our result is also competitive, especially at rank@1; e.g., the performances of KISSME and kLFDA are both lower than that of our CML. (3) Although there is still a long way to go to reach the best fully supervised result, our approach needs only one-third of the training data to be labelled, which makes it more suitable for label-insufficient practical environments.

Experiments on PRID2011.
Compared with the VIPeR dataset, PRID2011 contains fewer person images, so the training sample size may be much smaller than the feature dimension; i.e., the SSS problem can be worse. We compare against the state-of-the-art semisupervised baselines kCCA [41], kLFDA [39], XQDA [40], and Null-semi [38] on PRID2011, running their available implementation code with the same LOMO features. Table 2 shows that (1) our method achieves the best matching rates at rank@1 and rank@5 among all baselines, and at rank@10 it is only 0.2% below Null-semi, which performs best at that rank; (2) owing to the small sample size, our approach and the baselines all yield much poorer results on PRID2011 than on VIPeR.

Experiments on PRID450s.
Several published unsupervised/semisupervised methods, SDALF [10], eSDC [13], and TSR [36], and the fully supervised KISSME [24] and SCNCD [42] are used as baselines on PRID450s. Our method performs much better than all unsupervised/semisupervised comparisons (see Table 3). It achieves 61.8% at rank@5 and 73.8% at rank@10, improving on the previous best results by over 10%. Moreover, to verify the label sensitivity of our CML method, we test SCNCD with metric learning and our method with 1/2, 1/3, and 1/5 of the training samples labelled (see Table 4). Our results significantly exceed those of SCNCD at every labelled training size and decrease only gently as the number of labelled samples drops, whereas SCNCD declines sharply, especially with 1/5 of the training samples labelled. This is because the within-class scatter matrix of traditional metric learning becomes singular when the number of labels is smaller than the dimension of the feature representation. By contrast, our method combines labelled and unlabelled data in the learning procedure, which makes it more robust and less sensitive to the number of labels.
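The singularity argument can be made concrete with a small numerical sketch. It assumes NumPy and synthetic Gaussian features; the sizes (10 persons, 2 images each, 50 dimensions) are arbitrary illustrative values rather than the experimental settings.

```python
import numpy as np


def within_class_scatter(X, y):
    """Sum over classes of the scatter of samples around their class mean."""
    dim = X.shape[1]
    scatter = np.zeros((dim, dim))
    for label in np.unique(y):
        class_samples = X[y == label]
        centered = class_samples - class_samples.mean(axis=0)
        scatter += centered.T @ centered
    return scatter


rng = np.random.default_rng(0)
dim = 50                                   # feature dimension
num_persons = 10
y = np.repeat(np.arange(num_persons), 2)   # two labelled images per person
X = rng.normal(size=(y.size, dim))         # 20 samples, far fewer than 50 dimensions

S_w = within_class_scatter(X, y)
rank = np.linalg.matrix_rank(S_w)
```

The rank of the scatter matrix is bounded by the number of samples minus the number of classes, which here is well below the feature dimension, so the matrix cannot be inverted without regularization or dimension reduction.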

Conclusions
This paper proposes a novel semisupervised co-metric learning framework for label-insufficient person re-identification.
To bridge the small sample size problem and model learning with few labels, and motivated by co-training, which is commonly used for learning with insufficient or imperfect labels, we adopt binary-weight learning to decompose single-view person features into pseudo two views. These views are used to learn two metric models in a co-training style, and the metrics are then jointly updated by discovering both pseudo-labels and references. Experiments on three representative person re-identification datasets show that the proposed method performs better than the state of the art with a small labelled sample size and has low label sensitivity.

Figure 1: Illustration of the person re-identification task. (a) Person image samples from the VIPeR dataset [35], in which each column shows images of the same person and each row shows images observed from the same camera view; the appearance of the same person changes severely across camera views. (b) Example illustrating the characteristics of person re-identification in a practical surveillance environment; as can be seen, gait and face cannot be exploited because of low resolution and occlusion.

Figure 2: Flowchart of the co-metric learning framework for person re-identification. Single-view features of the training data are first decomposed into pseudo two views for learning the corresponding metric models. Then the ranking list of the unlabelled dataset in each view is generated via distance measurement. Finally, the positive and negative pseudo-labels, which are the top-n and bottom-m samples of the ranking list, respectively, and the references, which are the top-k neighbours of consensual pseudo-labels (marked red), are jointly used for metric update.

Figure 3: Some samples of two public datasets. Each column shows two images of the same person from two different cameras. (a) VIPeR dataset. (b) PRID2011 dataset.
All images of individuals are normalized to a size of 128×48 pixels. View change is the most significant cause of appearance change, with most of the matched image pairs containing a viewpoint change of 90 degrees. Other variations, such as illumination conditions and image quality, are also present. PRID2011 is a challenging dataset captured by two surveillance cameras with a serious variation in camera characteristics, as shown in Figure 3(b). Specifically, images of 385 persons come from one camera and images of 749 persons from the other, with 200 persons appearing in both views. All images are normalized to 128×48 pixels. PRID450s is an extension of PRID2011; it has significant and consistent lighting changes and chromatic variation, and it contains 450 single-shot image pairs captured over two spatially disjoint camera views. All images are normalized to 168×80 pixels.
3.1. Problem Formulation. Under a semisupervised person re-identification setting, we consider a pair of cameras $C_a$ and $C_b$ with nonoverlapping fields of view and a training person set $\mathcal{P} = \{\mathcal{P}_L, \mathcal{P}_U\}$. The labelled training person set $\mathcal{P}_L = \{p_1, p_2, \ldots, p_N\}$ is associated with the two cameras, where $N$ is the number of persons. Images of persons captured from $C_a$ and $C_b$ are denoted as $x_a^i$ and $x_b^j$. The two labelled training sets corresponding to $C_a$ and $C_b$ are represented by $X_{L,a} = \{x_a^1, \ldots, x_a^i, \ldots, x_a^N\}$, $1 \le i \le N$, and $X_{L,b} = \{x_b^1, \ldots, x_b^j, \ldots, x_b^N\}$, $1 \le j \le N$, where $i = j$ means the same person $p_i$. Then let the unlabelled training person set be $\mathcal{P}_U = \{p_{N+1}, p_{N+2}, \ldots, p_{N+M}\}$, with unlabelled image sets $X_{U,a}$ and $X_{U,b}$ defined in the same way; however, $x_a^i$ and $x_b^j$ may not be the same pedestrian here even if $i = j$. A classical supervised metric learning algorithm [21] trains a Mahalanobis-like distance function based on $X_{L,a}$ and $X_{L,b}$. Given a pair of training samples $x_a^i$ and $x_b^j \in \mathbb{R}^d$, the distance is $d_M(x_a^i, x_b^j) = (x_a^i - x_b^j)^{\top} M (x_a^i - x_b^j)$ (1).

3.2. Self-adaptive Feature Decomposition. Given a set of single-view training samples, the aim is to produce pseudo two-view representations that can be used for learning a metric model in each view. In a typical co-training task, there is a dataset $D$ consisting of two feature views $x^1$ and $x^2$, which satisfy two conditions [34]: (1) two hypotheses $h_1, h_2 \in H$ exist with low error on $x^1$, $x^2$; (2) $x^1$ and $x^2$ need to be conditionally independent. To meet these demands, a binary learning method based on binary-weight vectors $w^1$, $w^2$ is proposed to decompose single-view features $x \in \mathbb{R}^d$ with dimension $d$ automatically into two totally different but both effective views $x^1$, $x^2$, which can be treated as pseudo two-view features of the training samples: $x_k^1 = x_k \cdot w_k^1$, $x_k^2 = x_k \cdot w_k^2$. Here $x_k$, $x_k^1$, $x_k^2$ indicate the $k$th dimension of $x$, $x^1$, $x^2$, respectively, $1 \le k \le d$. To make $x^1$, $x^2$ conditionally independent, a succinct way is to let each $x_k$ be used by only one of $x_k^1$ and $x_k^2$. $w^1$, $w^2$ are trained together on the labelled sample set $\mathcal{P}_L$, ensuring that the feature representations generated from $w^1$, $w^2$ both perform well.
3.3. Metric Update. After acquiring the pseudo two-view representations $x^1$, $x^2$ of person images, a Mahalanobis-like metric model is learned in each view for matching the unlabelled training samples. Consider a pairwise difference $\Delta = x_i - x_j$, $x_i, x_j \in X$, where $X$ is the person dataset. $\Delta$ is an intrapersonal difference if $y_i = y_j$, namely, $y_{ij} = 1$, and an interpersonal difference if $y_i \neq y_j$, namely, $y_{ij} = 0$. A Mahalanobis-like metric can be learned via a zero-mean Gaussian structure.

The metric model $M_1$ in view $x^1$ is learned on the labelled sample set $\mathcal{P}_L$ and the unlabelled training samples $U = \{u_1, \ldots, u_i, \ldots, u_M\}$, $1 \le i \le M$, on $\mathcal{P}_U$. $P_i^{+}$ and $P_i^{-}$ denote the positive and negative pseudo-labels of $u_i$ obtained from metric model $M_2$ in view $x^2$, with $P_i^{+} = \{u_{i,1}^{+}, u_{i,2}^{+}, \ldots, u_{i,n}^{+}\}$.
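The element-wise split into pseudo two views can be sketched in a few lines. The sketch assumes NumPy and assumes that the second binary-weight vector is the complement of the first, which is one simple way to ensure each dimension is used by exactly one view; the mask below is fixed by hand, whereas in the paper the weights are learned on labelled data. The name `split_views` is hypothetical.

```python
import numpy as np


def split_views(X, w1):
    """Split features X (samples x dims) into two views with complementary binary masks."""
    w1 = np.asarray(w1)
    if not np.isin(w1, (0, 1)).all():
        raise ValueError("w1 must contain only 0 and 1")
    w2 = 1 - w1                  # assumption: each dimension goes to exactly one view
    return X * w1, X * w2


X = np.arange(12.0).reshape(3, 4)     # three samples with four feature dimensions
w1 = np.array([1, 0, 1, 0])           # hand-picked mask standing in for learned weights
X1, X2 = split_views(X, w1)
```

Because the masks are complementary, the two views never share a nonzero dimension and together recover the original features.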
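A metric built from the zero-mean Gaussian structure of pairwise differences can be sketched along the lines of KISSME [24]. The sketch below assumes NumPy, synthetic difference vectors, and a small ridge term added for numerical stability; the ridge and the function name `kissme_metric` are assumptions of this example, and the code is not the authors' implementation.

```python
import numpy as np


def kissme_metric(diff_similar, diff_dissimilar, ridge=1e-6):
    """Estimate M = inv(cov_similar) - inv(cov_dissimilar), clipped to be PSD."""
    diff_similar = np.asarray(diff_similar, dtype=float)
    diff_dissimilar = np.asarray(diff_dissimilar, dtype=float)
    dim = diff_similar.shape[1]
    # Zero-mean assumption: covariances are second moments of the differences.
    cov_sim = diff_similar.T @ diff_similar / len(diff_similar)
    cov_dis = diff_dissimilar.T @ diff_dissimilar / len(diff_dissimilar)
    reg = ridge * np.eye(dim)    # assumed regularizer, keeps the inverses stable
    M = np.linalg.inv(cov_sim + reg) - np.linalg.inv(cov_dis + reg)
    # Clip negative eigenvalues so that M is a valid (pseudo-)metric matrix.
    eigval, eigvec = np.linalg.eigh((M + M.T) / 2.0)
    return (eigvec * np.clip(eigval, 0.0, None)) @ eigvec.T


rng = np.random.default_rng(0)
diff_similar = 0.1 * rng.normal(size=(200, 4))   # intrapersonal differences: small
diff_dissimilar = rng.normal(size=(200, 4))      # interpersonal differences: large
M = kissme_metric(diff_similar, diff_dissimilar)

small = np.full(4, 0.1)
large = np.ones(4)
d_small = float(small @ M @ small)
d_large = float(large @ M @ large)
```

With a metric of this kind in each view, distances to all unlabelled samples can be sorted to produce the ranking lists from which pseudo-labels are drawn.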

Table 1: CMC values (%) at top ranks on the VIPeR dataset. Best results are shown in bold text.

Table 2: CMC values (%) at top ranks on PRID2011. Best results are shown in bold text.

Table 3: CMC values (%) at top ranks on PRID450s. Best results are shown in bold text.