Impostor Resilient Multimodal Metric Learning for Person Reidentification

In person reidentification, distance metric learning faces a great challenge from impostors. Typically, a distance metric is learned by maximizing the similarity of a positive pair against impostors that lie on different transform modals. Moreover, these impostors are obtained from the Gallery view for the query sample only, while the Gallery sample is ignored entirely. In the real world, a given query and Gallery pair experiences different changes in pose, viewpoint, and lighting; hence, impostors drawn only from the Gallery view cannot optimally maximize the pair's similarity. To resolve these issues we propose an impostor resilient multimodal metric (IRM3). IRM3 is learned for each modal transform in the image space and uses impostors from both the Probe and Gallery views to effectively restrict a large number of impostors. The learned IRM3 is evaluated on three benchmark datasets, VIPeR, CUHK01, and CUHK03, and shows significant improvement in performance over many previous approaches.


Introduction
Person reidentification (Re-ID) matches a given person across a large network of nonoverlapping cameras [1] and is fundamental to person tracking in camera networks. Despite years of research, reidentification remains a challenging problem: the data space in Re-ID is multimodal (a modal in our work is the space formed by the joint combination of the different changes that a given pair of images of the same person undergoes across camera views), and the observed images in different views undergo various changes in pose [2], viewpoint [3], lighting [4], and background clutter, and also experience occlusion.
Most approaches in Re-ID fall into two categories: robust feature extraction [5-13] for representation and globally learned distance metrics for matching [14,15]. These global metrics [16-19] project features into a low-dimensional subspace where they tend to maximize the discrimination among different persons; however, they still face a great challenge from impostor samples [20,21] (an impostor is a sample that belongs to a different person and yet has higher similarity with the given query than the correct Gallery sample). Although some attempts have been made in the past to eliminate impostors [14,20-22], none of them gives due consideration to the different transform modals on which the reidentification images lie [23]. This situation is illustrated in Figure 1, which shows three transform modals M_1, M_2, and M_3 in the image space. M_1 contains a positive pair (query and Gallery), enclosed in green rectangles, for which a metric is learned, while two more pairs lie in modals M_2 and M_3, respectively. The view-b images (enclosed in red rectangles) in M_2 and M_3 are similar to the query in M_1 and are therefore impostors for the query sample. In conventional approaches [14,20-22], the metric between the query and Gallery samples in M_1 is learned using the impostor sample from M_2 (Metric 1) or from M_3 (Metric 2) as a constraint. When the similarity of a positive pair is learned under the constraint of an impostor lying on a different transform modal than the positive pair, the learned similarity metric is not the optimal matching function, as the poor retrieval results in Ranklist 1 and Ranklist 2 in Figure 1 demonstrate.

To overcome this, we propose a metric, referred to as IRM3, which largely eliminates impostors and attains optimal matching of a positive pair. The objective of IRM3 is to maximize the matching of a positive pair against both the negative gallery samples (NGS) (samples which are not impostors and belong to different persons) and the impostors, by taking into account the modal on which a given pair, its negative gallery samples, and its impostors reside. Further, in contrast to [14,20-22], IRM3 also considers impostor samples for both the query and its respective Gallery sample. These impostors are referred to as Cross views impostors (CVI); they are obtained for the query and Gallery samples from their opposite views and help further maximize the similarity between the given query and Gallery samples. The contributions of our impostor resilient multimodal metric IRM3 are as follows:

(i) improving impostor resistance by jointly exploiting the transform modals [23] as well as impostor samples from both the Probe and Gallery views; (ii) obtaining, with our IRM3 approach, a significant gain in performance over Multikernel Local Fisher Discriminant Analysis (MK-LFDA) [44].

Methodology
Figure 2 shows the framework of our IRM3. First, color and texture features are extracted from each training sample; then, different modals are discovered in the image space using sum of squares clustering, as explained in Section 2.2. Finally, for each transform modal c, Cross views impostors (CVI) (Section 2.3) and negative gallery samples (NGS) (Section 2.4) are generated to train the modal metric M_c. In our work the modal metric M_c is learned using MK-LFDA [44]; the learning procedure is explained in Section 2.6. Section 2.7 explains how matching is performed between a test query and the Gallery.
2.1. Feature Extraction. RGB, HSV, LAB, YCbCr, and SCNCD histograms are extracted using 32 bins per channel, following the settings in [45] and [12], respectively; all five features are then concatenated together. Similarly, DenseSIFT, SILTP, and HOG are extracted according to the settings in [46], [11], and [47], respectively, and concatenated together. The dimensions of the concatenated color and texture features are large, and since Re-ID data is multiview we use CCA [48] to reduce them. To keep the local discriminative information of each type of feature, we apply CCA to the color and texture features individually. By cross validation on VIPeR and CUHK03 we obtained an optimal dimension of 900 for the color feature and 700 for the texture feature. Finally, the reduced color and texture features are concatenated to form a feature vector x of size 1600.

Partition Image Space.
Let X be the image space of a camera view; then X = {x_i}, i = 1, ..., N, where x_i is the feature representation of person i and N is the number of persons in X. Since the images in X lie on different transform modals, there exist distinct clusters of different modals in X. Each of these modal clusters has its own unique transformation and visual patterns; thus, all the persons belonging to a modal c can be obtained by sum of squares clustering, i.e., by minimizing the within-modal scatter

arg min tr(S_W), S_W = sum_{c=1..K} sum_{i=1..N} a_ic (x_i - m_c)(x_i - m_c)^T, (2)

where K is the number of modals in X, S_W is the scatter matrix within transform modals, a_ic is the association of x_i with transform modal c, and m_c is the center of the c-th transform modal.
In (2), each modal center m_c is critical in discovering distinct, stable, and nonempty modals in X. Thus, when choosing any sample pair (x_i^P, x_i^G) as the center m_c of a given modal c, it is necessary to verify that it is the right choice. A chosen modal center must fulfill two conditions: (i) if the chosen sample pair is the center of modal cluster c, then all the persons in modal c are its neighbors, and it has the highest number of nearest neighbors; (ii) the center m_c and all its nearest neighbors lie on the same modal, so these neighbors share similar patterns with the center in both the Probe and Gallery views. We therefore compute the number of nearest neighbors for each person in the training set under these two conditions. For this purpose, we use both the Probe (x_i^P) and Gallery (x_i^G) samples of each person to obtain four lists of neighbors, computed from both camera views. To acquire the most reliable neighbors we select only the top@40 neighbors from each list (top@20 for VIPeR); top@40 maintains maximum reliability at minimum time and memory cost on large datasets. For instance, with K = 16 modals in CUHK03, each modal has at least 78 training persons; a center sample of a modal must have at least 51% of its neighbors in that modal, and top@40 corresponds to about 52% of the training persons in a modal. We then perform an intersection over the four lists to obtain the cardinality, as well as the IDs, of the neighbors common to both the Probe and Gallery views of a given person. This cardinality value and the neighbor IDs are stored in a matrix; the procedure is repeated for the remaining N - 1 persons in the training set, and their cardinality values and neighbor IDs are stored in the same matrix.
Using this matrix we now obtain the K initial centers for the K modal transforms, chosen as the K top persons with the highest number of common neighbors. However, two or more persons may have the same cardinality value and share the same nearest neighbor IDs. In that case, simply choosing the top K persons is not the best solution; instead, we choose only those top K persons that have no person IDs in common in their neighbor lists. When more than two persons have the same cardinality and share the same neighbor IDs, we randomly choose one of them to represent that modal center. Finally, given the K modal centers, the optimal partitioning of the image space X is obtained by minimizing the trace of the within-transform-modal scatter matrix, arg min tr(S_W). Although the image space is thus partitioned into K modals, to ensure the obtained modals are distinct and stable (in our work a modal is stable when it contains at least 15% of the training persons), we update the modal centers and repartition the space for a further t = 3 iterations, each center being updated as the mean of the n_c persons assigned to its modal c. Computing the initial modal centers is tedious but carries only a moderate computational burden: for a training set of N persons the complexity is about O(t x K x N), where t is the number of iterations and K is the number of modals.
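The center-selection procedure above (four neighbor lists, intersection, overlap pruning) can be sketched as follows; the Euclidean stand-in distance, the fallback rule, and all names are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def choose_modal_centers(probe, gallery, k_modals, top_k):
    """Pick modal centers as the persons whose top-k neighbour lists,
    computed from both camera views, share the most common IDs
    (a sketch of Section 2.2)."""
    n = len(probe)

    def topk(a, b):
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)          # a person is not its own neighbour
        return np.argsort(d, axis=1)[:, :top_k]

    # four neighbour lists per person: P->P, P->G, G->P, G->G
    lists = [topk(probe, probe), topk(probe, gallery),
             topk(gallery, probe), topk(gallery, gallery)]
    common = [set.intersection(*(set(l[i]) for l in lists))
              for i in range(n)]
    order = sorted(range(n), key=lambda i: -len(common[i]))

    centers, used = [], set()
    for i in order:                          # prefer non-overlapping neighbourhoods
        if len(centers) == k_modals:
            break
        if i in used or (used & common[i]):
            continue
        centers.append(i)
        used |= common[i] | {i}
    for i in order:                          # fallback if pruning was too strict
        if len(centers) == k_modals:
            break
        if i not in centers:
            centers.append(i)
    return centers

# toy usage: Gallery is a noisy copy of the Probe view
rng = np.random.default_rng(0)
probe = rng.random((30, 5))
gallery = probe + 0.01 * rng.random((30, 5))
centers = choose_modal_centers(probe, gallery, k_modals=3, top_k=10)
```

The chosen centers would then seed the sum-of-squares partitioning of (2).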

Cross Views Impostors (CVI).
After obtaining the distinct modals in the image space X, we can obtain the set of CVI for each positive pair (x_i^P, x_i^G) lying in modal c from both its Probe and Gallery views. We believe that in real world (open set) situations, where a positive pair always has limited or few samples, these CVI deliver subtle, differentiating information to metric learning that distinguishes a given pair more efficiently against a large number of diverse real world impostors, as well as negative gallery samples. The impostors are obtained by comparing the similarity value of a given pair against the other persons in the Gallery and Probe views. First, the similarity values of a Probe sample x_i^P with the whole Gallery view are computed using an initial metric W_ini and the CCA-reduced features, where W_ini is a global metric learned with K-LFDA [45] (we use a linear kernel to save memory and computational time). Similarly, the similarity values of the Gallery sample x_i^G are computed with the whole Probe view. Each of these similarity values is then compared with the reference similarity value S_ref(x_i^P, x_i^G) of the given pair: any person j (j != i) from the opposite view whose similarity exceeds S_ref is an impostor, and together these persons form the CVI set Set_CVI(x_i^P, x_i^G). The set of NGS, Set_Ng(x_i^P, x_i^G), for all persons in modal c is then obtained from the remaining (non-impostor) negatives, as described in Section 2.4.
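A hedged sketch of the CVI selection follows; treating the learned metric W_ini as a bilinear similarity, and all names, are our assumptions, since the original equations (6)-(10) are not reproduced here:

```python
import numpy as np

def cross_view_impostors(W, probe, gallery, i, n_imp=5):
    """Impostors for the pair (probe[i], gallery[i]) are the persons in
    the OPPOSITE view that score higher, under the metric W, than the
    pair's own reference similarity (a sketch of Section 2.3)."""
    ref = probe[i] @ W @ gallery[i]          # reference similarity of the pair
    s_pg = gallery @ W @ probe[i]            # probe i vs whole Gallery view
    s_gp = probe @ W @ gallery[i]            # gallery i vs whole Probe view
    imp_from_gallery = [j for j in np.argsort(-s_pg)
                        if j != i and s_pg[j] > ref][:n_imp]
    imp_from_probe = [j for j in np.argsort(-s_gp)
                      if j != i and s_gp[j] > ref][:n_imp]
    return imp_from_gallery, imp_from_probe

# toy check: with W = I, gallery person 1 scores higher against probe 0
# than probe 0's true gallery match does, so it is a cross-view impostor
W = np.eye(2)
probe = np.array([[1.0, 0.0], [0.9, 0.1]])
gallery = np.array([[0.5, 0.0], [1.0, 0.0]])
imp_g, imp_p = cross_view_impostors(W, probe, gallery, i=0)
```

Note that impostors are gathered from both views, unlike the Gallery-only impostors of conventional metrics.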

Triplet Formation.
Given the CVI set Set_CVI(x_i^P, x_i^G) and the NGS set Set_Ng(x_i^P, x_i^G) for all persons in modal c, we now generate triplet samples to learn the metric M_c. Since the positive samples for each person are too scarce compared to the number of negative samples, we follow the data augmentation protocol of [49] and augment each person pair five times. Similarly, following the protocol in [39], we generate 20 triplets for each positive pair. The triplet samples T_imp and T_Ng for a person are formed using an impostor or a negative Gallery sample, respectively, drawn from Set_CVI and Set_Ng.
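The triplet generation can be sketched as follows (the sampling details beyond the 20-triplets-per-pair rule, and all names, are our assumptions):

```python
import random

def make_triplets(positives, impostors, negatives, n_triplets=20, n_imp=15):
    """Build (anchor, positive, negative) triplets for one person:
    `positives` holds the (augmented) samples of that person, and
    negatives are drawn from the first n_imp CVI samples plus the NGS
    (a sketch of Section 2.5)."""
    hard_pool = list(impostors[:n_imp]) + list(negatives)
    triplets = []
    for _ in range(n_triplets):
        anchor = random.choice(positives)
        pos = random.choice(positives)
        neg = random.choice(hard_pool)
        triplets.append((anchor, pos, neg))
    return triplets

# toy usage with placeholder sample labels
trips = make_triplets(['p', 'p*'], ['imp1', 'imp2', 'imp3'], ['ng1', 'ng2'],
                      n_triplets=20, n_imp=2)
```

Varying n_imp between 5, 10, and 15 reproduces the GVI/CVI settings evaluated in the experiments.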

2.6. Metric Learning. Using the triplets T_imp and T_Ng, the metric IRM3 for modal c is learned using MK-LFDA [44]; however, to save computational time and memory we adapt [44] and use three RBF kernels and one chi-squared kernel. The kernel weights are learned globally, once per dataset, using the method in [44]. Learning the weights globally saves both time and computation, and it has only a minor effect on the weights, because the global space comprises all the existing modals and thus all modals contribute to learning the global weights. For learning the kernel weights, all extracted features are used individually, each reduced to 450 dimensions by CCA beforehand. In all our experiments the obtained weights for VIPeR are 0.3, 0.22, and 0.22 for the RBF kernels and 0.26 for the chi-squared kernel; for CUHK01 and CUHK03 the RBF kernel weights are 0.28, 0.24, and 0.24, and the chi-squared kernel weight is 0.24. The sigma values of the three RBF kernels are set, for every dataset, to the mean value of modal c, (mean + mean/2), and (mean - mean/2); these values are chosen to model all the different variations in modal c. The sigma value of the chi-squared kernel is also set to the mean value of modal c. The mean value in our work is the similarity value between the Probe and Gallery samples of the center m_c. Finally, the metric M_c is learned by maximizing the discriminant criterion in (13), where the matrices are obtained as in [44]; (13) is then solved as the generalized eigenvalue problem [50] in (14), keeping the first d = 300 eigenvectors corresponding to the eigenvalues of largest magnitude.

2.7. Reidentification. From Figure 2, reidentification of a test pair (x^P, x^G) is performed by first determining the transform modal the test pair belongs to using a K-NN classifier, where the parameter K is set to the number of modals in the image space (e.g., K = 7 for VIPeR). The features of the test pair are then projected into the weighted multikernel space of the respective modal, followed by the respective modal metric M_c, to perform matching.
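The eigen-solution step of (13)-(14) can be sketched with SciPy; the construction of the scatter matrices from the triplets and the multikernel space follows [44] and is omitted here, so the random SPD matrices and the function name are our assumptions:

```python
import numpy as np
from scipy.linalg import eigh

def modal_metric(S_b, S_w, dim=300):
    """The modal metric is the projection spanned by the leading
    eigenvectors of the generalized eigenproblem S_b v = lambda S_w v,
    keeping the `dim` eigenvalues of largest magnitude (the paper
    keeps d = 300)."""
    vals, vecs = eigh(S_b, S_w)              # symmetric-definite pair
    idx = np.argsort(-np.abs(vals))[:dim]    # largest-magnitude eigenvalues
    return vecs[:, idx]                      # columns span the metric subspace

# toy usage with random SPD stand-ins for the scatter matrices
rng = np.random.default_rng(1)
A, B = rng.random((10, 10)), rng.random((10, 10))
S_b = A @ A.T + np.eye(10)
S_w = B @ B.T + np.eye(10)
T = modal_metric(S_b, S_w, dim=3)
```

SciPy's `eigh(a, b)` returns eigenvectors that are orthonormal with respect to S_w, which is the normalization LFDA-style projections expect.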

Experiments
Our IRM3 metric is evaluated on three benchmark datasets: VIPeR, CUHK01, and CUHK03. We follow the evaluation protocol of [33] for the train/test splits of all three datasets. In our work, CUHK01 is tested for the 486-person split only, while CUHK03 is tested in both the Labelled and Detected settings. All experiments are conducted in single-shot mode, and all reported Cumulative Matching Characteristic (CMC) curves are obtained by averaging the results over 20 trials.

Experiment Protocols.
To thoroughly analyze the performance of IRM3 we devised three evaluation strategies. They evaluate IRM3 with different numbers of discovered modals K in X, with Gallery view impostors (GVI) (impostors from the Gallery view only, obtained as in previous conventional metrics [14,20-22]), and with Cross views impostors (CVI).
(i) IRM3 only: the basic multimodal metric, learned with only Negative Gallery Samples (NGS).
(ii) IRM3 + GVI (N_imp): IRM3 is learned with impostors from the Gallery view (GVI) as well as with NGS. Here N_imp is the number of impostors taken from the Gallery view to form triplet samples, with values N_imp = 5, 10, and 15; the remaining triplets are formed using NGS. (iii) IRM3 + CVI (N_imp): IRM3 is learned with CVI as well as with NGS. Here N_imp is the number of CVI samples used to form triplets, with values N_imp = 5, 10, and 15; the remaining triplets are formed using NGS. All the NGS, GVI, and CVI samples contain the most difficult instances for a person and are randomly sampled offline, before training the metric. In all three strategies, we partition the image space into K = 3, 5, and 7 modals for VIPeR, K = 6, 7, and 10 for CUHK01, and K = 13, 14, and 16 for CUHK03, respectively.

Results on VIPeR
Comparison with State-of-the-Art Features. Results of the IRM3 metric are compared with three state-of-the-art features, LOMO [11], GoG [25], and momLE [24], in Table 1. All results in Table 1 are obtained for K = 7 modals. Our IRM3 + CVI (N_imp = 15) attains rank@1 of 52.81% and outperforms all three reidentification features, providing evidence that if a metric addresses multimodal transform variations well and has strong resistance against impostors, then matching accuracy can be improved. Our learned IRM3 + CVI (N_imp = 15) optimizes all rank orders simultaneously and thus shows large improvements at rank@5 and rank@10.
Comparison with Metric Learning. We also compared IRM3 with 7 metrics. From Table 1, IRM3 + CVI (N_imp = 15) outperforms both the multimodal metric LAFT [23] and the impostor resistance metric LISTEN [21]. The prime difference between IRM3 and [21,23] is its capability to address both the person modal transform and the joint constraint of cross views impostors, which together pose a great challenge in matching pedestrians. In Table 1, only SS-SVM [16] attempts to model the transform modal for each individual person; however, it pays no attention to resisting impostors and thus has 19.21% lower rank@1 accuracy than IRM3 + CVI (N_imp = 15). Though IRM3 is successful, it still has 1.36% lower rank@1 than SCSP [38]: VIPeR has large pose, misalignment, and body part displacement issues that are not specifically addressed in our work and would need to be handled to improve matching further.
Comparison with Deep Methods. Though deep features (DF) and deep matching networks (DMN) are not directly comparable with conventional metric learning methods, the results in Table 1 make it clearly evident that if two major issues of reidentification (multimodal transforms and strong rejection of impostors) are handled well simultaneously, then comparable or even higher performance than deep methods can be attained. Our IRM3 + CVI (N_imp = 15) has 7.1% and 4.94% higher rank@1 than Quadruplet-Net [33] and JLML [34], respectively, demonstrating that for a smaller dataset like VIPeR deep matching networks have insufficient training samples to learn a discriminative network. Finally, Figure 3 compares the retrieval results of two queries from the VIPeR dataset for XQDA [11] and our IRM3 + CVI (N_imp = 15) with K = 7 modals. For Query 1, XQDA finds the correct match at rank 4 (green rectangle, (b)), while IRM3 finds it at rank 2 (green rectangle, (e)). Similarly, for Query 2 our IRM3 finds the match at rank 1 (green rectangle, (j)), whereas XQDA finds it at rank 3 (green rectangle, (h)). Thus, our IRM3 approach improves matching, and consequently the rank gets higher.

Results on CUHK01
Comparison with State-of-the-Art Features. Table 2 summarizes the results of IRM3 for K = 10 modals and compares them with LOMO [11], GoG [25], and momLE [24]. Though these three features are discriminative, our IRM3 approach is better at solving the two big challenges of Re-ID: multimodal pedestrian matching and impostor resistance. Since CUHK01 has a larger training set than VIPeR, the modal transforms can be learned well, and IRM3 + CVI (N_imp = 15) therefore attains larger discrimination than momLE [24]. Our IRM3 + CVI (N_imp = 15) has 15.15% higher rank@1 accuracy than momLE, owing to its inherent ability to handle different modals and person-specific variations while rejecting a large number of impostors, all simultaneously.

Results on CUHK03
Comparison with State-of-the-Art Features. Table 3 compares the LOMO [11] and GoG [25] features with our IRM3 metric in both the Labelled and Detected settings. All results in Table 3 are obtained for K = 16 modals and are much higher than those of the two features. The primary reason for the gain over [11,25] is the difference in approach: [11,25] propose a universal feature representation for all persons, which may not be optimal for all persons residing on different modals at the same time; in contrast, our motivation is to discover distinct modals in the image space and then address each modal specifically, empowered by the rejection of a large number of impostors. Therefore, our IRM3 + CVI (N_imp = 15) (Labelled setting) attains a rank@1 accuracy of about 86.17%.
Comparison with Metric Learning. In Table 3, the recently proposed WARCA [36] and SSM [43] are compared with our IRM3 approach. WARCA [36] differs from IRM3 in that it only addresses hard negative samples, while SSM [43] has no mechanism to account for different modal transforms and no resistance against impostors. Our IRM3 + CVI (N_imp = 15) (Labelled setting) surpasses [36] and [43], with rises of 9.04% and 11.1% in rank@1 accuracy, respectively.
Comparison with Deep Methods. Interestingly, in Table 3 all deep methods have very high performance on CUHK03 in both the Labelled and Detected settings, demonstrating that CUHK03, the largest dataset of the three, helps in learning a more discriminative DMN. Even though both JLML [34] and DLPA [32] learn deep body features with global and local body part alignment as well as pose alignment, our IRM3 approach, benefitting from transform-specific metrics empowered with impostor rejection, still manages to attain better results. IRM3 optimizes all rank orders simultaneously and thus has large gains at rank@5 and rank@10 in the Labelled setting.

Analysis.
In Table 4 we analyze the effect of the number of modals K in testing on VIPeR. Initially we partition the image space into K = 5 modals and test without any impostor samples (K = 5, N_imp = 0), obtaining rank@1 of about 45.27%. As more modals are discovered in the image space, e.g., K = 7, the results improve further even without any impostor samples (K = 7, N_imp = 0), and rank@1 becomes 45.92%. The main reason for this increment is that more test samples can now be matched correctly using their actual modal transforms, which were lost when fewer modals (K = 5) were discovered.
In addition, the results improve further when impostors from the Gallery view are added to metric learning. Both (K = 5, GVI (N_imp = 15)) and (K = 7, GVI (N_imp = 15)) attain higher differentiating capability than [14,20-22], as they can now restrict impostors while taking into account the transform modals that a positive pair and its impostors undergo.
Interestingly, this impostor resistance can be enhanced further by using Cross views impostors (CVI). From Table 4 it is clear that even for the same number of modals, K = 7, the differentiating capability of (K = 7, CVI (N_imp = 15)) exceeds that of (K = 7, GVI (N_imp = 15)), and rank@1 becomes 52.81%. This increment provides strong evidence that CVI can maximize the similarity of a positive pair more than GVI by taking into account both the transform modal and the various changes a given query and Gallery sample undergo in different views.
Finally, Figure 4 compares rank@1 performance when the modal centers are chosen randomly and when they are obtained with our method in Section 2.2. The rank@1 accuracy for random centers is poor, because random centers are obtained by simply choosing the top-K persons without taking into account their reliability, stability, and IDs.
3.6. Efficiency. We computed the run time of our IRM3 approach using MK-LFDA [44], XQDA [11], and K-LFDA [45] (with chi-squared kernel) on CUHK03, with 1260 training persons and 100 testing identities. All algorithms are implemented in MATLAB and run on a server with 6 Xeon E5-2620 CPUs (6 cores each) and a total memory of 256 GB. In Table 5, the training time of MK-LFDA [44] is shorter than that of XQDA [11] but longer than that of K-LFDA [45]. However, in testing, when the kernel weights are not being learned, MK-LFDA [44] is faster than both XQDA and K-LFDA. These timing results support the applicability of our proposed method to real-time applications in public spaces.

Conclusion
This paper presents a metric learning approach that exploits both multimodal transforms and Cross views impostors to improve the metric's capability to differentiate among different persons and to reject a large number of diverse real world impostors. In the real world, pedestrian images are mostly multimodal, and in public

Figure 1 :
Figure 1: Three modals M_1, M_2, and M_3 in the image space. The query and Gallery lie in modal M_1, while one impostor for the query lies in modal M_2 and the other in modal M_3. Metric 1 is learned using the impostor from modal M_2, and Metric 2 using the impostor from modal M_3. The retrieval results of the two metrics are shown in Ranklist 1 and Ranklist 2, respectively. The correct match is in a green rectangle.

Figure 3 :
Figure 3: Two queries, Query 1 and Query 2, and their retrieval results using XQDA [11] and our IRM3. The correct match is shown in a green rectangle, while blue rectangles show impostors.

Figure 4 :
Figure 4: Performance at rank@1 when the centers m_c are selected randomly and when they are selected with our approach in Section 2.2.
(Continuation of Section 2.3.) Here n_c refers to the number of persons in modal c. Each similarity value in these sets is compared with the reference similarity value S_ref(x_i^P, x_i^G) of the given pair (x_i^P, x_i^G) to obtain its CVI set Set_CVI. The obtained similarity values S_probe and S_gallery for a person (x_i^P, x_i^G) in modal c are stored in two sets, Sim_P = [S_probe,j] and Sim_G = [S_gallery,j], j = 1, ..., n_c (8), from which the CVI sets for all persons in modal c are computed. The computational cost of generating cross views impostors for a modal c is about O(3 x n_c), where n_c << N.

2.4. Negative Gallery Samples (NGS). We have also used negative gallery samples (NGS), denoted as Set_Ng, to learn the metric M_c.

Table 1 :
Top matching comparison on VIPeR.

Table 2 :
Top matching comparison on CUHK01.

Table 3 :
Top matching comparison on CUHK03.

Table 5 :
Run time comparison on CUHK03 (in seconds).