IIPA-Net: Joint Illumination-Invariant and Pose-Aligned Feature Learning for Person Reidentification

Person reidentification (re-id) has made significant progress and attracted great interest in computer vision. However, due to the effects of weak illumination and poor alignment, person re-id remains a challenging task. Many previous works focus on either illumination enhancement or pose estimation. However, those methods are difficult to apply in real-world scenarios, which usually contain various interference factors. To improve re-id performance, we propose an Illumination-Invariant and Pose-Aligned Network (IIPA-Net). Illumination change is handled by a retinex decomposition network, and the pose variation problem is addressed by a local feature matching method. Based on the multimodal nature of a person, we propose a part attention module to optimize the global feature. Finally, a data-driven training strategy is proposed to train the architecture effectively. Experiments show that the proposed framework outperforms other state-of-the-art approaches on both normal- and low-light datasets.


Introduction
Person reidentification (re-id) aims to identify a specific person (the probe or query image) in a gallery of candidate images captured by multiple cameras with overlapping or nonoverlapping fields of view. The increasing need for safety and security, combined with the growing availability of surveillance cameras, makes person reidentification an increasingly explored area [1]. However, it is very challenging, since images of the person of interest captured by surveillance cameras usually vary significantly in viewpoint, illumination, human pose, and so on [2]. Low resolution, partial occlusions, and blurring further increase the difficulty of person re-id [3].
Since person images are captured by different cameras under unknown lighting conditions, the appearance of the same person varies widely, making the re-id task extremely difficult. To eliminate the effect of illumination, many methods rely on the statistics of the color distribution and project the image into a color-constant space [4]. However, prior information about lighting is unpredictable in real-world scenarios. An alternative solution is to simulate real-world illumination and use data augmentation techniques, which is expensive and needs a lot of labeled data [5]. Pose misalignment, which is caused by changes of viewpoint or inaccurate detection boxes, is another source of interference for a person re-id framework [6]. A straightforward solution to this pose variation is to apply human pose estimation, which parses a person image into different semantic parts. However, pose estimation requires massive labeled data to train the model [7]. Moreover, re-id accuracy degrades substantially when the estimation is inaccurate. Figure 1 shows some examples of illumination change and pose misalignment.
Convolutional neural networks (CNNs), which have powerful representation and invariant embedding capabilities, have boosted the performance of person re-id [8]. CNN-based person re-id methods can be divided into two categories: discriminative feature representation learning and deep metric learning [9]. In the first category, most methods concentrate on extracting discriminative features and then formulate person re-id as a classification problem [10]. In the second category, a robust metric between positive (the same) and negative (different) persons is learned to deal with the matching problem [11]. In this paper, we focus on extracting a discriminative feature representation. To this end, we propose a joint CNN framework that couples global and local feature learning to suppress interference, especially illumination and pose variations. Firstly, motivated by deep retinex illumination decomposition [12], we adopt a lightweight estimation to eliminate the effect of illumination and enhance the global person feature. Secondly, inspired by AlignedReID++ [13], which aligns local information to learn more discriminative features, we introduce local feature matching to align different parts of the person image, which addresses the pose variation problem. We find that the illumination-invariant feature can guide the local feature matching to align different parts of the person image. Thirdly, since the detected person has two significant modes [14], we concatenate the low-level feature of the CNN and a two-peak Gaussian map to design an attention mechanism. Consequently, the proposed IIPA-Net can boost re-id performance on both normal- and low-light datasets.
In summary, the contributions of this paper are threefold:
(i) We build a novel network framework, which contains a retinex decomposition net and a weight-shared Resnet50 backbone CNN and achieves illumination-invariant and pose-aligned re-id.
(ii) We propose a part attention module to reweight the CNN output and extract the most informative parts of a person.
(iii) A data-driven training strategy is introduced to train the network effectively and speed up the training process.

Related Work
The main challenges of reidentification are changes in illumination, viewpoint, and pose across cameras. Many works focus on extracting the most discriminative visual features of a person, including color [14], texture [15], and shape [16]. Kviatkovsky et al. [14] use shape context descriptors as a color-based signature to represent a person, which is divided into two significant modes. However, they assume that the silhouette of a person can always be obtained, which is not the case in real-world applications. Deep learning has revolutionized the techniques for person reidentification [17]. Li et al. [18] successfully apply deep learning to extract features for person reidentification. Xiao et al. [19] propose a deep learning framework that jointly handles person detection and reidentification in a single convolutional neural network. Wu et al. [20] improve the discriminative feature representation of CNNs by exploiting unlabeled tracklets. The major limitation of these frameworks is that they either rely on handcrafted features or employ single-scene images, making them less robust to varying lighting conditions and changes in human pose. Retinex theory is widely used for illumination estimation [21]. Many retinex-based re-id algorithms have achieved competitive performance [22,23]. Specifically, Liao et al. utilize the retinex transform and a scale-invariant texture operator to handle illumination variations [23]. Huang et al. propose a retinex decomposition network to address the illumination variation problem and achieve competitive re-id performance in low-light conditions [22]. In [24], a new synthetic dataset, which contains hundreds of illumination conditions, is introduced to simulate real-world lighting. The above methods reduce the adverse effects of illumination variation. However, they ignore the matching of local features and fail to learn aligned information, which could effectively eliminate the influence of pose variation.
Journal of Sensors

To reduce the negative impact of pose variation, some works apply human pose estimation to extract pixel-level body regions [8,25]. Zheng et al. adopt the pose estimation confidence of the input image to build a pose-invariant embedding (PIE) descriptor [8]. In [25], Zhao et al. represent a person with a discriminative feature, which is learned from different semantic regions of the person. On the other hand, some works focus on utilizing horizontal stripes or grids to extract pose-invariant features [13,26]. Sun et al. design a Part-based Convolutional Baseline (PCB) network to learn discriminative part-level features [26]. Using dynamic programming to match horizontal stripes of person images, Luo et al. propose a deep model to address the misalignment issue [13]. Additionally, Miao et al. propose an occluded person re-id framework that incorporates pose information [27]. Despite the great progress in re-id performance, the above methods could still be improved by integrating the advantages of different architectures.
Different from existing frameworks, we focus on addressing issues of illumination and pose change simultaneously. Then, we propose a novel framework that is able to learn illumination invariance and pose alignment in a multitask manner.

Methodology
In this section, we first describe the retinex decomposition net and the part attention module. Then, the details of the proposed structure and training strategy are introduced.

Retinex Decomposition Net.
To simulate human color perception, retinex theory decomposes the observed image into two components: reflectance and illumination [21]. Mathematically, the source image S can be denoted as follows:

$$S = R \circ I, \tag{1}$$

where R and I represent the reflectance and illumination components, respectively, and $\circ$ represents element-wise multiplication. The reflectance map describes the intrinsic properties of the person and is invariant to light change. Thus, it is effective to extract illumination-invariant discriminative features from the reflectance map. The illumination map, which represents various light environments, is harmful to re-id performance and is ignored in this paper.
Unlike deep retinex net [12], which performs both reflectance and illumination decomposition to enhance low-light images, we use the retinex decomposition net only to extract the consistent feature of a person. As shown in Figure 2, the retinex decomposition net includes 8 layers. The first layer is a 3 × 3 convolutional layer, which extracts convolutional features from the input image. The second to sixth layers are 3 × 3 convolutional layers with a ReLU activation function. The seventh layer is a 3 × 3 convolutional layer which maps R and I from the feature space. The last layer is a sigmoid function that normalizes R and I to [0, 1].
To extract R from images of different lightness, the decomposition network is fed paired normal/low-light images each time. During the training stage, the paired images themselves, instead of corresponding ground truth, are used to train the retinex decomposition net. Nevertheless, the trained network can predict R and I in the test stage.
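To make the paired training concrete, the following is a minimal sketch of a loss computed on one normal/low-light pair, with a reconstruction term and a reflectance-consistency term of the kind this subsection defines. Several details are assumptions rather than the authors' implementation: the low-light image is synthesized by gamma correction as the experiments describe, images are flat lists of values in [0, 1], the norm is a mean absolute difference, the cross-term form follows the deep retinex decomposition idea of [12], and the names `gamma_darken`, `mean_abs_diff`, and `retinex_loss` are hypothetical.

```python
def gamma_darken(image, gamma):
    """Simulate low light on pixel values in [0, 1] via gamma correction."""
    return [p ** gamma for p in image]


def mean_abs_diff(a, b):
    """Mean absolute difference (an assumed choice of norm)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def retinex_loss(images, reflect, illum, lam_ir=0.001, lam_cross=0.001):
    """Reconstruction plus reflectance-consistency loss for one image pair.

    images, reflect, illum: dicts keyed by "low" and "normal", each holding
    a flat list of pixel values in [0, 1].
    """
    recon = 0.0
    for i in ("low", "normal"):        # which reflectance map is used
        for j in ("low", "normal"):    # which illumination map / target image
            weight = 1.0 if i == j else lam_cross
            product = [r * lum for r, lum in zip(reflect[i], illum[j])]
            recon += weight * mean_abs_diff(product, images[j])
    consistency = mean_abs_diff(reflect["low"], reflect["normal"])
    return recon + lam_ir * consistency


# Hypothetical three-pixel image and its darkened counterpart (gamma = 2).
normal = [0.2, 0.5, 0.9]
low = gamma_darken(normal, 2)
images = {"low": low, "normal": normal}

# Ideal decomposition: one shared reflectance, lighting explains the rest.
shared = {"low": normal, "normal": normal}
ideal_illum = {"low": [d / n for d, n in zip(low, normal)], "normal": [1.0] * 3}
ideal = retinex_loss(images, shared, ideal_illum)

# Degenerate decomposition: each image is treated as its own reflectance.
own_reflect = {"low": low, "normal": normal}
flat_illum = {"low": [1.0] * 3, "normal": [1.0] * 3}
degenerate = retinex_loss(images, own_reflect, flat_illum)
```

In this toy pair, the decomposition that shares one reflectance across both lighting conditions scores lower than the one that ignores illumination, which is the behavior the paired training relies on.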
The loss $L_R$ for the retinex decomposition net consists of a reconstruction loss $L_{recon}$ and an invariable reflectance loss $L_{ir}$:

$$L_R = L_{recon} + \lambda_{ir} L_{ir}, \tag{2}$$

where $\lambda_{ir}$ is used to balance the consistency of reflectance. The reconstruction loss $L_{recon}$ is defined as

$$L_{recon} = \sum_{i \in \{low,\, normal\}} \sum_{j \in \{low,\, normal\}} \lambda_{ij} \left\| R_i \circ I_j - S_j \right\|_1, \tag{3}$$

where $S_{low}$ and $S_{normal}$ denote the input low-light and normal-light images, respectively; $R_{low}$ and $I_{low}$ denote the reflectance and illumination of $S_{low}$, and $R_{normal}$ and $I_{normal}$ those of $S_{normal}$. The invariant reflectance loss $L_{ir}$ is defined as

$$L_{ir} = \left\| R_{low} - R_{normal} \right\|_1. \tag{4}$$

Part Attention Module.
In order to extract discriminative features, many re-id methods introduce an attention mechanism to highlight the informative parts of person images while suppressing cluttered background [9,28]. The goal of the attention mechanism is to produce a saliency map that reweights the CNN output. Given a 3-D feature map $X \in \mathbb{R}^{C \times H \times W}$, where C, H, and W indicate the sizes of the channel, height, and width dimensions, respectively, the reweighting process can be formulated as

$$Y = A(X) \circ X, \tag{5}$$

where Y is the reweighted map and A(X) is the output of the attention module. When a state-of-the-art detector is used, it is intuitive to assume that the detected persons lie in the middle of the images. In real-world scenarios, a person usually wears different clothing on the lower and upper parts of the body. Based on this multimodal nature, we introduce a two-peak Gaussian map $M_f$ (Equation (6)) to deal with the intradistribution of person appearance; its peak centers are $\mu_1 = [H/3, W/2]$ and $\mu_2 = [2 \times H/3, W/2]$. As shown in Figure 3, we concatenate $M_f$ and the 4th layer of Resnet-50. Subsequently, six 3 × 3 convolution layers are added to extract the discriminative feature. Finally, a softmax classifier is implemented with a Fully Connected (FC) layer.
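A two-peak map with the stated centers can be sketched as follows. Only the peak centers, $[H/3, W/2]$ and $[2H/3, W/2]$, come from the text; the isotropic spread `sigma`, the choice to sum two equally weighted Gaussians, and the normalization to a maximum of 1 are assumptions made for illustration, and `two_peak_gaussian_map` is a hypothetical name.

```python
import math


def two_peak_gaussian_map(height, width, sigma=None):
    """Return a height x width map with peaks at (H/3, W/2) and (2H/3, W/2)."""
    if sigma is None:
        # Assumed spread: the peaks sit 4 sigma apart, so they stay distinct.
        sigma = height / 12.0
    centers = [(height / 3.0, width / 2.0), (2.0 * height / 3.0, width / 2.0)]
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            value = sum(
                math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
                for cy, cx in centers
            )
            row.append(value)
        rows.append(row)
    peak = max(max(row) for row in rows)
    return [[v / peak for v in row] for row in rows]  # scale so the maximum is 1


# Small hypothetical grid: the peaks land on rows 8 and 16, column 6.
gauss_map = two_peak_gaussian_map(24, 12)
```

The spread matters: if `sigma` is as large as half the distance between the centers, the two bumps merge into a single flat-topped one and the map no longer separates the upper and lower body.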
IIPA-Net Architecture.
As shown in Figure 4, the proposed IIPA-Net can be divided into two parts: a global branch and a local branch.

For the first branch, the most discriminative image parts of a person are extracted by the part attention module. In the second branch, the person images are enhanced by preserving the reflectance map of the retinex decomposition net. The outputs of both branches are fed into weight-shared Resnet50 backbone CNNs, which makes the proposed model more flexible and easier to train. The output of Resnet50 is an $\mathbb{R}^{c \times h \times w}$ feature map, where c represents the number of feature channels and h × w is the spatial size. We extract a global discriminative feature vector $I \in \mathbb{R}^{c \times 1}$ using Global Average Pooling (GAP). Then, the global feature distance between images A and B is calculated from $I_A$ and $I_B$, the global features of A and B.
The global feature is able to learn holistic information from the person image. However, it fails to address the pose-misalignment issue because the local representation is still unexploited. To learn the pose-aligned local feature, the output feature map of Resnet50 is transformed into a feature of size c × h using horizontal average pooling.
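The two pooling steps can be illustrated on a tiny feature map. This is a plain-Python sketch using nested lists indexed as `feature_map[channel][row][column]`; the function names are hypothetical, and a real implementation would use the pooling layers of a deep learning framework.

```python
def global_average_pool(feature_map):
    """c x h x w map -> list of c channel means (the global feature)."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def horizontal_average_pool(feature_map):
    """c x h x w map -> h local features, one c-dimensional vector per row."""
    channels = len(feature_map)
    height = len(feature_map[0])
    return [
        [
            sum(feature_map[ch][row]) / len(feature_map[ch][row])
            for ch in range(channels)
        ]
        for row in range(height)
    ]


# Hypothetical map with c = 2 channels, h = 2 rows, w = 2 columns.
feature_map = [
    [[1, 3], [5, 7]],
    [[0, 2], [4, 6]],
]
global_feature = global_average_pool(feature_map)       # [4.0, 3.0]
local_features = horizontal_average_pool(feature_map)   # [[2.0, 1.0], [6.0, 5.0]]
```

Averaging over the width only keeps one descriptor per horizontal stripe, which is what allows stripes of two images to be matched against each other.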
Let $\{p_A^1, \cdots, p_A^h\}$ and $\{p_B^1, \cdots, p_B^h\}$ denote the local features of images A and B. The distance between the ith vertical part of A and the jth vertical part of B is denoted d(i, j), and these values form the elements of the distance matrix D. As described in [13], the local pose-aligned feature distance $d_p(A, B)$ can be derived by dynamically matching local information (DMLI), which dynamically aligns different part features. Finally, the total distance between A and B is obtained by combining the global feature distance with $d_p(A, B)$. The total loss function of the framework is

$$L = L_{ID} + L^{I}_{T} + L^{P}_{C}, \tag{10}$$

where $L_{ID}$ and $L^{I}_{T}$ denote the softmax loss and triplet loss [29] of the global feature and $L^{P}_{C}$ denotes the circle loss [30] of the local pose-aligned feature. The performance of different loss functions is described in Section 4.3.
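The matching step can be sketched as a shortest-path search over the distance matrix D. The details below are assumptions drawn from the general description of dynamic local matching in [13], not confirmed by this section: the global distance is Euclidean, each part distance is squashed into [0, 1) with (e^x - 1)/(e^x + 1), the path may only move right or down from the first to the last cell, and the global and local distances are added with a hypothetical `local_weight` of 1.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def local_distance_matrix(parts_a, parts_b):
    """d(i, j) for every pair of parts; tanh(x / 2) equals (e^x - 1)/(e^x + 1)."""
    return [[math.tanh(euclidean(p, q) / 2.0) for q in parts_b] for p in parts_a]


def shortest_path_distance(dist):
    """Cost of the cheapest right/down path from the first to the last cell."""
    rows, cols = len(dist), len(dist[0])
    cost = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                best_prev = 0.0
            elif i == 0:
                best_prev = cost[i][j - 1]
            elif j == 0:
                best_prev = cost[i - 1][j]
            else:
                best_prev = min(cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = best_prev + dist[i][j]
    return cost[-1][-1]


def total_distance(global_a, global_b, parts_a, parts_b, local_weight=1.0):
    """Global distance plus weighted aligned local distance (assumed form)."""
    d_global = euclidean(global_a, global_b)
    d_local = shortest_path_distance(local_distance_matrix(parts_a, parts_b))
    return d_global + local_weight * d_local


# Hypothetical 2 x 2 distance matrix: the cheapest path is 1 -> 2 -> 1.
example_cost = shortest_path_distance([[1, 2], [3, 1]])
```

Because the path is free to pair one stripe with several neighbors, a person who appears shifted up or down in one image can still be matched stripe by stripe, which is the alignment effect the text attributes to DMLI.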

Training the Network.
Since there is no explicit ground truth for training the part attention module and the retinex network, it is difficult to optimize the network for various scenes. Therefore, we train the network in a data-driven way. The whole network is trained in four stages, as illustrated in Algorithm 1.

Algorithm 1: Training steps of the proposed network.
1. The shared-weights Resnet-50 is trained to convergence with triplet loss.
2. All synthetic images, together with their original images, are fed into the retinex decomposition network.
3. The part attention module is trained using the training image set.
4. The whole network is fine-tuned with Equation (10).

Following [22], we use gamma correction to simulate low-light conditions. Each image in the datasets is processed with a gamma value randomly picked from {1, 2, 3, 4}. Figure 5 shows examples of synthetic low-light images. To evaluate the performance of different algorithms, we use Cumulative Matching Characteristic (CMC) curves and mean Average Precision (mAP) [32] as the evaluation criteria. CMC is defined as a function of Rank-r [35], where $|P_g|$ represents the total number of person images in the gallery and C(r) denotes the query set. mAP is calculated based on the Average Precision (AP) and defined as

$$mAP = \frac{1}{n} \sum_{k=1}^{n} AP(k),$$

where AP(k) represents the area under the precision-recall curve of the kth query and n represents the size of the query set.

The training patch size is set to 32; $\lambda_{ir}$ is set to 0.001. $\lambda_{ij}$ is set to 1 when i = j; otherwise, $\lambda_{ij}$ is 0.001. Each input image is resized to 256 × 128. Random horizontal flipping and cropping are performed to augment the data. We use the Adam optimizer with a learning rate of $10^{-4}$.
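The evaluation criteria described above can be sketched on ranked match lists. This follows the common re-id convention rather than the exact formulas of this paper, which are not recoverable here: AP is computed as the mean of the precision values at each correct match (a discrete stand-in for the area under the precision-recall curve), and the Rank-r score is the fraction of queries with at least one correct match among the top r results. The function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_matches):
    """ranked_matches: booleans for a gallery sorted by ascending distance."""
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this correct match
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_matches):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_ranked_matches) / len(all_ranked_matches)


def cmc_at_rank(all_ranked_matches, r):
    """Fraction of queries with a correct match within the top r results."""
    found = sum(1 for ranked in all_ranked_matches if any(ranked[:r]))
    return found / len(all_ranked_matches)


# Two hypothetical queries, each with a three-image ranked gallery.
rankings = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2
    [False, True, False],  # AP = 1/2
]
map_score = mean_average_precision(rankings)
rank1 = cmc_at_rank(rankings, 1)
rank2 = cmc_at_rank(rankings, 2)
```

Rank-1 only rewards the first result, while mAP also credits correct matches that appear further down the list, which is why the two numbers are reported together.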

Experimental Results.
In this subsection, we first evaluate the part attention module and show that the two-peak Gaussian map better captures the main body information of a person. Then, the effect of low light is analyzed; we find that the low-light condition has a negative impact on pose alignment. Finally, we compare the proposed IIPA-Net with other state-of-the-art re-id methods.

To better illustrate the effect of the proposed part attention module, we visualize the attention maps of the model with normal and two-peak Gaussian maps. In Figure 6, we can observe that the two-peak Gaussian map attends to both the upper and lower parts of a person, while the normal one attends only to either the upper (Figure 6(a)) or the lower (Figure 6(b)) part. Introducing the two-peak Gaussian makes part attention work more effectively with the multimodal nature of a person. The third columns of Figure 6 show that the proposed part attention is able to produce similar attention maps under different light conditions.

In Figure 7(a), using AlignedReID++ [13] as the baseline model, the fifth block of the left image is aligned to the fourth and sixth blocks of the right image, and the distance between the two images is 0.7333, which is greater than that of the negative pair (0.5557). However, after decomposing the illumination, our proposed method is able to align the head, chest, foot, etc., of the positive pair images, and the distance is reduced to 0.4195, which is less than that of the negative pair (0.5775), as illustrated in Figure 7(b). The wrong connections of the baseline can be attributed to the negative impact of low illumination. This indicates that the proposed approach eliminates the effect of weak illumination and learns illumination-invariant features.

Performance of Different Loss Functions.
We train four models with softmax+triplet loss ($L_{ID} + L^{I}_{T} + L^{P}_{T}$), softmax+instance [36] loss ($L_{ID} + L^{I}_{I} + L^{P}_{I}$), softmax+circle loss ($L_{ID} + L^{I}_{C} + L^{P}_{C}$), and the proposed loss. The performance on Market1501 is presented in Table 1. $L^{I}$ and $L^{P}$ represent the losses of the global and local features, respectively. We can observe that Softmax+Instance and Softmax+Circle loss achieve similar Rank-1 accuracy. Compared with Softmax+Triplet, the proposed loss improves Rank-1 and mAP by approximately 0.3 and 0.2, respectively. We believe that the circle loss is effective on some hard local features.
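For reference, the triplet and circle terms compared here are commonly written in the forms below. This is a sketch of the standard definitions from the literature, not a restatement of the exact settings used in this paper: the distances $d(a,p)$ and $d(a,n)$ between an anchor and its positive or negative sample, the within-class and between-class similarity scores $s_p^i$ and $s_n^j$, the margin $m$, and the scale $\gamma$ are notation introduced for illustration, and their values are not given in this section.

```latex
% Triplet loss: push the negative at least a margin m farther than the positive.
L_T = \max\bigl(0,\; d(a,p) - d(a,n) + m\bigr)

% Circle loss: each similarity score gets its own adaptive weight.
L_C = \log\Bigl[\,1 + \sum_{j} \exp\bigl(\gamma\,\alpha_n^{j}\,(s_n^{j} - \Delta_n)\bigr)
      \sum_{i} \exp\bigl(-\gamma\,\alpha_p^{i}\,(s_p^{i} - \Delta_p)\bigr)\Bigr]

% Weights and margins of the circle loss.
\alpha_p^{i} = \bigl[\,O_p - s_p^{i}\,\bigr]_+, \quad
\alpha_n^{j} = \bigl[\,s_n^{j} - O_n\,\bigr]_+, \quad
O_p = 1 + m, \quad O_n = -m, \quad \Delta_p = 1 - m, \quad \Delta_n = m
```

The adaptive weights grow for scores that are far from their optimum, so poorly matched pairs contribute more to the gradient than pairs that are already well separated.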

Comparison with State-of-the-Art.
To evaluate the performance of the proposed IIPA-Net, we compare its experimental results with those of several state-of-the-art methods. Our baseline is AlignedReID++ [13], which focuses on solving the pose change problem. To demonstrate the advantage of the proposed framework, we also report the results of the baseline combined with a low-light enhancement method: both the training and testing image sets are enhanced with MSRCP [37] and then fed into the baseline.
As shown in Table 2, our proposed framework outperforms most state-of-the-art methods on all four datasets. Specifically, the proposed framework achieves 96.2% Rank-1 on Market1501 and 90.8% Rank-1 on DukeMTMC-reID, outperforming other attention-based methods, i.e., MHN-6 [9] and DSA [38]. Although FlipReID [39] and st-ReID [40] achieve the best performance, they rely on extra data, for instance, spatial and temporal information, to train the network. For the low-light Market and Duke datasets, the Rank-1 accuracy of the proposed method is increased by 10.1% and 11.2%, and the mAP by 9.5% and 6.0%, respectively. This demonstrates that our joint framework not only eliminates the impact of low light but also explores pose-invariant local features for person re-id. Figure 8 depicts five examples of queries together with the top 10 retrieved results of the baseline and IIPA-Net on the low-light Market dataset. As we can see, the IIPA-Net

Ablation Study.
To verify the contribution of each component, we perform an ablation study on the normal- and low-light Market datasets. Table 3 shows the results for each component of IIPA-Net. We note that the attention component achieves better results on the Market1501 dataset, whereas retinex is better in low-light conditions. The combination of retinex and attention achieves the best performance on both datasets. The reason is that IIPA-Net is able to learn both illumination-invariant and pose-invariant features.

Conclusions
In this paper, we proposed a joint illumination-invariant and pose-aligned learning framework for person re-id. Motivated by retinex theory, we introduced a retinex decomposition net to eliminate the impact of different lighting and extract an illumination-invariant feature. To tackle the problem of pose alignment, dynamic matching of local information is utilized to align the local features, which are derived from the deep feature map. Based on the multimodal nature of a person, we proposed a part attention mechanism to extract the most discriminative global feature. The joint framework is trained in a four-stage fashion. Experiments demonstrate that the proposed framework achieves better performance on both normal- and low-light datasets. In the future, we will focus on long-term re-id scenarios, which present more complex scene variations.

Data Availability
All data included in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.