Learning Identity-Consistent Feature for Cross-Modality Person Re-Identification via Pixel and Feature Alignment

RGB-IR cross-modality person re-identification (ReID) can be seen as a multicamera retrieval problem that aims to match pedestrian images captured by visible and infrared cameras. Most existing methods focus on reducing modality differences through feature representation learning; however, they ignore the huge difference between the two modalities in pixel space. Unlike these methods, in this paper we use a pixel and feature alignment network (PFANet) to reduce modal differences in pixel space while aligning features in feature space. Our model contains three components: a feature extractor, a generator, and a joint discriminator. As in previous methods, the generator and the joint discriminator are used to generate high-quality cross-modality images; however, we make substantial improvements to the feature extraction module. First, we fuse batch normalization and global attention (BNG), which attends to channel information while conducting information interaction between channels and spaces. Second, to alleviate the modal difference in feature space, we propose the modal mitigation module (MMM). Then, by jointly training the entire model, our model is able not only to mitigate the cross-modality and intramodality variations but also to learn identity-consistent features. Finally, extensive experimental results show that our model outperforms other methods in both rank-1 accuracy and mAP on the SYSU-MM01 dataset.


Introduction
Person ReID can be viewed as a cross-camera image retrieval problem, which aims at matching pedestrian images in a query set to those in a gallery set captured by different cameras. Its main challenge lies in the interclass and intraclass variations caused by different lighting, poses, occlusions, and views. Most existing methods [1][2][3][4][5] mainly focus on matching RGB images captured by visible cameras, which can be formulated as an image matching problem under a single modality. However, these methods cannot be applied to images taken in poor lighting conditions, because a visible camera cannot capture pictures with discriminative features, whereas in practical application scenarios the camera should ensure all-weather operation.
Since the visible camera has limited effect on security work at night, cameras that can switch to an infrared mode are widely used in intelligent monitoring systems. In visible mode and infrared mode, RGB images and infrared images are collected, respectively, which belong to two different modalities. RGB images have three channels but IR images have only one, so the ReID problem in a cross-modality setting, essentially a cross-channel retrieval problem, becomes extremely challenging. First, infrared images of different identities are often difficult to distinguish even when their visible counterparts are easy to tell apart. In addition, the appearance of the same person varies greatly across modalities. This is known as modality discrepancy.
To address visible-infrared person ReID, several approaches [6][7][8][9][10] have been proposed, aiming to mitigate modal differences by aligning features or pixel distributions. Feature alignment methods [6,8,10] mainly focus on bridging the gap between RGB and IR images through features. However, it is difficult to match RGB and IR images in a shared space because of the large cross-modality differences between them. Different from existing methods that directly match RGB and IR images, we use generative adversarial networks to generate fake IR images based on real RGB images and then match the generated images through a feature alignment network. The generated fake IR images are used to reduce the modality difference between the RGB and IR images. Although the generated fake IR images are very similar to real ones, intraclass differences remain due to pose variations, viewpoint changes, and occlusions. Inspired by the above discussion, in this paper, we propose a pixel and feature alignment network (PFANet) that simultaneously mitigates cross-modality differences in pixel space and intramodality variation in feature space. As shown in Figure 1, to reduce the modal difference, we apply a generator (G_I) to generate fake IR images. Then, to alleviate the intramodality variation, a feature extraction module (F) is designed to encode fake and real IR images into a shared feature space by exploiting identity-based classification and triplet loss. The batch normalization and global (BNG) attention is added to the feature extraction network (F), which lets the network learn which channels are more important and enables interaction between channels and spaces. Furthermore, to mitigate the modal difference in the feature space, a modal mitigation module (MMM) is proposed, which can significantly mitigate the difference between the two modalities. Finally, to learn identity-consistent recognition, a joint discriminator (D) is utilized.
Its input is an image-feature pair.
The major contributions of this work can be summarized as follows: (1) we propose a pixel and feature alignment network (PFANet) that simultaneously mitigates cross-modality differences in pixel space and intramodality variation in feature space; (2) we design a BNG attention module that attends to channel information while enabling interaction between channels and spaces, and a modal mitigation module (MMM) that alleviates the modal difference in feature space; (3) extensive experiments on the SYSU-MM01 dataset show that our model outperforms existing methods.

RGB-IR Person ReID.
RGB-IR cross-modality person ReID can be seen as a multicamera retrieval problem that aims to match pedestrian images captured by visible and infrared cameras, which are widely used in video surveillance, public security, and smart cities. Compared with RGB-RGB single-modality person ReID, which deals only with RGB images, the key challenge in this work is to mitigate the large differences between the two modalities. To address the challenge caused by differences in modality distributions, a variety of approaches to cross-modality person re-identification have been proposed. Some early work focused on solving the channel mismatch between RGB and IR images, since RGB images have three channels while IR images have only one. Wu et al. [10] proposed a deep zero-padding network and contributed a new ReID dataset, SYSU-MM01. In [11], a dual-path network with a bi-directional dual-constrained top-ranking loss was introduced to learn modality-aligned feature representations for RGB-IR ReID. Feng et al. [12] proposed a framework for solving heterogeneous matching problems using modality-specific networks. Ye et al. [13] proposed a dual-stream network with feature learning and metric learning to convert the two heterogeneous modalities into a consistent space where they share a metric. Dai et al. [6] introduced a cross-modality generative adversarial network (cmGAN) to reduce the distribution differences between RGB and IR features.

Figure 1: Framework of the proposed model. It consists of an image generation module (G), a joint discriminator module (D), and a feature extraction module (F). The G can generate fake IR images X_ir' to mitigate the cross-modality variation, and the F can alleviate the intramodality variation. The F module contains ResNet-50, BNG attention, and the MMM module. The BNG module can focus on channel and spatial information, and the MMM module can reduce modality differences.
Most of the above approaches focus on reducing intermodality differences by feature alignment, while ignoring the large cross-modality differences in pixel space. Unlike these approaches, the proposed model combines feature alignment and pixel alignment, effectively reducing intramodality and cross-modality variations, and by joint training it is able to learn identity-consistent features.

GAN in Person ReID.
A generative adversarial network (GAN) consists of a generator and a discriminator trained with a game-theoretic objective: the generator tries to generate images that deceive the discriminator, while the discriminator tries to judge whether an image is real or generated.
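As a minimal sketch of this min-max game, the standard (non-saturating) GAN objectives can be written as follows. The scalar discriminator outputs below stand in for the predictions of a full network, so this illustrates only the shape of the losses, not the paper's implementation.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Push the discriminator's output toward 1 for real images
    # and toward 0 for generated (fake) images.
    return -(np.log(d_real) + np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # The generator tries to make the discriminator output 1 for its
    # fakes (non-saturating form of the min-max objective).
    return -np.log(d_fake)
```

A confident discriminator (real near 1, fake near 0) incurs a low discriminator loss, while the generator's loss falls as its fakes fool the discriminator.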
Through repeated adversarial training, generative adversarial networks are able to learn deep representations of data in a self-supervised manner. GANs can generate high-quality images, perform image enhancement, generate images from text, and convert images from one domain to another [14,15]. GAN was first proposed in 2014 [16]. Since then, researchers have proposed a variety of task-specific GAN structures, such as CycleGAN [14], Pix2Pix [17], and StarGAN [15].
There are many works in the field of pedestrian re-identification that also apply GANs to improve accuracy. Li et al. [18] proposed a network that allows querying images of different resolutions to handle cross-resolution person ReID. Wang et al. [19] designed an end-to-end alignment generative adversarial network (AlignGAN) for the RGB-IR ReID task. JSIA-ReID [20] implemented a two-level alignment of pixels and features in a unified GAN framework.
In our work, we apply GAN to generate cross-modality images that mitigate modal differences between RGB-IR image data in pixel space.

Attention Mechanisms.
There is an important property of the human visual system that allows people to selectively focus on things of interest in order to capture valuable information. Inspired by the human visual system, many works have attempted to employ attention mechanisms to improve the performance of CNNs.
Attention mechanisms enable the network to focus on regions of interest on the human body and better extract useful information. SENet [21] integrated spatial information into the channel-level feature responses and computed the corresponding attention with two MLP layers. Later, the bottleneck attention module (BAM) [22] built independent spatial and channel submodules in parallel and embedded them into each bottleneck block. Considering the relationship between any two positions of the feature map, nonlocal attention [23] was proposed to capture long-range dependencies. The convolutional block attention module (CBAM) [24] sequentially cascaded channel attention and spatial attention. However, these works ignore the information carried by the weights learned during training; we therefore highlight salient features using the variance of the trained model weights, which also amplifies cross-dimensional interactions and captures important features across all three dimensions. We propose a new attention module (BNG) to address this problem. In addition, a modal mitigation module (MMM) is designed to mitigate the modal distribution gap, using channel attention to guide the learning of instance normalization (IN) so that modal differences are mitigated while identity information is preserved.

The Proposed Method
In this part, we introduce the proposed PFANet in detail. Our network will be presented in the following three parts, including (1) RGB-IR images generation module, (2) BNG attention module, and (3) modal mitigation module. To reduce cross-modality variation, we apply generative adversarial networks to convert RGB images to fake IR images, which have IR style while maintaining their original identity.
Then, the features of the two modalities are extracted for feature alignment. The BNG attention is designed to make the network focus on channel and spatial information. In addition, the modal mitigation module (MMM) is proposed to mitigate the differences between the two modalities. During testing, the main output of PFANet is the feature used for person ReID.

RGB-IR Images Generation Module.
There is a large cross-modality difference between RGB and IR images, which significantly increases the difficulty of cross-modality pedestrian re-identification. To reduce cross-modality variation, we apply generative adversarial networks to convert RGB images X_rgb into fake IR images X_ir', which have IR style while maintaining their original identities. The generated fake IR images X_ir' can mitigate the modality differences between RGB and IR images. The module consists of a generator G_I that generates a fake IR image from an RGB image and a joint discriminator D_I that discriminates whether an image is real or generated. The input of the generator is the real images X_rgb, and its output is the fake IR images X_ir' = G_I(X_rgb). If the input image is real, the discriminator should output one; if it is generated, it should output zero. The goal of the generator is to make the generated image as similar as possible to a real image, and the goal of the discriminator is to judge as accurately as possible whether the input is real or generated. Unlike ordinary discriminators, the input to our discriminator is a pair consisting of an IR image and its ReID feature map. The generator and discriminator play the min-max game of [16], which makes the fake IR image X_ir' as realistic as possible. The adversarial loss for generating IR images is defined as follows:

L_{G_I} = -E_{X_rgb}[log D_I(X_ir', f_{map,R}^{X_ir'})],    (1)

where the corresponding discriminator loss is

L_{D_I} = L_{D_I}^{real} + L_{D_I}^{fake},    (2)

with

L_{D_I}^{real} = -E_{X_ir}[log D_I(X_ir, f_{map,R}^{X_ir})],    (3)

L_{D_I}^{fake} = -E_{X_rgb}[log(1 - D_I(X_ir', f_{map,R}^{X_ir'}))].    (4)

Among them, f_{map,R}^{X_ir} is the extracted feature map of X_ir and f_{map,R}^{X_ir'} is the extracted feature map of the generated image X_ir'. Equation (1) is used to train the generator; under the constraint of this loss, the generator generates more realistic IR images. Equations (3) and (4) are used to train the discriminator, which differs from traditional discriminators in that its input is an image-feature pair.
This has two advantages: first, the fake IR image X_ir' becomes closer to the real IR image X_ir through the min-max game [16], and the distribution of the features f_{map,R}^{X_ir'} of the fake IR image becomes more similar to the real image features f_{map,R}^{X_ir}. Second, f_{map,R}^{X_ir'} is able to maintain identity consistency through the constraint of the corresponding image X_ir'. Although the L_{G_I} loss can ensure that the fake IR image X_ir' resembles the real IR image X_ir, it cannot guarantee that the generated fake IR images retain the structure and content of the original RGB images X_rgb. To deal with this problem, we introduce a generator G_R for translating IR images into RGB images and the corresponding discriminator D_R. We also introduce a cycle-consistency loss, which is defined as follows:

L_cyc = E_{X_rgb}[||G_R(G_I(X_rgb)) - X_rgb||_1] + E_{X_ir}[||G_I(G_R(X_ir)) - X_ir||_1].    (5)

The L_cyc loss enables the IR image generated by G_I to remain consistent with the input real RGB image. We use the L1 norm instead of the L2 norm because the L1 norm allows the generator to generate sharper image edges. Specifically, we input the real RGB image X_rgb into the generator G_I to generate the fake IR image X_ir' and then use the generator G_R to generate the reconstructed RGB image from the fake IR image; we do the same in the opposite direction for IR images. Now, the loss of the generator can be defined as follows:

L_G = L_{G_I} + L_{G_R} + ω L_cyc,    (6)

where ω is the weight of the cycle loss and is set to 10 as in [14]. By using this loss during adversarial training, we can generate high-quality IR images.
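The cycle-consistency idea above can be sketched in NumPy as follows; the generator callables and tensor shapes are illustrative stand-ins for the paper's actual networks, and only the loss arithmetic is shown.

```python
import numpy as np

def l1_loss(a, b):
    # Mean absolute error; the L1 norm preserves sharper edges than L2.
    return np.mean(np.abs(a - b))

def cycle_consistency_loss(x_rgb, x_ir, g_i, g_r):
    # g_i: RGB -> fake IR, g_r: IR -> fake RGB (callables standing in
    # for the two generators). Reconstructions in both directions
    # should match the original inputs.
    rec_rgb = g_r(g_i(x_rgb))   # RGB -> IR -> RGB
    rec_ir = g_i(g_r(x_ir))     # IR -> RGB -> IR
    return l1_loss(rec_rgb, x_rgb) + l1_loss(rec_ir, x_ir)

def generator_total_loss(adv_loss, cyc_loss, omega=10.0):
    # Adversarial terms plus the cycle term weighted by omega = 10.
    return adv_loss + omega * cyc_loss
```

With perfect generators the cycle loss vanishes, so only the adversarial terms remain in the total generator objective.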

The BNG Attention Module.
Our proposed BNG attention is an efficient and lightweight attention mechanism. The BNG attention can be embedded at the end of any convolutional neural network; for ResNet-50, it is embedded at the end of each residual block. The structure of BNG is shown in Figure 2.
BNG attention consists of two submodules. As shown in Figure 2(a), the channel attention submodule uses the weight information of the trained model to highlight salient features. We obtain its scale factor from batch normalization (BN [25]):

BN(B_in) = γ (B_in - μ_B) / sqrt(σ_B^2 + ε) + β,

where μ_B and σ_B are the mean and standard deviation of mini-batch B, and γ and β are the trainable parameters used to fit the data distribution. The channel attention can then be expressed as

M_c(F) = sigmoid(W_γ · BN(F)),

where γ is the scale factor of each channel and the weights are obtained as W_γ = γ_i / Σ_j γ_j. We measure the importance of each channel by applying the BN scale factor along the channel dimension and suppressing insignificant features.
Since channel attention only focuses on channel information, there is no global space-channel information interaction; to solve this problem, we design a global attention module. It can reduce information attenuation and amplify the features of global dimension interaction. Inspired by CBAM [24], channel attention and spatial attention are connected in turn; the main structure is shown in Figure 2(b). Given the input feature map F_1 ∈ R^{C×H×W}, the intermediate state F_2 and output F_3 are defined as

F_2 = M_c(F_1) ⊗ F_1,

F_3 = M_s(F_2) ⊗ F_2,

where M_c and M_s are the channel and spatial attention maps, respectively, and ⊗ denotes element-wise multiplication. The channel attention submodule uses a 3D arrangement to preserve information across the three dimensions and then a two-layer MLP that amplifies the channel-spatial dependencies across dimensions. The channel attention submodule is illustrated in Figure 3.
In the spatial attention submodule, two convolutional layers are used to fuse the spatial information, with the convolution kernel size set to 7 × 7. Since max-pooling reduces information and has a negative influence, we remove the max-pooling operation to retain more features. The same reduction ratio r as in the channel attention submodule is adopted, as in BAM.
The spatial attention submodule without group convolution is shown in Figure 4.
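As an illustration of the channel submodule, the BN-scale-factor weighting described above can be sketched in NumPy as follows. This is a simplified, assumption-laden sketch: the scale factors `gamma` would come from a trained BN layer, and the exact normalization and placement in the full network may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bn_channel_attention(feat, gamma):
    # feat: feature map of shape (C, H, W); gamma: the BN scale factor
    # of each channel, taken from a trained model. Channels with a
    # larger |gamma| are treated as more informative.
    w = np.abs(gamma) / np.sum(np.abs(gamma))   # W_gamma: normalized weights
    attn = sigmoid(w[:, None, None] * feat)     # per-channel attention map
    return feat * attn                          # reweighted features
```

Channels whose BN scale factor is near zero receive a near-uniform attention response, while strongly scaled channels are emphasized.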

Modal Mitigation Module (MMM).
To mitigate the modal distribution gap, a modal mitigation module (MMM) is designed. For an input image X, we denote the features extracted by the convolution block as M ∈ R^{h×w×c} and input them into the MMM, where h, w, and c represent the height, width, and number of channels of the feature map M, respectively. Instance normalization (IN) is used to mitigate modal differences on a single instance [27]: IN computes the mean and variance within a single instance and reduces the difference between the two data distributions. However, using IN directly may have a negative impact on the ReID task, because the distribution of the image data changes significantly and some identity information may be lost.
To overcome these shortcomings, we use channel attention to guide the learning of IN, which mitigates modal differences while preserving identity information. Specifically, we feed the feature into a two-layer MLP that downsamples the channels and then upsamples back to the original number of channels, and we use the activation function to produce a mask that supervises the IN operation:

M_out = m_C ⊗ M + (1 - m_C) ⊗ M̃,

where m_C is the channel mask representing the identity-related channels and M̃ is the instance-normalized result of the input M. Similar to SENet [21], the channel-dimension mask is generated as

m_C = σ(W_2 δ(W_1 g(M))),

where W_1 ∈ R^{c/r×c} and W_2 ∈ R^{c×c/r} are learnable parameters in the two bias-free fully connected (FC) layers, which are followed by the ReLU activation function δ(·) and the sigmoid activation function σ(·); g(·) denotes global average pooling of the features. To balance performance and the number of parameters, the downsampling ratio is set to r = 16. The formula for instance normalization is

M̃_j = (M_j - E[M_j]) / sqrt(Var[M_j] + ε),

where E[·] calculates the mean of each dimension and Var[·] calculates the variance of each dimension. To avoid dividing by zero, we add ε to the denominator, and M_j ∈ R^{h×w} is the j-th dimension of the feature map M.
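The mask-guided instance normalization can be sketched in NumPy as follows, under the assumption that identity-related channels (mask near 1) keep their original responses while the remaining channels are instance-normalized; the bottleneck weights `w1`, `w2` are illustrative stand-ins for the learned FC layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def instance_norm(m, eps=1e-5):
    # m: (C, H, W); normalize each channel over its spatial positions.
    mu = m.mean(axis=(1, 2), keepdims=True)
    var = m.var(axis=(1, 2), keepdims=True)
    return (m - mu) / np.sqrt(var + eps)

def mmm(m, w1, w2):
    # Channel mask from an SENet-style two-layer bottleneck:
    # m_C = sigmoid(W2 @ relu(W1 @ gap(M))).
    g = m.mean(axis=(1, 2))                       # global average pooling
    mask = sigmoid(w2 @ np.maximum(w1 @ g, 0.0))  # per-channel mask in (0, 1)
    mask = mask[:, None, None]
    # Identity-related channels keep the original features; the rest
    # are instance-normalized to suppress modality-specific statistics.
    return mask * m + (1.0 - mask) * instance_norm(m)
```

The output keeps the input shape, so the module can be dropped after any convolution block without changing the rest of the network.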

Loss Function.
In this section, we introduce the losses used when training the generator to generate a fake IR image X_ir'. On the one hand, X_ir' should be classified into the same identity class as the corresponding X_rgb; on the other

Mobile Information Systems
hand, X_ir' should satisfy the triplet loss [28] under the identity constraint of the corresponding X_rgb. We define these two losses as L_cls^gan = L_cls(X_ir') and L_tri^gan = L_tri(X_ir'), where p(·) is the predicted probability of belonging to the ground-truth identity; the ground-truth identity of the fake IR image X_ir' should be the same as that of the original RGB image X_rgb. Although the generated images can reduce cross-modality differences, there are still large intramodality differences caused by lighting, human pose, and view. We therefore bring the fake IR image X_ir' and the real IR image X_ir close in a shared space via identity-based classification and triplet loss, defining L_cls^feat = L_cls(X_ir' ∪ X_ir) and L_tri^feat = L_tri(X_ir' ∪ X_ir), where p(·) represents the predicted probability that the input belongs to the ground-truth identity and ∪ denotes the union of the two sets. In summary, the overall loss of our module is

L = L_G + L_{D_I} + L_cls^gan + L_tri^gan + L_cls^feat + L_tri^feat,

where L_G and L_{D_I} are the generator and discriminator losses defined above.
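The identity classification and triplet terms can be sketched per sample as follows; batching, the expectation over the set, and the margin value are simplified assumptions for illustration.

```python
import numpy as np

def cls_loss(logits, label):
    # Identity classification loss: -log p(ground-truth identity),
    # i.e., cross-entropy over the identity logits.
    z = logits - logits.max()          # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Pull same-identity features together and push different-identity
    # features apart by at least the margin (hinge form).
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)
```

Applying both losses to the union of fake and real IR features encourages a shared space in which identity, not modality, dominates.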

Datasets and Settings
We evaluate our model on SYSU-MM01 [10]. SYSU-MM01 is a very popular RGB-IR ReID dataset; it contains pedestrian images captured by six cameras, including two infrared cameras (camera3 and camera6) and four visible-light cameras (camera1, camera2, camera4, and camera5). For each pedestrian, there are at least 400 RGB and IR images with different poses and viewpoints. Among the identities, 296 are used for training, 99 for validation, and 96 for testing. Following [29], there are two test modes, i.e., all-search mode and indoor-search mode. For the all-search mode, all images are used; for the indoor-search mode, only indoor images from the 1st, 2nd, 3rd, and 6th cameras are used. Both modes employ single-shot and multishot settings, in which 1 or 10 images of a person are randomly selected to form the gallery set. Both modes use IR images as the probe set and RGB images as the gallery set.
Evaluation protocols: we use the cumulative matching characteristics (CMC) and mean average precision (mAP) as evaluation metrics. Following [29], the results on SYSU-MM01 are evaluated with the official code, averaged over 10 repeated random splits of the gallery and probe sets.
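For a single query, the two metrics can be sketched as follows; the official protocol additionally averages over all queries and the 10 random gallery splits, which is omitted here.

```python
def average_precision(query_id, ranked_gallery_ids):
    # ranked_gallery_ids: gallery identities sorted by similarity to the
    # query. AP averages the precision at each correct-match rank.
    hits, precisions = 0, []
    for rank, gid in enumerate(ranked_gallery_ids, start=1):
        if gid == query_id:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def rank_k_hit(query_id, ranked_gallery_ids, k):
    # CMC rank-k: 1 if a correct match appears in the top-k results.
    return int(query_id in ranked_gallery_ids[:k])
```

Averaging `rank_k_hit` over queries gives the Rank-1/Rank-10/Rank-20 scores reported later, and averaging `average_precision` gives mAP.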
Implementation details: we use ResNet-50 [30] pretrained on ImageNet as the CNN backbone, take the output of its pool5 layer as the feature map M, and apply average pooling to obtain the feature vector V. We add BNG attention to each layer of residual blocks in ResNet-50 and the MMM module after the third and fourth layers. For the triplet loss, we use an FC layer to map the feature vector V into a 256-dimensional embedding vector. For the classification loss, the classifier takes the feature vector V as input and consists of a 256-dim fully connected (FC) layer followed by batch normalization [25], dropout, and ReLU as the middle layer, and an FC layer with the identity-number logits as the output layer. The dropout rate is set to 0.5. We implement the model in PyTorch; the images are augmented by horizontal flipping, and the batch size is set to 72 (9 identities, each with 4 RGB images and 4 IR images). The learning rate of the generation and discriminator modules is set to 0.0002, optimized with Adam. The learning rates of the classifier and the embedder are set to 0.2 and that of the CNN backbone to 0.02, optimized with SGD.

Comparison with Other Methods.
In this section, we compare our method with several cross-modality person ReID methods: (1) methods with different structures and loss functions, i.e., two-stream [10], one-stream [10], zero-padding [10], BCTR [13], BDTR [13], D-HSME [26], and DGD + MSR [12], which learn modality-invariant features and align them in feature space, and (2) cmGAN [6] and JSIA [20], which use generative adversarial networks (GANs) to generate cross-modality IR images and mitigate modal differences in pixel space. The experimental results are shown in Table 1.
From Table 1, we can compare the various evaluation protocols, i.e., all-search/indoor-search and single-shot/multishot. First, for the same method, indoor-search performs better than all-search, because the images have less background variation in indoor mode and matching is easier. Second, the rank scores of single-shot are lower than those of multishot, but the mAP scores of single-shot are higher than those of multishot. This is because, in multishot mode, there are ten images of a person in the gallery set, while in single-shot mode there is only one; as a consequence, under the multishot mode it is much easier to hit one image but difficult to hit all images, and the situation is reversed under the single-shot mode. R1, R10, and R20 denote Rank-1, Rank-10, and Rank-20 accuracy (%), and mAP denotes the mean average precision score (%). Our model shows good performance: compared with JSIA, it improves Rank-1 by over 2.7% and mAP by 2.49% in the single-shot setting of all-search mode. In the single-shot setting of indoor-search mode, our model achieves a rank-1 accuracy of 44.0% and an mAP of 52.96%. In the multishot setting of indoor-search mode, our model achieves a rank-1 accuracy of 53.40% and an mAP of 44.35%, which are higher than those of JSIA by 0.7% and 1.65%, respectively.

Ablation Study.
In this section, we design ablation experiments to test the effectiveness of the BNG module and MMM module. Our ablation experiments are performed on the dataset SYSU-MM01 and use the single-shot setting of all-search mode.
Influence of BNG module: the results of ablation experiments for BNG attention are shown in Table 2. Compared with the baseline model (B), by adding BNG attention, the rank-1 accuracy and mAP are improved by 5.57% and 4.39%, proving the effectiveness of BNG attention.
Influence of MMM module: as shown in Table 2, the model with MMM (B + MMM) achieves a rank-1 accuracy of 39.97% and an mAP of 39.52%, which are higher than those of the baseline (B) by 5.84% and 5.98%, respectively. This demonstrates the effectiveness of our proposed MMM module.

Visualization of Generated Images.
For a more intuitive understanding of the generator, we show the learned fake IR images in Figure 5. The first row contains the real RGB images, the middle row the fake IR images produced by the generator, and the last row the real IR images. We can observe that the fake IR images have similar content (e.g., pose and view) to, and maintain the identity of, the corresponding real RGB images while having an IR style. Therefore, the generated fake IR images can bridge the gap between RGB and IR images and reduce cross-modality variation in pixel space.

Conclusion
In this paper, we proposed a new pixel and feature alignment network (PFANet) for the RGB-IR ReID task. The model consisted of a feature extractor, a generator, and a joint discriminator. The BNG attention and the MMM module were designed in the feature extraction module. Through these two modules, the model not only mitigated modality differences but also paid attention to channel and global information. The cross-modality IR images were generated by the generator, which could bridge the gap between RGB and IR images and reduce cross-modality variation. Ablation experiments verified the effectiveness of each module. Extensive experiments on the SYSU-MM01 dataset illustrated that our model achieved state-of-the-art performance.
Data Availability

The SYSU-MM01 data used to support the findings of this study have been deposited in the "Rgb-infrared cross-modality person re-identification" repository (http://isee.sysu.edu.cn/project/RGBIRReID.html).

Conflicts of Interest
The authors declare that they have no conflicts of interest.