CNN with Embedding Transformers for Person Reidentification



Introduction
Person reidentification (ReID) is a difficult problem in computer vision and is often treated as a retrieval problem. Given a pedestrian probe from the query set, we need to find the same pedestrian under different cameras in the gallery. Most previous works utilize a convolutional neural network (CNN) combined with classification loss and metric loss [1] to complete the corresponding classification and recognition tasks. Such global feature-based methods struggle with variations in pose, viewing angle, and light intensity, as well as with occlusion (as shown in Figure 1). To make the model focus on more local features, some works [2-5] use a pose estimation network to obtain key points of the human body, divide the body into parts using those key points, and let the network focus on learning part features. However, this approach requires more graphics processing unit (GPU) resources. Other methods [6,7] extract part features of pedestrians by horizontal block slicing, but they all need to align the part features to improve model performance. The part-based convolutional baseline (PCB) [8] does not perform alignment: it divides the feature map equally into body parts, performs global average pooling (GAP) on each part, and then uses fully connected (FC) layers and softmax to classify each pedestrian ID directly. This approach, however, eventually leads to pixel-level feature cross-aliasing (as shown in Figure 2). AlignedReID [9] utilizes dynamically matching local information (DMLI) to solve the alignment problem. Besides, Kalayeh et al. [10] utilize segmentation and a channel attention mechanism to solve the occlusion problem.
With the abovementioned techniques (horizontal slicing, component alignment, alignment algorithms, attention mechanisms, etc.), a convolutional network can pay more attention to and learn more robust features. But the CNN itself operates with a limited sliding window, which restricts shallow feature extraction and leads to aliasing between subsequent features (such as in PCB). Because the transformer [11] structure, based on the self-attention mechanism, is good at capturing correlations between shallow and deep features, we introduce it into the field of person ReID and propose a CNN-based embedding transformer (CET) architecture. In this architecture, the embedded residual transformer (RT) structure models the shallow features extracted by the CNN so that each pixel-level feature (token) carries global information in the shallow layers, which is conducive to the subsequent feature redirection. For the redirected feature map, horizontal slicing and GAP are used to obtain the features of each component. An external transformer structure with a class token is further embedded to model the correlation between components, so that more useful information is fused into the class token vector. In addition, this paper adds a feature fuse with learnable vector (FFLV) structure to further strengthen feature learning.

[…] Figure 3, we calculate its cosine distance with $p_i$ ($i = 1, 2, \ldots, 6$) among $P_t$; in order to compare with PCB, we take $n = 6$ and cut $F$ into six horizontal slices by using part-based global average pooling (PGAP). In a word, we calculate the cosine distance of each $p_i$ ($i = 1, 2, \ldots, 6$) with the PGAP vector of each feature map and display it in the color of the closest part. PCB, part-based convolutional baseline.

Mathematical Problems in Engineering
The final results show that the CET architecture does greatly alleviate the pixel-level feature cross-aliasing that exists in PCB and other horizontal slicing methods. Besides, CET achieved better performance in Market1501 [12], DukeMTMC-ReID [13], and CUHK03 [6] datasets. In conclusion, the contributions of this paper are as follows: (1) A new transformer-based network structure CET for image-based person ReID is proposed, which can not only grasp the global information of the low-level features of the shallow network but also fuse the deep high-level semantic features globally, which greatly improves the efficiency of feature learning. (2) A novel RT structure is designed, using double residual connections to further strengthen the network's global grasp of shallow features. (3) FFLV is proposed to perform weighted summation of the encoded output vectors of the transformer so as to perform feature fusion more efficiently and use two branches loss (TBL) to balance and constrain the network.

Deep Person ReID.
In recent years, owing to the development of deep networks, deep learning-based person ReID methods have achieved surprising performance [14][15][16]. These methods can be roughly divided into two categories: representation learning and metric learning. Representation learning treats person ReID as a classification problem and trains the network with softmax and cross-entropy loss (ID loss). For example, the ID embedding network [17,18] treats each pedestrian ID as a class, so the number of classification categories equals the number of pedestrians in the dataset. The works in [19,20] combined validation loss and ID loss and achieved better performance on deep CNNs. Metric learning methods [15, 21-27] first extract features from raw images using a deep network and then calculate the distance between features to measure their similarity. Triplet losses [15,21] shorten the distance between positive sample pairs while enlarging the distance between negative sample pairs, so that different pedestrians can be distinguished. The works in [26,27] combined the softmax loss of representation learning with a metric learning loss to accelerate model convergence.

Local Feature Alignment.
In recent years, more and more works [2-5, 28-31] have begun to focus on the learning of local features. The pedestrian's body pose is also an important clue for person ReID tasks. For example, Zheng et al. [2] align pedestrians to a standard posture by pose invariant embedding, thereby mitigating the influence of pose bias. The global-local alignment descriptor [4] utilizes human pose key points to extract features of specific regions of the human body. SpindleNet [2] uses a region part generation network to generate the regions of various parts of the human body and combines the corresponding regions in the feature map to extract features. However, these methods all require additional posture annotation, which not only increases labor costs but may also introduce wrong posture annotation information.
On the other hand, some works [6-8, 30, 32, 33] enhance the performance of the model by horizontally partitioning and extracting the features of each block separately. Li et al. [6] and Cheng et al. [7] directly align the parts under the premise of pose alignment, which turns out to be an effective method as well.

Transformer and Person ReID.
The attention mechanism focuses on important features while ignoring irrelevant features. Introducing the attention mechanism can help solve problems such as occlusion and misalignment in person ReID, because the attention mechanism assigns weights to each pedestrian component automatically. Occluded parts are assigned low weights, while similar weights are learned for identical human parts that are not aligned. Methods based on CNN have dominated the ReID field for many years. However, some pure transformer methods, as well as hybrid CNN-transformer models, have also become popular in ReID recently.
In 2017, the transformer [11] emerged in the field of natural language processing (NLP) and has continued to develop since. NLP models such as the generative pre-trained transformer [34] and bidirectional encoder representations from transformers [35] came out one after another, demonstrating the transformer's powerful ability to represent correlations between sequence tokens. Based on this, Google's ViT [36] successfully introduced the transformer structure into the image field to model the correlation between image patches. The results show that the transformer's performance in the visual field is comparable to that of CNNs. TransReID [37], the first method to apply the ViT architecture to person ReID, has achieved good performance in both vehicle ReID and person ReID. AAformer [38] also applies the ViT backbone network, with additional part tokens (Pt) to characterize and aggregate pedestrian part information. LA-transformer [39] replaces the PCB backbone ResNet-50 with a transformer and fuses the global token into each human body part. In addition, some other works [40][41][42] combine transformers and CNNs so that hierarchical features and local features are aligned.

Transformer Introduction.
In general, the transformer [11] is composed of the following main parts.
Let the input sequence be $T_{IN} = \{t_1, t_2, \ldots, t_n\}$. Each vector $t_i$ in $T_{IN}$ is multiplied by a set of three learnable matrices $W_q$, $W_k$, $W_v$ to obtain three output matrices $Q$, $K$, $V$. Then, for each $q_i$ ($i = 1, \ldots, n$) in $Q$ and $k_1 \sim k_n$ in $K$, we calculate the self-attention dot product. To prevent the dot product from becoming too large, which would cause problems with subsequent gradients, a scaling factor $\sqrt{d}$ is applied, where $d$ is the dimension of $q_i$ or $k_i$. In summary, the encoded output vector of this group can be represented by the following formula:

$$o_i = \sum_{j=1}^{n} \mathrm{softmax}_j\left(\frac{q_i k_j^{T}}{\sqrt{d}}\right) v_j$$

Similarly, all other outputs $o_k$ ($k \neq i$) can be obtained by performing the same weighted sum operation with the other self-attention weights. The previous process can be further abstracted and written as the following matrix representation:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V$$

Multihead Self-Attention. If only one set of $W_q$, $W_k$, $W_v$ is used, the attention output is inevitably too limited to adapt to practical, complex application environments. In different environments and different tasks, the model needs to be able to notice the correlation between different parts. Therefore, it is necessary to set up multiple sets of $W_q$, $W_k$, $W_v$ matrices to represent complex actual situations. The output matrix of the multihead attention mechanism, $Z_{OUT}$, has the same dimension as the input matrix.
In a word, the $h$ groups of $W_q$, $W_k$, $W_v$ matrices produce $h$ groups of outputs. We concatenate the output matrices of the $h$ groups and multiply the result by a matrix $W_Z$ to obtain $Z_{OUT}$, the encoded output with the same dimension as the input $T_{IN}$.
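The self-attention and multihead steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`matmul`, `softmax`, `self_attention`, `multihead_attention`) and all weight values are hypothetical, and matrices are stored as lists of rows.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(t_in, w_q, w_k, w_v):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    q, k, v = matmul(t_in, w_q), matmul(t_in, w_k), matmul(t_in, w_v)
    d = len(q[0])  # dimension of each q_i / k_i
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v)

def multihead_attention(t_in, heads, w_z):
    """Run h heads, concatenate their outputs per token, then project with W_Z."""
    outputs = [self_attention(t_in, w_q, w_k, w_v) for (w_q, w_k, w_v) in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(t_in))]
    return matmul(concat, w_z)

# Toy example: n = 3 tokens of dimension 4, h = 2 heads of dimension 2.
t_in = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
head_a = ([[1, 0], [0, 1], [0, 0], [0, 0]],) * 3  # hypothetical W_q, W_k, W_v
head_b = ([[0, 0], [0, 0], [1, 0], [0, 1]],) * 3
w_z = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]  # identity W_Z
z_out = multihead_attention(t_in, [head_a, head_b], w_z)
```

Here `z_out` has the same shape as `t_in` (3 × 4), which mirrors the statement that the multihead output keeps the input dimension.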

Multiple Layer Perceptron (MLP), Residual Connection, and LayerNorm.
Just as the residual connection was introduced in CNNs to solve the vanishing gradient problem caused by overly deep networks, a residual connection is introduced in every transformer block. In addition, the transformer also contains a simple MLP and LayerNorm, as shown in Figure 3.
For the position encoding, $d$ is the dimension of the input vectors, $pos$ ($pos = 1, 2, \ldots, n$) indicates the position of the corresponding vector in the component block, and $i$ ($i = 0, 1, 2, \ldots, n$) represents the specific dimension.
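The symbols $d$, $pos$, and $i$ match the fixed sinusoidal position encoding of the original transformer [11]. The sketch below assumes that form; this is an assumption made here, and the paper's own equations may differ in detail. The function name `positional_encoding` is hypothetical, and positions are 0-based in the code.

```python
import math

def positional_encoding(n, d):
    """Return an n x d table of fixed sinusoidal position encodings.

    ASSUMPTION: the sin/cos form of the original transformer, with
    PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(same angle).
    """
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for j in range(d):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d))
            pe[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encoding(n=7, d=8)  # e.g. six part vectors plus one class token
```

Each row of `pe` would be added to the token at the same position before the sequence enters the encoder.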

Model
Overview. The diagram of CET is shown in Figure 3; it can use an arbitrary CNN as the backbone network. In CET, we build a new feature extractor, transformer in CNN (TIC), by combining ResNet-50 and the transformer. First, the input image is passed through TIC to obtain a feature map with global information. Second, the feature map is sliced horizontally, and part-based global average pooling (PGAP) is performed to obtain the block component features $P_t$, which are followed by a linear projection block. Then, by adding the global classification feature vector, called the global token embedding (GTE) vector, and the position encoding (PE), we obtain $Z$, which is fed into the transformer encoder to get the output $O$. Finally, we design an FFLV block, which uses matrix multiplication to weight the output encoding vectors and obtain the first classification token vector $f$. The second classification token vector is the global token vector $O_0$ encoded by the transformer encoder. The two vectors pass through the TBL block to classify the pedestrian.
Structure of TIC. In TIC (as shown in Figure 4), to allow the CNN backbone network to extract features with global information in the shallow layers, we embed the RT module, which uses an embedded transformer to model the correlation between global information and thereby alleviates the subsequent feature aliasing phenomenon.
Let the input pedestrian image be $X$ ($X \in \mathbb{R}^{H \times W \times 3}$). Considering the size of the feature map, we embed one RT between conv2_x and conv3_x and another between conv3_x and conv4_x in ResNet-50. Before embedding the first RT structure, we split the CNN backbone (ResNet-50) into two parts: CNNprelayer (conv1 and conv2_x) and CNNnextlayer (conv3_x, conv4_x, and conv5_x). First, the input feature map of the RT, $G_{in}$, is computed from $X$ by CNNprelayer. Then, $G_{in}$ passes through the Tokenizer in the RT to get the input of Transformer1, $T_{in}$ ($T_{in} \in \mathbb{R}^{L \times C_t}$). In the Tokenizer formula, $T_{pre}, T_t \in \mathbb{R}^{L \times C_t}$ and $W_A \in \mathbb{R}^{C_t \times L}$; random initialization is used for the first $T_{pre}$ in the RT. Then, $T_{in}$ passes through Transformer1 to give $T_{out}$ ($T_{out} \in \mathbb{R}^{L \times C_t}$). The reconstructed features $G_{out}$ ($G_{out} \in \mathbb{R}^{H_t \times W_t \times C_t}$) are obtained through the Projector, in whose formula $W_Q, W_K \in \mathbb{R}^{C_t \times H_t W_t}$. $G_{out}$ is the output reconstructed feature map. To better capture the global information in the shallow feature map, we add double skip connections in the RT, in Equations (9) and (11), respectively. Similarly, a second RT can be embedded between conv3_x and conv4_x of ResNet-50. Finally, after passing through CNNnextlayer (conv4_x and conv5_x) and PGAP, the output of TIC, $P_t$, is obtained.

FIGURE 4: Structure of TIC. TIC embeds an RT structure between conv2_x and conv3_x and between conv3_x and conv4_x of ResNet-50. In the RT structure, the tokenizer first maps the feature map $G_{in}$ to the token vector $T_t$; the transformer then produces the output token vector $T_{out}$; finally, the projector reconstructs the feature map $G_{out}$.
Here, $P_t \in \mathbb{R}^{n \times C}$. Formula (12) uses PGAP to horizontally slice the feature map, and after linear projection we get $P_T$ ($P_T \in \mathbb{R}^{n \times C_2}$). Let $P_T = \{p_1, p_2, p_3, p_4, p_5, \ldots, p_n\}$; we then add the global class token ($cls$) classification vector for feature fusion in the subsequent transformer encoder module.
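The PGAP step can be illustrated with a short sketch. It assumes a feature map stored as a nested list indexed `[h][w][c]` and a height divisible by the number of parts; the function name `pgap` and the toy values are hypothetical, and the linear projection that follows PGAP is omitted.

```python
def pgap(feature_map, n_parts):
    """Part-based global average pooling over an H x W x C feature map.

    feature_map is a nested list indexed as [h][w][c]; the height H must be
    divisible by n_parts. Returns n_parts vectors of length C.
    """
    height = len(feature_map)
    channels = len(feature_map[0][0])
    if height % n_parts != 0:
        raise ValueError("height must be divisible by n_parts")
    rows_per_part = height // n_parts
    parts = []
    for p in range(n_parts):
        # Take one horizontal slice and average all of its pixel features.
        rows = feature_map[p * rows_per_part:(p + 1) * rows_per_part]
        pixels = [pixel for row in rows for pixel in row]
        parts.append([sum(px[c] for px in pixels) / len(pixels) for c in range(channels)])
    return parts

# Toy 24 x 12 x 2 map whose first channel equals the row index.
fmap = [[[float(h), 1.0] for _ in range(12)] for h in range(24)]
p_t = pgap(fmap, n_parts=6)  # six parts, each pooled from a (24/6) x 12 x C slice
```

With six parts, each vector in `p_t` summarizes four rows of the map, so the first part averages rows 0 to 3.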

FFLV and TBL.
To preserve the spatial position relationship of local features, a one-dimensional position encoding vector PE is added to the previous $P$. The specific formula is as follows:

$$Z = P + PE = \{p_1 + e_1, p_2 + e_2, \ldots, p_n + e_n, cls + e_{n+1}\}.$$

Then, $Z$ is input to the transformer encoder module to get the encoded output:

$$O = \mathrm{Transformer}(Z).$$

Finally, the encoder learns an output class token vector $O_0$ ($O_0 \in \mathbb{R}^{C_2 \times 1}$) that combines the key features of $p_1 \sim p_n$. $O_0$ passes through the FC layer to obtain a vector whose dimension clsNum is the number of pedestrian categories in the training set, and softmax and cross-entropy are then used to obtain a loss, in which $\mathrm{Softmax}(\mathrm{FC}(O_0))_i$ is the predicted value for pedestrian $i$ and $q_i$ is the true value.
Because each output vector of the final transformer encoder carries some global information, and in order to further utilize this information, we propose the FFLV module, which uses the learnable vector $F_{vt}$ to weight the output. It can be given by the following formula:

$$f = O^{T} F_{vt},$$

where $f \in \mathbb{R}^{C_2 \times 1}$, $O \in \mathbb{R}^{n \times C_2}$, and $F_{vt} \in \mathbb{R}^{n \times 1}$. Then, $f$ passes through an FC layer, which yields a vector whose dimension clsNum is the number of pedestrian categories in the training set, and softmax and cross-entropy are used to obtain another loss, in which $\mathrm{Softmax}(\mathrm{FC}(f))_i$ is the predicted value for pedestrian $i$ and $q_i$ is the true value. The total loss function combines the two losses, where $\lambda$ is the balance parameter used to balance their ratio.
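A small sketch may help make the FFLV weighting and the two-branch loss concrete. The weighted sum follows the stated shapes ($O$ is $n \times C_2$, $F_{vt}$ has $n$ entries). The way the two losses are combined is an assumption made here (a convex combination with weight $\lambda$); the source only states that $\lambda$ balances the two losses. The function names are hypothetical, and the FC layers are omitted, so the logits are taken as given.

```python
import math

def fflv(o, f_vt):
    """Weighted sum of the n encoder outputs: f = O^T F_vt (O is n x C2)."""
    dim = len(o[0])
    return [sum(f_vt[k] * o[k][c] for k in range(len(o))) for c in range(dim)]

def cross_entropy(logits, target):
    """Softmax followed by cross-entropy against a one-hot target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def two_branch_loss(logits_cls, logits_f, target, lam=0.7):
    """ASSUMED combination: lam * loss(O_0 branch) + (1 - lam) * loss(f branch)."""
    return lam * cross_entropy(logits_cls, target) + (1 - lam) * cross_entropy(logits_f, target)

o = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # n = 3 encoded part vectors, C2 = 2
f_vt = [0.5, 0.25, 0.25]                   # learnable weights (fixed here for illustration)
f = fflv(o, f_vt)
loss = two_branch_loss([2.0, 0.0, 0.0], [1.0, 0.0, 0.0], target=0)
```

In a trained model `f_vt` would be a parameter updated by backpropagation together with the two FC classifiers.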

Dataset.
In this section, we briefly introduce the datasets used to evaluate the model (as shown in Table 1).
Since Market1501 and DukeMTMC-reID are two well-known datasets in person ReID, they are mainly used for the ablation experiments. In addition, we run some simple performance experiments on CUHK03.
Market1501 contains 32,217 images and 1,501 pedestrian labels captured from six camera views, with 751 pedestrian identities in the training set and 750 in the test set. DukeMTMC-ReID contains 36,441 pedestrian images in total, covering 1,812 pedestrian identities captured by eight cameras; 702 identities are used as the training set and 1,110 as the test set. CUHK03 contains 7,264 images of 1,467 pedestrian labels; 767 pedestrian IDs are used as the training set and 700 as the test set. For these datasets, mAP and Rank1 are taken as the main evaluation indexes in this experiment.
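For readers unfamiliar with the two evaluation indexes, the sketch below shows how Rank1 and mAP are commonly computed from ranked gallery labels. It is a simplified illustration: the function names and the toy identities are hypothetical, and the usual ReID protocol detail of discarding gallery images from the same camera as the query is left out.

```python
def rank1_hit(ranked_labels, query_label):
    """1 if the top-ranked gallery image shares the query identity, else 0."""
    return 1 if ranked_labels and ranked_labels[0] == query_label else 0

def average_precision(ranked_labels, query_label):
    """Mean of the precision values at each rank where a correct match occurs."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def evaluate(queries):
    """queries: list of (query_label, ranked gallery labels) pairs."""
    n = len(queries)
    rank1 = sum(rank1_hit(r, q) for q, r in queries) / n
    mean_ap = sum(average_precision(r, q) for q, r in queries) / n
    return rank1, mean_ap

queries = [
    ("id7", ["id7", "id2", "id7", "id9"]),  # correct matches at ranks 1 and 3
    ("id2", ["id5", "id2", "id2", "id1"]),  # correct matches at ranks 2 and 3
]
rank1, mean_ap = evaluate(queries)
```

Rank1 only looks at the first retrieved image, while mAP rewards placing all correct matches near the top of the list.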

Implementation Details.
In terms of implementation, a transformer-based TIC feature extractor is embedded in the shallow layers of the ResNet-50 network according to the feature map dimensions; it contains a residual structure with double skip connections (as shown in Figure 4). The shallow layers of the network therefore carry global information, which is convenient for subsequent feature redirection. The remaining residual structure parameter settings are consistent with PCB. More details are shown in Table 2.
On the feature map with an output size of 24 × 12 × C, we apply PGAP (six blocks of size (24/6) × 12 × C in the comparison experiment with PCB [8]; eight blocks of size (24/8) × 12 × C in the comparison experiment with AlignedReID [9]). In the block optimization experiment, the map is divided into p = 4, 6, 8, 12, and 24 blocks, each of size (24/p) × 12 × C, giving 4, 6, 8, 12, and 24 local block features, respectively. After that, we add a class token with the same dimension as the block feature for subsequent classification and use a standard transformer structure to model the correlation between these block features. In the transformer implementation, we set the number of basic blocks (Figure 3(b)) to six, the number of heads in the multihead self-attention to eight, and the middle dimension of the two-layer MLP to 768. The PE is either set by Equations (5) and (6) or learned by the network. Then, the classification head class token $O_0$ is connected to the FC layer, whose output dimension equals the number of pedestrian IDs in the training set (751 in Market-1501). Finally, softmax and cross-entropy loss are used to train the CET network for pedestrian classification. In FFLV, the output vectors of the transformer are concatenated into a matrix $O$. The transpose of $O$ is then multiplied by a learnable weight vector $F_{vt}$ to obtain another fusion vector $f$, which is mapped to the pedestrian categories in the dataset through the FC layer. The number of categories (clsNum) equals the number of pedestrian IDs in the training set, and softmax and cross-entropy loss are again used to train the model. In the end, we introduce λ (λ = 0.7 in the experiment) to balance the TBL.
In training, we use an NVIDIA RTX-2080Ti GPU as the main experimental platform and implement the network with the Python-based PyTorch deep learning framework. The batch size is set to 64. The initial learning rate lr is set to 0.025 and is then divided by 10 every 10 epochs; the dropout is set to 0.5, and a total of 60 epochs are trained.

Convergence Speed.
As shown in Figure 5, we compare the convergence speed of PCB, refined part pooling (RPP), and the CET model under the same learning rate and other hyperparameters on the Market-1501 dataset. The results show that the CET model converges slightly faster than PCB and RPP. The main reason is that the TIC feature extractor is embedded in the network. In every forward pass of the internal transformer structure in TIC, all pixel tokens are covered in the calculation. TIC can therefore help our network extract global information at a shallow level and transfer it to the next layer through double residual connections (as shown in Figure 4). As a result, feature learning is more efficient, and convergence is naturally faster.
Self-attention in the transformer needs to calculate the dot product between the tokens in the feature map, which makes it slower than a CNN in terms of calculation time. However, since the tokenizer maps the feature map to tokens before it is sent to the transformer, the calculation is faster than sending pixel-level tokens directly to the transformer.
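The saving from tokenizing first can be seen from a rough count of query-key dot products, which grows with the square of the number of tokens. The numbers below are purely illustrative assumptions: the spatial size and the tokenizer output length L are hypothetical, since the source does not state them for the RT.

```python
def attention_pair_count(num_tokens):
    """Number of query-key dot products in one self-attention layer."""
    return num_tokens * num_tokens

pixel_tokens = 24 * 12   # one token per spatial position (illustrative size)
compact_tokens = 16      # hypothetical tokenizer output length L
speedup = attention_pair_count(pixel_tokens) / attention_pair_count(compact_tokens)
```

Under these assumed sizes the attention step needs a few hundred times fewer dot products, which is the effect the paragraph above describes qualitatively.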

Feature Redirection Results.
In PCB, cross-aliasing between block features is a known phenomenon, a problem inherent in the slice-and-block approach. To address it, the original PCB paper [8] proposes the RPP method, which effectively alleviates the aliasing. In theory, this aliasing phenomenon cannot be completely eliminated but only alleviated. In response to this problem, we use TIC's global grasp of shallow features to indirectly redirect features. After the network converges, we calculate the cosine similarity between each pixel-level feature and the part features $p_i$. The aliasing graphs of PCB and RPP are calculated using the same method and compared with our method. We found that the aliasing phenomenon is effectively alleviated; our experimental results on Market-1501 are shown in Figure 2. Compared with PCB and RPP, the main reason why CET can further alleviate the aliasing phenomenon is that the transformer structure inside TIC extracts global information in the shallow layers of the network. This global information is helpful for subsequent feature redirection.
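The aliasing visualization can be sketched as a nearest-part assignment: every pixel-level feature is labeled with the part vector it is most similar to under cosine similarity. The function names and toy vectors below are hypothetical, and the sketch assumes the part vectors come from PGAP on the same feature map.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_part_map(feature_map, part_vectors):
    """For every pixel feature in [h][w][c], return the index of the most similar part."""
    return [
        [max(range(len(part_vectors)), key=lambda i: cosine_similarity(px, part_vectors[i]))
         for px in row]
        for row in feature_map
    ]

parts = [[1.0, 0.0], [0.0, 1.0]]   # two toy part vectors
fmap = [[[0.9, 0.1], [0.2, 0.8]],
        [[0.1, 0.9], [0.7, 0.3]]]  # 2 x 2 map of pixel features
labels = closest_part_map(fmap, parts)
```

A pixel whose label differs from the horizontal slice it lies in is an instance of cross-aliasing; fewer such pixels means cleaner part features.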

Number of Blocks.
Among the slicing methods for person ReID, the optimal number of slices in PCB and PCB + RPP is six, while the DMLI algorithm (a shortest path algorithm) in AlignedReID [9] increased the optimal number of slices to eight. Since the transformer is better suited to fine-grained information extraction and fusion, CET directly divides the 24 × 12 × C tensor into 4, 6, 8, 12, and 24 blocks along the height direction. We then plot the mAP and Rank1 curves on the Market-1501 dataset and compare them with PCB and RPP (as shown in Figure 6). The mAP and Rank1 of CET show an overall upward trend as the number of slice blocks increases. With 12 blocks, the performance of the model matches that of PCB, and with 24 blocks, it exceeds that of PCB + RPP by 0.2-0.3 percentage points. So, for a feature map of 24 × 12 × C, our optimal number of blocks is the maximum number of horizontal slices, 24.
These results suggest that the transformer is indeed better at capturing the correlation between fine-grained features. In a CNN, model performance decreases once the number of slices exceeds a certain value, possibly because a CNN is not good at learning the correlation between finer-grained features in the deep layers of the network. On the other hand, training a CNN model does not require very large datasets, whereas the transformer, although good at learning correlations between fine-grained features, requires a large dataset. This paper therefore combines CNN with transformer to strike an effective balance: we horizontally slice the 24 × 12 × C tensor into blocks and model the correlation between these fine-grained block features, achieving competitive results.

Correlation of Each Part.
Generally speaking, if parts with relatively high correlation are the same body part, the parts are already aligned. We conduct an experiment on Market-1501 to examine this. As shown in Figure 7, part 1 and part 3 are relatively similar, and both are heads. After the model converges, we show the correlation of tokens between two images of the same pedestrian, each sliced into eight parts (as shown in Figure 7), and find that CET can indeed capture the correlation between the same human body parts. Compared with directly applying an alignment algorithm (such as DMLI in AlignedReID [9]), CET captures this similarity automatically and thus achieves the alignment effect indirectly.

Ablation Experiment.
In the ablation experiment, we compared the grad-cam heatmap corresponding to the PCB with transformer encoder and without transformer encoder in the dataset of Market-1501. The main method is to calculate the corresponding grad-cam heatmap of the global classification token vector output by the transformer encoder (the O 0 or f vector of the TBL module in Figure 3), and at the same time do the backward calculation and calculate the corresponding grad-cam heatmap in a certain block of PCB. The comparison results of these two heatmaps are shown in Figure 8.
It can be seen from the figure that a given classification token in PCB only pays attention to a certain part, while our model has a stronger global grasp. Besides, we conduct another ablation study of CET with different combinations of the RT, FFLV, and TBL components on the Market-1501 and DukeMTMC datasets. This ablation study shows that each component of CET contributes to the improvement of the network, and CET achieves its best performance when all components are added, as shown in Table 3.
Compared with several previous state-of-the-art methods, our method achieves competitive results on several datasets. Compared with PCB + RPP on Market-1501, we achieve 94.0% on Rank1, an increase of 0.2 percentage points, and 81.9% on mAP, an increase of 0.3 percentage points from 81.6%. Table 4 shows that the proposed CET network achieves the best Rank1 and mAP performance under single-query settings.
To verify the statistical significance of our model's performance, we also ran a Wilcoxon signed-rank test. We randomly select two sets of pictures, each covering 50 or more different pedestrians, from the more than 1,500 pedestrians in the Market-1501 dataset and input them into the trained network to retrieve two sets of rank-k results, y1 and y2. Passing y1 and y2 to stats.wilcoxon from the SciPy library gives a p-value of about 0.021, which is below the conventional 0.05 threshold; we therefore regard the result as statistically significant rather than accidental. The test was performed mainly on the Market-1501 dataset, for both accuracy and mAP. In other words, under the null hypothesis, a result at least this extreme would occur with a probability of only about 2%.
On the DukeMTMC-reID dataset, we achieve 83.2% and 70.2% for Rank1 and mAP, respectively. Although our Rank1 is slightly inferior, our mAP is slightly better than that of PCB + RPP. Table 5 shows the details of this comparison.
On CUHK03 dataset, our Rank1 and mAP are both improved to 64.0% and 60.8%, higher than PCB and AlignedReID. More detail can be seen in Table 6. All the obtained results show that our model is effective for the horizontal block slicing method in person ReID.

Conclusions and Outlook
In this work, we propose a CET architecture for person ReID, which captures the global information of features by using the transformer structure embedded in TIC in the shallow layers of the network. It alleviates the cross-aliasing phenomenon that exists in slice-and-block methods such as PCB and AlignedReID. In addition, the CET architecture uses the self-attention mechanism of the transformer to align the sliced component features indirectly and automatically, without any dedicated alignment or matching algorithm. Moreover, the TIC structure can be easily embedded into a CNN and is not complicated to implement. In future work, we will try to slice vertically or combine horizontal slices with the transformer to model more fine-grained features.

Conflicts of Interest
The authors declare that they have no conflict of interest.