Spatial-Channel Attention-Based Class Activation Mapping for Interpreting CNN-Based Image Classification Models

Convolutional neural networks (CNNs) have been applied widely in various fields. However, their use is hindered by their lack of explainability: users cannot know why a CNN-based model produces a certain recognition result, which is a vulnerability of CNNs from a security perspective. To alleviate this problem, this study first analyzes three existing feature visualization methods for CNNs in detail and presents a unified visualization framework for interpreting the recognition results of CNNs. Here, the class activation weight (CAW) is considered the most important factor in the framework. Then, the different types of CAWs are further analyzed, and it is concluded that a linear correlation exists between them. Finally, on this basis, a spatial-channel attention-based class activation mapping (SCA-CAM) method is proposed. This method uses different types of CAWs as attention weights and combines spatial and channel attention to generate class activation maps, which allows it to use richer features for interpreting the results of CNNs. Experiments on four different networks are conducted. The results verify the linear correlation between different CAWs. In addition, compared with existing methods, the proposed SCA-CAM effectively improves the visualization effect of the class activation map and is more flexible with respect to network structure.


Introduction
Deep-learning methods have made substantial progress in recent years, among which the convolutional neural network (CNN) has been very effective for tasks such as image classification [1,2], speech recognition [3], and natural language processing [4]. However, owing to the end-to-end "black box" nature of CNN-based models, the knowledge storage and processing mechanism of the middle layers remains unknown. Thus, the internal features and the basis of external decision-making by a CNN cannot be known clearly, which limits its application to some extent, especially in safety-critical domains. An increasing number of studies [5][6][7][8][9][10][11] have recently aimed to explore the "black box" model, focusing on the explainability of CNN decisions. The purpose is to allow a CNN to produce the decision result while itself providing the reason associated with the result.
This would provide the explainability and reliability needed to gain the trust of end users. Research on the interpretability of CNNs has advanced significantly in fields such as recommendation systems [12], intelligent medical treatment [13], and autonomous driving [14,15]. Feature visualization of a trained CNN model is a common way to display the features learnt internally and to explain the reasons behind CNN decision-making. The most direct approach is to visualize the feature maps of each layer [16], which allows visual observation of the features learnt inside the CNN. Zhou et al. [8] proposed a class activation mapping (CAM) method for interpreting CNN predictions, which inserts a global average pooling (GAP) layer into an ordinary CNN to construct an all-convolutional network. In this network, they correlated the CNN classification results with the features of the middle layer through a weighted summation over the last convolutional feature maps, which generates a class activation map that locates the features contributing most to a specific CNN prediction. Because it relies on the GAP layer, this method is referred to as GAP-CAM. The class activation map is a class-related heatmap: the highlighted areas in the map indicate the regions that activate a certain output class of the CNN. Selvaraju et al. [9] proposed an improved version, gradient-weighted CAM (Grad-CAM), to overcome the network-architecture limitation of GAP-CAM. Grad-CAM generalizes well to most CNNs and achieves better localization of salient features.
A detailed study of the above three feature visualization methods leads us to a new finding. We demonstrate that they are essentially the same, as all of them use channel attention on feature maps to generate the class activation map. The only difference among them is the attention weight used across channels. Based on this finding, this paper proposes a spatial-channel attention-based class activation mapping method, called SCA-CAM, to improve the visual effect and produce better heatmaps for interpreting CNN decisions. The contributions of this paper are as follows.
First, a unified feature visualization framework based on CAM is presented for interpreting CNN classification results. The framework summarizes the representations of three methods, namely feature map visualization, GAP-CAM, and Grad-CAM, and is therefore general enough to cover all of them.
Second, based on this visualization framework, we introduce the notion of class activation weight (CAW) with respect to class activation mapping for the first time. Through the analysis of different situations, the correlation between different CAWs under multiple pooling methods is systematically derived, and its important role in the generation of the class activation map is determined.
Third, to take advantage of different CAWs, a new visualization method called SCA-CAM is proposed. This method combines different CAWs through an attention mechanism and makes use of both the channel features and the spatial distribution features of the feature map to generate the class activation map. Experimental results show that, compared with the existing methods, it achieves better visualization effects. Furthermore, it is not limited by the network structure, thereby offering higher flexibility.

Related Work
The feature maps encoded by the hidden layers of a CNN at different levels have different focuses. Lower layers learn local basic features of the object, such as edges and lines, while higher layers learn global complex features, such as shapes and objects [10,16,17]. Therefore, the feature maps can be regarded as the feature space extracted from the input image. Visualizing the feature maps helps in understanding the internal representation of a CNN, and the feature maps at different layers have different applications in feature visualization.

Feature Map Visualization.
Direct visualization of feature maps helps to observe the representation of each middle layer of a CNN. As shown in Figure 1, there are two obvious objects in the original image. Figures 1(b) to 1(f) display the outputs of ResNet-18 [1] from the lower layers to the higher layers. The high-level feature representation is more abstract than the lower-level one. The feature map of the highest layer (Figure 1(f)) can locate salient features with semantic conceptual information, indicating that the feature learning of the network is effective. Figure 1(g) shows the result of feature map visualization, in which the feature map of the highest layer is overlaid on the original image. Feature map visualization directly sums the corresponding positions of all channels of the feature map to obtain a two-dimensional image. This is equivalent to assigning a weight of 1 to each channel, which means that every channel is treated as equally important to the decision result. Therefore, it cannot determine the relevance of these salient features to the current decision. In other words, feature map visualization is class-independent and cannot effectively explain the results of a CNN.

GAP-CAM.
In order to understand the decisions made by a CNN, Zhou et al. [8] used feature maps weighted by the softmax weights to generate a class-specific heatmap, that is, a class activation map. This heatmap can locate the discriminative features of the target regions that support the current classification result. Figures 2(c) and 2(d) show the heatmaps of ResNet-18 for "dog" and "cat," respectively, generated by GAP-CAM. The key regions are highlighted to indicate that the features of these regions are most relevant to the current decision.
To clearly describe the details of GAP-CAM, we use a basic CNN structure for comparison. Figure 3 shows the structure of VGGNet-16 [18], which contains 13 convolutional layers and 3 fully connected layers and takes a three-channel input image of size 224 × 224 × 3, where 224 denotes the height and width. The feature map of the last convolutional layer has size 7 × 7 × 512 (after the maxpooling layer). Figure 4 shows the structure of the modified VGGNet-16 based on GAP-CAM. Compared with the original VGGNet-16, the last maxpooling layer and the fully connected layers are removed from the modified network and, instead, a convolutional layer, a global average pooling (GAP) layer, and a softmax layer are added. The GAP layer averages each channel of the feature map into a single value. The yellow layer in Figure 4 indicates the added convolutional layer with K kernels, a kernel size of 3 × 3, a stride of 1, and a padding of 1. In this network, the process of generating the class activation map $H^c_w$ is shown by the dashed line. This process is a weighted sum of the channels of the highest-layer feature map, in which the weights are the softmax-layer weights of the neuron for a certain class.
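The weighted sum described above can be illustrated on a toy example. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `gap`, `class_score`, and `gap_cam` and all numeric values are hypothetical, and a two-channel 2 × 2 feature map stands in for the real 7 × 7 × K one. It also shows a property that follows directly from the linearity of GAP: the spatial mean of the heatmap equals the pre-softmax class score.

```python
def gap(channel):
    """Global average pooling: average a 2-D channel into a single value."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def class_score(feature_maps, class_weights):
    """Pre-softmax score of one class: sum over channels of weight * GAP(channel)."""
    return sum(w * gap(ch) for w, ch in zip(class_weights, feature_maps))


def gap_cam(feature_maps, class_weights):
    """Class activation map: per-pixel weighted sum of the channels."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * ch[i][j] for w, ch in zip(class_weights, feature_maps))
         for j in range(width)]
        for i in range(height)
    ]


# Toy data (made-up values): K = 2 channels of size 2 x 2.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
class_weights = [0.5, 2.0]  # hypothetical softmax-layer weights of one class

heatmap = gap_cam(feature_maps, class_weights)
score = class_score(feature_maps, class_weights)
heatmap_mean = sum(v for row in heatmap for v in row) / 4

print(heatmap)
print(score, heatmap_mean)
```

Because pooling and the classifier are both linear here, averaging the heatmap and scoring the pooled features are the same computation in a different order, which is why the heatmap can be read as a spatial decomposition of the class score.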

Grad-CAM.
Although GAP-CAM is simple, it is highly effective. However, its disadvantage lies in its dependence on the GAP layer, which is not included in all CNN structures. Therefore, the CNN structure must be modified as shown in Figure 4 before GAP-CAM can be used, which complicates its application. In addition, applying global pooling to the feature maps loses much semantic information, which degrades performance compared with the original CNN.
To remove this limitation of GAP-CAM on network structure, Selvaraju et al. [9] proposed Grad-CAM. Grad-CAM does not need to change the network structure; instead, it calculates the gradient of a certain class score with respect to each pixel of the feature map and subsequently averages the gradients within each channel to obtain a channel-wise weight. Figures 2(e) and 2(f) show the heatmaps for "dog" and "cat," respectively, generated by Grad-CAM. Figure 5 shows the process of generating the class activation map using Grad-CAM on VGGNet-16. Grad-CAM requires neither retraining the network nor updating its parameters, which significantly improves its efficiency.
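The two steps described above (pixel-level gradients, then a per-channel average) can be sketched without any deep-learning library. This is an illustrative sketch only: central finite differences stand in for backpropagation, and `toy_score`, its weights `V`, and the feature map values are hypothetical. The toy score is a linear layer applied to the flattened feature map, chosen to show that no GAP layer is needed.

```python
def numerical_gradient(score_fn, feature_maps, eps=1e-6):
    """Central-difference gradient of score_fn with respect to every pixel."""
    grads = []
    for channel in feature_maps:
        grad_channel = []
        for row in channel:
            grad_row = []
            for j, value in enumerate(row):
                row[j] = value + eps
                up = score_fn(feature_maps)
                row[j] = value - eps
                down = score_fn(feature_maps)
                row[j] = value  # restore the original pixel
                grad_row.append((up - down) / (2 * eps))
            grad_channel.append(grad_row)
        grads.append(grad_channel)
    return grads


def channel_weights(grads):
    """Average the gradients within each channel: one weight per channel."""
    return [sum(sum(row) for row in g) / (len(g) * len(g[0])) for g in grads]


def weighted_map(feature_maps, weights):
    """Weighted sum of the channels using the averaged gradients."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    return [
        [sum(w * ch[i][j] for w, ch in zip(weights, feature_maps))
         for j in range(width)]
        for i in range(height)
    ]


# Hypothetical classifier: one weight per pixel of the flattened feature map.
V = [[[1.0, 2.0], [3.0, 4.0]], [[-1.0, 0.0], [0.0, 1.0]]]


def toy_score(maps):
    return sum(
        V[l][i][j] * maps[l][i][j]
        for l in range(2) for i in range(2) for j in range(2)
    )


feature_maps = [[[0.2, 0.4], [0.6, 0.8]], [[1.0, 0.5], [0.5, 1.0]]]
gradients = numerical_gradient(toy_score, feature_maps)
weights = channel_weights(gradients)   # close to [2.5, 0.0] for this toy score
heatmap = weighted_map(feature_maps, weights)
print(weights)
print(heatmap)
```

In a real network the gradients would come from automatic differentiation rather than finite differences; the averaging and weighted-sum steps are unchanged.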

The Unified CNN Visualization Architecture Based on CAM.
The previous section shows that the three methods (feature map visualization, GAP-CAM, and Grad-CAM) all use a heatmap to highlight the key regions of the image in order to identify the features learnt by the CNN and interpret its outputs. As shown in Figure 6, their heatmap generation processes are basically the same: they all compute a weighted sum of the highest-level feature map channels and the corresponding weights. Here, the weight used in feature map visualization is a fixed value without class information, while the weights of the other two methods contain class-specific information.
The process shown in Figure 6 can be formulated as follows:

$$H^c_w = \sum_{l=0}^{K-1} w^c_l M_l. \tag{1}$$

Equation (1) represents the particular case where the CAW is $w^c = (w^c_0, w^c_1, \ldots, w^c_{K-1})$; here $M_l$ is the $(l+1)$th channel of the highest-level feature map, $c$ represents the class, and $K$ represents the number of channels. The same form applies to the other two visualization methods. Direct superposition of feature maps, as used in feature map visualization, is equivalent to setting the weight of each channel to 1. The CAWs used by GAP-CAM and Grad-CAM are not the same, resulting in different weights for each feature map channel. Therefore, different CAWs cause diverse visualization effects. From another perspective, feature map visualization, GAP-CAM, and Grad-CAM can all be regarded as methods that apply a channel attention mechanism to the feature maps and assign different attention weights to each channel. Different attention weight distributions therefore lead to different interpretation effects of the class activation maps.

Figure 4: Modified network structure of VGGNet-16 and process of GAP-CAM.
Figure 5: Process of Grad-CAM on VGGNet-16.
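The unified view described above can be sketched as a single function whose only varying input is the weight vector. This is a minimal illustration, not the authors' code; the name `class_activation_map` and all numeric values are hypothetical.

```python
def class_activation_map(feature_maps, weights):
    """Weighted sum over channels: heatmap[i][j] = sum_l weights[l] * M[l][i][j]."""
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(feature_maps, weights):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += weight * channel[i][j]
    return heatmap


# Toy feature map with K = 2 channels of size 2 x 2 (made-up values).
feature_maps = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]

# Feature map visualization: every channel weight is 1 (class-independent).
plain_sum = class_activation_map(feature_maps, [1.0, 1.0])

# A class-specific CAW (hypothetical values standing in for the softmax
# weights of GAP-CAM or the averaged gradients of Grad-CAM).
class_map = class_activation_map(feature_maps, [0.5, -1.0])

print(plain_sum)
print(class_map)
```

Swapping the weight vector is the only change needed to move between the three methods, which is the sense in which they share one framework.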

Class Activation Weight.
By comparing the above three methods, it is observed that the CAWs, $w^c$ in GAP-CAM and $g^c_{\mathrm{avg}}$ in Grad-CAM, play a key role in the generation of the class activation map and determine the effect of visualization to some extent. Therefore, to analyze the function of the CAWs used in GAP-CAM and Grad-CAM in detail, in this section, we first study the relationship between the two kinds of CAWs in a CNN structure with a GAP layer and subsequently remove the GAP layer for further analysis.

CNN CAW with a GAP Layer.
The GAP layer is commonly used in modern CNNs [1,19,20], where it often appears before the fully connected layer. It is also the core component of GAP-CAM. We choose a CNN with a GAP layer so that GAP-CAM and Grad-CAM can be unified in one network without modifying the network structure. In a CNN with a GAP layer, the feature extraction and classification process over the input image is illustrated in Figure 7.
Given an input image, the last convolutional feature map is obtained after feature extraction. Afterwards, it is fed into the GAP layer to obtain the feature vector $(m_0, m_1, \ldots, m_{K-1})$. Finally, the score vector $(y^0, y^1, \ldots)$ of all classes in the classification layer (before softmax) is obtained.
This process can be formulated as $M_l \xrightarrow{\mathrm{GAP}} m_l \xrightarrow{w^c_l} y^c$, where $M_l$ denotes the $(l+1)$th channel of $M$ $(l = 0, 1, \ldots, K-1)$, $m_l$ is the pooled value of $M_l$, and $y^c$ denotes the score of class $c$, which can be computed as follows:

$$y^c = \sum_{l=0}^{K-1} w^c_l m_l, \tag{2}$$

where $w^c_l$ represents the weight connecting $m_l$ and the neuron of class $c$ in the classification layer. According to the GAP process, $m_l$ can be computed as follows:

$$m_l = \mathrm{GAP}(M_l) = \frac{1}{H \cdot W} \sum_{i} \sum_{j} M_{l,ij}, \tag{3}$$

where $M_{l,ij}$ denotes the pixel at the spatial location $(i, j)$ in the $(l+1)$th channel, $H \times W$ is the spatial size of the feature map, and $\mathrm{GAP}(\cdot)$ represents global average pooling on the feature map $M_l$. From equations (2) and (3), the score $y^c$ depends on both the pixel values $M_{l,ij}$ of the feature map and the weight $w^c$ of the classification layer. At this point, the weight $w^c = (w^c_0, w^c_1, \ldots, w^c_{K-1})$ is just the CAW used in GAP-CAM.

Moreover, the CAW used in Grad-CAM can be obtained using the gradient-based backpropagation process. To achieve this, the score $y^c$ is backpropagated into the feature space of the last convolutional layer to compute its gradients with respect to the pixels in the feature map:

$$g^c_{l,ij} = \frac{\partial y^c}{\partial M_{l,ij}}, \tag{4}$$

where $g^c_{l,ij}$ denotes the gradient of the pixel at $(i, j)$ in the $(l+1)$th channel. Thus, the averaged gradient $g^c_{\mathrm{avg},l}$ of this channel is obtained as follows:

$$g^c_{\mathrm{avg},l} = \frac{1}{H \cdot W} \sum_{i} \sum_{j} g^c_{l,ij}. \tag{5}$$

Note that these gradients come from the derivatives of a specific class score and therefore contain features associated with the class. At this point, the average gradient $g^c_{\mathrm{avg}} = (g^c_{\mathrm{avg},0}, g^c_{\mathrm{avg},1}, \ldots, g^c_{\mathrm{avg},K-1})$ of each channel is just the CAW used in Grad-CAM.
From equations (2)-(5), the relationship between the two kinds of CAWs, $w^c$ and $g^c_{\mathrm{avg}}$, is given by

$$g^c_{\mathrm{avg},l} = \frac{1}{H \cdot W}\, w^c_l, \tag{6}$$

where $H \times W$ is the spatial size of the feature map. From equation (6), in a CNN with a GAP layer, there exists a linear relationship between the two kinds of CAWs. Intuitively, as illustrated in Figure 7, the forward process from the multichannel feature map to the score vector only includes a GAP operation, which is a linear calculation. Thus, the relationship between the two kinds of CAWs is linear. The class activation maps in Figures 2(c) and 2(e), and those in Figures 2(d) and 2(f), are quite similar, which is consistent with this linear correspondence.
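The linear relationship can be traced step by step with the chain rule. The following is a sketch of that derivation, writing $H \times W$ for the spatial size of the feature map (a notational assumption made here for illustration) and using only the definitions of the class score, global average pooling, and the averaged gradient given in the text:

```latex
\begin{aligned}
y^c &= \sum_{l=0}^{K-1} w^c_l\, m_l
     = \sum_{l=0}^{K-1} w^c_l \cdot \frac{1}{HW} \sum_{i,j} M_{l,ij}, \\
g^c_{l,ij} &= \frac{\partial y^c}{\partial M_{l,ij}} = \frac{w^c_l}{HW}, \\
g^c_{\mathrm{avg},l} &= \frac{1}{HW} \sum_{i,j} g^c_{l,ij}
     = \frac{1}{HW} \cdot HW \cdot \frac{w^c_l}{HW}
     = \frac{w^c_l}{HW}.
\end{aligned}
```

Every pixel of a channel receives the same gradient, so averaging does not change its value, and the two CAWs differ only by the constant factor $1/(HW)$.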

CNN CAW without the GAP Layer.
The GAP layer employs a special kind of pooling strategy in which the window size equals the feature map size. For other pooling methods in self-designed CNNs, such as average pooling and maxpooling, a smaller window size (e.g., 2 × 2 or 3 × 3) is usually selected to reduce the dimension of the feature map while retaining more semantic information. In this case, the relationship between the two kinds of CAWs is more complex and should be analyzed for different situations.
To facilitate analysis, a feature map of size 4 × 4 × 3 is used as an example. The detailed process is illustrated in Figure 8. The feature map is pooled using four different pooling methods to output feature vectors. Afterwards, the feature vectors are fed into the classification layer to obtain the binary classification scores, $y^0$ and $y^1$ (before softmax). In this process, the following four pooling methods are used in turn:

(A) GAP. The window size of pooling corresponds to the feature map size. According to the analysis in the previous subsection, the relationship between CAWs is as follows:

$$g^1_{\mathrm{avg},l} = \frac{1}{4 \times 4}\, w^1_l, \quad l = 0, 1, 2, \tag{7}$$

where there is a linear relationship between the two CAWs, whose coefficient is the reciprocal of the feature map size.

(B) Average pooling (2, 2)/2. The window size is set to (2, 2) and the stride to 2. Then, the score $y^1$ can be computed as follows:

$$y^1 = \sum_{s=0}^{11} w^1_s m_s, \tag{8}$$

where $m_0 \sim m_3$, $m_4 \sim m_7$, and $m_8 \sim m_{11}$ are obtained from the channels $M_0$, $M_1$, and $M_2$, respectively. According to the average pooling process, the values of $m_0 \sim m_3$ are computed as follows:

$$m_0 = \tfrac{1}{4}\left(M_{0,00} + M_{0,01} + M_{0,10} + M_{0,11}\right), \tag{9}$$
$$m_1 = \tfrac{1}{4}\left(M_{0,02} + M_{0,03} + M_{0,12} + M_{0,13}\right), \tag{10}$$
$$m_2 = \tfrac{1}{4}\left(M_{0,20} + M_{0,21} + M_{0,30} + M_{0,31}\right), \tag{11}$$
$$m_3 = \tfrac{1}{4}\left(M_{0,22} + M_{0,23} + M_{0,32} + M_{0,33}\right). \tag{12}$$

Similarly, $m_4 \sim m_{11}$ can be computed using the above process. From equations (8)-(12), we know that $y^1$ is obtained using the weights $w^1_s$ of the classification layer and the pixel values $M_{l,ij}$ of the feature maps. Therefore, the gradients of $y^1$ with respect to the feature map pixels are related to the weights of the classification layer. Using equations (4) and (5), the average gradients of each feature map channel are computed:

$$g^1_{\mathrm{avg},0} = \tfrac{1}{16}\left(w^1_0 + w^1_1 + w^1_2 + w^1_3\right), \tag{13}$$
$$g^1_{\mathrm{avg},1} = \tfrac{1}{16}\left(w^1_4 + w^1_5 + w^1_6 + w^1_7\right), \tag{14}$$
$$g^1_{\mathrm{avg},2} = \tfrac{1}{16}\left(w^1_8 + w^1_9 + w^1_{10} + w^1_{11}\right). \tag{15}$$

In this case, the CAW $g^1_{\mathrm{avg}} = (g^1_{\mathrm{avg},0}, g^1_{\mathrm{avg},1}, g^1_{\mathrm{avg},2})$ is a linear combination of the elements of the weight $w^1 = (w^1_0, w^1_1, \ldots, w^1_{11})$. The number of elements summed is the same as the number of elements in each feature map channel obtained from pooling, and the coefficient of the linear combination is still the reciprocal of the size of the feature map.

(C) Maxpooling (2, 2)/2. The window size is set to (2, 2) and the stride to 2. In this case, we reach the same conclusion as in case B.

(D) Average pooling (2, 2)/1. The window size is set to (2, 2) and the stride to 1. In this case, although gradients are superposed at the positions where windows overlap, we can still compute the average gradients of each channel of the feature map:

$$g^1_{\mathrm{avg},l} = \frac{1}{16} \sum_{s=9l}^{9l+8} w^1_s, \quad l = 0, 1, 2, \tag{16}$$

where the CAW $g^1_{\mathrm{avg}}$ is still a linear combination of the elements of $w^1$. Here, the number of summed elements and the coefficient are determined in the same way as in case B. Although the GAP layer was not used in the CNN, the above results show that there still exists a linear relationship between the two CAWs.

Security and Communication Networks
In this linear relationship, the CAW $g^1_{\mathrm{avg}}$ is always a linear combination of the elements of the CAW $w^1$, and the coefficient of the linear combination is always the reciprocal of the feature map size. In other words, the two kinds of CNN CAWs are always consistent. Therefore, it is natural to combine them to fine-tune the generation process of a class activation map for a better visualization effect.
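The four pooling cases analyzed above can be checked numerically on a single 4 × 4 channel. The sketch below is an illustration under stated assumptions, not the authors' experiment: the helper names `pool` and `average_gradient`, the pixel values, and the classifier weights are all hypothetical, and central finite differences stand in for backpropagation. In each case the averaged gradient should equal the sum of the classifier weights attached to that channel divided by 16.

```python
def pool(channel, size, stride, reduce_fn):
    """Slide a size x size window over a square channel; reduce each window."""
    n = len(channel)
    out = []
    for top in range(0, n - size + 1, stride):
        for left in range(0, n - size + 1, stride):
            window = [channel[top + a][left + b]
                      for a in range(size) for b in range(size)]
            out.append(reduce_fn(window))
    return out  # flattened pooled channel, row-major order


def mean(values):
    return sum(values) / len(values)


def average_gradient(channel, weights, size, stride, reduce_fn, eps=1e-6):
    """Mean over all pixels of d(score)/d(pixel), by central differences."""
    def score(ch):
        pooled = pool(ch, size, stride, reduce_fn)
        return sum(w * m for w, m in zip(weights, pooled))

    n = len(channel)
    total = 0.0
    for i in range(n):
        for j in range(n):
            original = channel[i][j]
            channel[i][j] = original + eps
            up = score(channel)
            channel[i][j] = original - eps
            down = score(channel)
            channel[i][j] = original  # restore the pixel
            total += (up - down) / (2 * eps)
    return total / (n * n)


# Made-up pixel values, all distinct so that maxpooling has a unique maximum.
channel = [[0.1, 0.7, 0.3, 0.9],
           [0.5, 0.2, 0.8, 0.4],
           [0.6, 1.1, 0.05, 1.3],
           [1.5, 0.35, 1.7, 0.65]]

w4 = [1.0, 2.0, 3.0, 4.0]               # hypothetical weights, 2 x 2 pooled output
w9 = [float(s) for s in range(1, 10)]   # hypothetical weights, 3 x 3 pooled output

results = {
    "gap": average_gradient(channel, [2.0], 4, 4, mean),
    "avg_2x2_stride2": average_gradient(channel, w4, 2, 2, mean),
    "max_2x2_stride2": average_gradient(channel, w4, 2, 2, max),
    "avg_2x2_stride1": average_gradient(channel, w9, 2, 1, mean),
}
for name, value in results.items():
    print(name, round(value, 6))
```

With these weights the weight sums are 2, 10, 10, and 45, so the averaged gradients come out close to 2/16, 10/16, 10/16, and 45/16. For maxpooling, the whole gradient of each window flows to its maximum pixel, yet the per-channel average is the same as for average pooling.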

Spatial-Channel Attention-Based CAM.
From the above analysis, we know that the role of a CAW is equivalent to that of a channel-wise attention weight: it performs adjustment across channels to synthesize a class activation map. Considering the consistency of the two CAWs, we propose a spatial-channel attention-based class activation mapping method called SCA-CAM. By combining spatial and channel attention, the positions and channels with high relevance to the current classification are strengthened, while those with low relevance are further suppressed. The process is shown in Figure 9.

Spatial Attention.
For a single channel of the highest-level feature map, the spatial distribution of semantic features varies enormously across pixel positions. These spatial distribution features cannot be well utilized by channel attention alone, as in GAP-CAM and Grad-CAM. Therefore, a spatial attention mechanism is adopted in this study to assign different weights to different positions of each channel and thereby take advantage of this spatial distribution. Specifically, by calculating the gradient of the class score with respect to each pixel in the feature map, a class-specific spatial attention weight matrix, namely, a pixel-level gradient matrix, can be obtained as follows:

$$g^c = \left(g^c_0, g^c_1, \ldots, g^c_{K-1}\right), \qquad g^c_l = \left[\frac{\partial y^c}{\partial M_{l,ij}}\right]_{H \times W},$$

where $g^c_l \in \mathbb{R}^{H \times W}$ denotes the $(l+1)$th channel of the gradient matrix and each element is the gradient of the score of class $c$ with respect to one pixel. This matrix contains both the important features of each spatial position and the features related to the output class, and it therefore serves as a pixel-level attention weight.

Channel Attention.
In the channel attention mechanism, each channel is regarded as a whole. Each channel corresponds to a different feature and contributes differently to different output classes. Therefore, different attention weights should be assigned to each channel when generating the class activation map:

$$w^c = \left(w^c_0, w^c_1, \ldots, w^c_{K-1}\right),$$

where $w^c_l \in \mathbb{R}$ represents the channel attention weight of the $(l+1)$th channel related to class $c$.

Combining Spatial and Channel Attention.
According to the unified framework presented in Section 3.1, the spatial attention weights $g^c$ and the channel attention weights $w^c$ are combined to generate the class activation map as follows:

$$H^c = \sum_{l=0}^{K-1} w^c_l \cdot \left(g^c_l \odot M_l\right), \tag{21}$$

where $M_l \in \mathbb{R}^{H \times W}$ denotes the $(l+1)$th channel of the feature map, $g^c_l$ represents the spatial attention weight matrix of this channel, $w^c_l$ represents the channel attention weight of this channel, and $\odot$ denotes element-wise multiplication. The order in which the two attention weights are applied does not affect the final result.
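In a CNN with a GAP layer, there is a linear relationship between the two CAWs, $w^c$ and $g^c_{\mathrm{avg}}$. Therefore, combining equations (5) and (6), equation (21) can be simplified as

$$H^c = \sum_{l=0}^{K-1} \left(\sum_{i} \sum_{j} g^c_{l,ij}\right) \cdot \left(g^c_l \odot M_l\right), \tag{22}$$

where $\odot$ denotes element-wise multiplication. In equation (22), both the spatial attention and channel attention weights are composed of gradients.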
In a CNN without the GAP layer, when avgpool (2, 2)/2 or maxpool (2, 2)/2 is adopted for pooling, the attention weight of the first channel can be obtained from equations (5) and (13):

$$g^1_{\mathrm{avg},0} = \frac{1}{H \cdot W} \sum_{i} \sum_{j} g^1_{0,ij} = \frac{1}{H \cdot W} \sum_{s} w^1_s, \tag{23}$$

where $s$ indexes the elements of the pooled feature map of this channel and $H \times W$ is the spatial size of the feature map. In this case, setting aside the coefficient $1/(H \cdot W)$, the channel attention weight $\sum_s w^1_s$ can still be replaced by the pixel-level gradients:

$$\sum_{s} w^1_s = \sum_{i} \sum_{j} g^1_{0,ij}. \tag{24}$$

Therefore, in this case, equation (22) still holds. Similarly, when avgpool (2, 2)/1 is used, equation (22) can still be derived from equations (5) and (16).
In conclusion, under the unified framework presented in Figure 6, SCA-CAM can be formulated using equation (22). It combines the advantages of the spatial and channel attention weights and integrates the representations of the two CAWs, under different pooling methods, into a unified form. As a consequence, there is no need to rely on the softmax weights, which simplifies the process while making use of more features.
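The combination described above can be sketched in a few lines. This is one possible reading rather than the authors' implementation: it assumes that the spatial attention is applied as an element-wise product between each gradient channel and the corresponding feature map channel, and that the channel attention weight is the mean gradient of that channel (a constant factor on the channel weight only rescales the heatmap). The function name `sca_cam` and all values are hypothetical, and the pixel-level gradients are supplied directly instead of being computed by backpropagation.

```python
def sca_cam(feature_maps, gradients):
    """Combine spatial and channel attention, both derived from gradients.

    feature_maps, gradients: lists of K channels, each an H x W nested list.
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for channel, grad in zip(feature_maps, gradients):
        # Channel attention: averaged gradient of this channel.
        channel_weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                # Spatial attention: pixel-level gradient times the pixel value.
                heatmap[i][j] += channel_weight * grad[i][j] * channel[i][j]
    return heatmap


# Toy inputs (made-up values): K = 2 channels of size 2 x 2.
feature_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 3.0], [2.0, 1.0]],
]
gradients = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]

heatmap = sca_cam(feature_maps, gradients)
print(heatmap)
```

Positions with a zero gradient are suppressed even inside a strongly weighted channel, which is the extra selectivity that channel attention alone does not provide.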
Note that, in the literature [6,21,22], a channel attention or spatial-channel attention mechanism was added to the CNN itself, and the attention weights were adjusted along with the network parameters to improve classification performance. In contrast, the proposed SCA-CAM only realizes visual interpretation of the CNN output. The attention weights used in this study are composed of gradients and can be obtained offline without network training. This is a further difference between SCA-CAM and those methods.

Results and Analysis
The pretrained image classification models used in the experiments are provided by the torchvision package [23,24] and include SqueezeNet [25], ResNet-18 [1], ResNet-50 [1], and DenseNet-161 [19]. These networks were trained to their best performance on the ImageNet dataset [26]. Their error rates are listed in Table 1. Theoretically, models with better performance show a stronger ability for feature representation and localization of key features. Because feature visualization aims to interpret the classification results of a pretrained CNN, network training is not required.

Visualization and Comparison of the CAWs.
CAW is important for generating the heatmap. As mentioned above, the CAWs can be divided into two types: in GAP-CAM, CAW denotes the weight of softmax layer, and in Grad-CAM, CAW denotes the averaged gradient of each channel for a particular class score. To obtain the CAW, given the input image shown in Figure 1, the predictions are shown in Table 2, where C denotes the class name and P denotes the corresponding probability.

Comparison of the Different CAWs for the Same Output Class.
The two types of CAWs of each network are illustrated in Figure 10. Taking SqueezeNet as an example, the weights corresponding to 50 channels were randomly selected from the 1000 channels. Because the gradient values are very small and differ greatly in magnitude from the weights of the classification layer, the average gradient values are multiplied by 100 when plotted to facilitate the comparison; this scaling does not affect the comparison. There are two types of CAW shown in Figure 10: (1) Softmax weight: the weight of a certain neuron (class) in the softmax classification layer, that is, the first CAW. (2) Average gradient: the average gradient of the feature map for a certain class, that is, the second CAW. Figures 10(a) and 10(b) show the two types of CAWs with respect to "tiger cat" and "bull mastiff," respectively, as classified by SqueezeNet. The horizontal axis represents the (randomly selected) feature map channels, and the vertical axis represents the values of the two kinds of CAWs for each channel. There is a clear correspondence between the two CAWs, and their numerical values always show the same fluctuation, indicating that a linear relation exists. Similarly, Figures 10(c)-10(h) show the corresponding CAWs of the other three networks, and a similar linear relation is observed. More precisely, to calculate the correlation coefficient between each pair of curves, the softmax weight is divided by the average gradient, which gives $\alpha_{\text{SqueezeNet}} = 1.72$, $\alpha_{\text{ResNet-18}} = 0.49$, and $\alpha_{\text{ResNet-50}} = 0.49$. Because DenseNet-161 adds a ReLU layer after the last convolutional feature map, the backpropagated gradient also passes through this layer, so the result is slightly different from those of the other three networks and does not reflect a strictly consistent correlation.

Figure 9: Process of SCA-CAM.
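The reported coefficient for the two ResNets can be related to the linear relationship derived for networks with a GAP layer. The following worked calculation is a sketch under two assumptions: the final ResNet feature map is 7 × 7 for a 224 × 224 input (the standard size for these architectures, not stated in the text), and the coefficient is computed against the average gradient after the ×100 plotting scale mentioned above:

```latex
\alpha
  = \frac{w^c_l}{100 \cdot g^c_{\mathrm{avg},l}}
  = \frac{w^c_l}{100 \cdot \dfrac{w^c_l}{HW}}
  = \frac{HW}{100}
  = \frac{7 \times 7}{100}
  = 0.49 .
```

Under these assumptions the value 0.49 is the feature map size divided by the plotting scale, which matches the coefficients reported for ResNet-18 and ResNet-50.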
Figure 11(a) shows the visualization of the softmax weight corresponding to the top three classes. Similarly, Figure 11(b) illustrates the average gradients corresponding to the top three classes. In Figure 11, for the CAW of the same type, the corresponding weight values of different output classes vary considerably on the same channel, indicating that the contribution of the channel to each output class is significantly different. Owing to the difference in the weight, the weighted summation between the weight and the feature map can produce different class activation region effects. Concurrently, a horizontal comparison of the weight curves corresponding to each class in Figures 11(a) and 11(b) further verifies the conclusions of the previous section.

Visual Effects of the Different Methods.
Here, we inspect the localization effect of the class activation map generated by SCA-CAM and compare it with those of other methods. For the same input image, the visualization effects of three methods, GAP-CAM, Grad-CAM, and SCA-CAM, are compared on four CNNs: SqueezeNet, ResNet-18, ResNet-50, and DenseNet-161. All four CNNs contain a GAP layer (or a layer with the same function) in their structure. Therefore, according to the analysis in Section 3.2, GAP-CAM and Grad-CAM can both be applied for comparison. Results are shown in Figure 12.
From a horizontal perspective under the same CNN, the localization effect of the proposed SCA-CAM is better than that of GAP-CAM and Grad-CAM. Because the attention weight of SCA-CAM contains two types of CAWs, this method offers better performance in distinguishing regions of interest.
From a vertical perspective, under the same feature visualization method, the localization effects on different networks are compared. In Table 1, the error rates of the four networks are in the following order: SqueezeNet > ResNet-18 > ResNet-50 > DenseNet-161. The results shown in Figure 12 indicate that the higher the accuracy of the network, the better the localization effect of the heatmap. Intuitively, an improved CNN makes the feature maps more focused on the target object and leads the network to learn more comprehensive features. Therefore, the heatmap generated on a CNN with high accuracy is better than that generated on a CNN with low accuracy.

Class Discriminative Visualization Using SCA-CAM.
The CAWs used by SCA-CAM are directly related to the output classes. Therefore, SCA-CAM can visualize the features of a specific class and locate the region of interest related to that class. Figure 13 shows the class activation maps generated for different output classes. In a class activation map, the image region most relevant to the specific class is highlighted. According to the results shown in Figure 13, the visualization effect is closely related to the output class, and the CAWs corresponding to various classes are significantly different. Therefore, the generated maps can realize the interpretation of specific output classes. Also, the visualization effect is independent of the score corresponding to the class. This means that the probability that an image belongs to a class does not influence its visual interpretation.

Ability of Localizing the Same Object Class.
Here, we select multiple images of the same class and visualize the key features among them to test the ability of SCA-CAM to locate similar objects in different images. The test images come from the ILSVRC 2012 dataset [26] and Tiny ImageNet [27]. Images selected from the Tiny ImageNet dataset are used to test the transferability of the proposed method. Figure 14 shows the results on different images belonging to four classes: "airliner," "hartebeest," "spider," and "butterfly." The results indicate that, for images in the same class, SCA-CAM can effectively locate the regions related to the target of that class. Even for an image with multiple objects, the regions corresponding to these objects can be located simultaneously. Furthermore, for targets with very similar contexts in some images, the proposed method can still find reasonable regions to explain the current classification results, indicating that SCA-CAM is promisingly robust for images with complex contexts.

Conclusions
In this paper, a unified CNN feature visualization framework based on CAM is presented. Under this framework, a detailed analysis of the CAWs in different pooling situations is conducted, and a consistent linear relationship between different CAWs is found. Furthermore, a spatial-channel attention-based class activation mapping method, SCA-CAM, is proposed. Considering both channel and spatial distribution features, the proposed method combines different CAWs as attention weights, which improves the visual effect of the class activation map. Compared with existing methods, the proposed method effectively improves the quality of class activation maps and can be applied to multiple CNNs. The interpretability of CNNs is significant for some special fields, such as smart medical care, financial lending, and autonomous driving. Only an interpretable and transparent CNN-based model can support their safe use. In the future, we will explore fine-grained interpretability by improving this method to reduce visual noise in heatmaps, and we will further study its applications in other fields.

Data Availability
The datasets used in the experiments are the ImageNet dataset and the Tiny ImageNet dataset. They can be downloaded at http://image-net.org/download.php and http://cs231n.stanford.edu/tiny-imagenet-200.zip.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.