Refined Division Features Based on Transformer for Semantic Image Segmentation



Introduction
In recent years, deep learning-based computer vision has developed rapidly in the fields of semantic analysis [1], image repair [2, 3], object tracking [4], super-resolution reconstruction [5], and object detection [6] and has attracted many researchers. Semantic segmentation is a fundamental technique in computer vision. Its role is to predict a category for each pixel and to learn the semantic and spatial information of each class, for example, the locations and classes of objects. Because semantic segmentation classifies images at the pixel scale, it typically needs rich contextual semantic information. With the advancement of convolutional neural networks (CNNs), particularly the fully convolutional network (FCN) [1], many researchers have focused on multiscale contextual features for semantic segmentation, for example, SegNet [7], DeconvNet [8], and the DeepLab series [9][10][11]. DeepLabv3 [9] and DeepLabv3+ [10] use atrous convolution with different dilation rates to extract multiscale contextual semantic information, which is not beneficial to dense segmentation; DeepLabv3 in particular uses atrous convolution excessively, which generates a gridding effect. Similarly, PSPNet [11] uses a pyramid pooling block consisting of adaptive average pooling at different scales to extract multiscale contextual semantic information, but it does not consider the relationship between a pixel and its neighboring pixels. Since PSPNet pixels lack rich contextual information, SA-FFNet [12] proposes VH-CAM and UC-PPM to enrich the context of each pixel. However, SA-FFNet feeds high-level feature maps into pyramid pooling, which inevitably causes some loss of effective information, and stacking multiple UC-PPM modules complicates the model structure. RELAXNet [13] proposes an AGF module that uses maximum pooling and average pooling, but the resulting spatial weights are not discriminative, so after they are multiplied with the feature maps, the feature maps cannot be effectively distinguished.
EncNet [14] uses a nonadaptive method to aggregate contextual semantic information and applies a homogeneous context extraction process to all pixels, which does not meet each pixel's need for different context dependencies.
An attention mechanism has recently been integrated into semantic segmentation networks; it computes the similarity of adjacent features to obtain contextual semantic information about pixels. For example, PSANet [19] aggregates the contextual semantic information at each position by predicting an attentional feature map. ANNet [20] improves semantic segmentation precision using long-distance asymmetric dependencies between pixels and neighboring pixels. OCRNet [21] calculates pixel-region relationships to improve pixel-region representations. DANet [22] calculates pixel-to-pixel distances over feature map channels and spatial locations to improve pixel representations. DADCNet [23] uses the SE structure to learn the relevant information of the feature maps between channels, so that the network's attention focuses on the useful feature maps.
Based on SA-FFNet, we used group convolution, adaptive maximum pooling, and adaptive average pooling to integrate more image and texture information into the channel attention map and used the channel attention map to update each adjacent channel's feature map. In addition, we used two fully connected layers for the feature maps. The two fully connected layers not only effectively combine the linear information between the channel feature maps but also establish information interaction between the channel feature maps.
With the advancement of computer technology and CNNs, automatic end-to-end segmentation is now possible. Small objects, such as pedestrians, traffic signals, and traffic signs, can be segmented more efficiently through the accurate segmentation of object regions from images. A network has finer spatial information at the lower levels, but its semantic coherence is poor. Feature maps at the high level of the network provide consistent semantic information, but their spatial information is coarse. To address this problem, literature [24] took advantage of color features and edge features to improve face tracking reliability. DFNet [25] adopts a V-shaped structure instead of a U-shaped structure to capture multiscale contextual semantic information. An SFS and FFF module is proposed by FSN [26] to extract important features and merge them adaptively.
In the refined division feature (RDF) block, we generate a channel attention matrix to distinguish the importance of feature channels, inspired by DFNet and FSN [26]. Feature maps extracted using conventional CNNs gradually decrease in resolution, and the limited perceptual field restricts global information and long-range pixel dependencies.
Since the introduction of the transformer [27] into computer vision, semantic segmentation has improved dramatically. The transformer is capable of capturing global information; it can compute dynamic weights between global pixels, and it can adapt dynamically to different input images. These properties are very useful for obtaining high-level semantic information, but they are only helpful if there are sufficient data to support the transformer.
Furthermore, the transformer cannot deal with fine details in images. This is particularly true for small targets at long distances in Cityscapes, which contains multiscale targets. CMT [28] solved this problem by combining deep convolution with the transformer, where deep convolution extracts local features to compress the channels and the transformer establishes global interdependencies between the patches. UniFormer [29] uses convolution and the transformer to extract local and global information, which effectively addresses the problems of redundancy and dependency in the networks' learning process.
Our RDF module not only adaptively distinguishes the information in the feature maps and removes redundant information but also establishes long-distance links between each channel's feature maps. Furthermore, we propose a module called the CTrans block, which includes depth-separable convolution, cross-attention, and self-attention. First, the RDF module processes input feature maps in parallel to establish cross-channel interactions and produce feature maps with different resolutions and depths; then, instead of the multilayer perceptron (MLP) in the transformer, depth-separable convolution is used to extract multiscale spatial information. Finally, global feature information is captured using the transformer's attention. Based on the number of parameters and the accuracy of RFTNet, we compared our network with some classical networks. As shown in Figure 1, our network outperforms other classical networks not only in the number of parameters but also in segmentation accuracy.
The main contributions of this paper are as follows.
(1) Because existing attention models do not consider the dependencies between feature map channels from a macroscopic perspective, this paper proposes the refined division feature (RDF) module, which can extract multiscale spatial information and establish long-distance channel dependencies. RDF is very flexible and scalable, and it can be applied to many computer vision network architectures.
(2) In existing models, transformer self-attention is used to establish long-distance dependencies between pixels to improve the accuracy of semantic segmentation, but this does not consider the advantages of combining convolution and the transformer. From a micro point of view, a transformer based on convolution (CTrans) is proposed in this paper, which can not only enhance the pixel representation and enrich the contextual semantic information but also establish global relationships between pixels.

Related Work
In this section, we briefly review related work on multiscale contextual feature information and the combination of convolution and the transformer.

Multiscale Feature Contextual Information.
To combine semantic and representational information, the FCN adapts a classification network into a fully convolutional network, but it ignores high-resolution feature maps, which degrades edge information.
To maximize both representation and semantic information, subsequent studies have improved the feature information; these can broadly be divided into two categories. The first is the "Encoder-Decoder style." For example, SegNet uses an encoder-decoder structure and preserves max-pooling indices. This approach not only saves memory but also increases segmentation speed and accuracy. MCRNet [30] uses deep context to guide multistage fusion. Its disadvantage is that, with a deep convolutional neural network such as ResNet, fusing feature maps of different stages causes model redundancy and slows inference. When the perceptual field is too small and the object is too large, missegmentation can occur because the network ultimately does not see the whole object. For example, CENet [31] proposes a contextual integration network for semantic segmentation. Its disadvantages are that it places high demands on hardware and software and that it easily overfits.
Furthermore, suppose that the object is too small and the perceptual field is too large. In this case, the network sees additional background and redundant information, resulting in misclassification, because the network has difficulty judging the tiny object. For example, Zhang et al. [32] proposed a method based on pyramidal consistency learning to improve segmentation accuracy. Its disadvantage is that it requires large computing resources, and its feature processing is not sufficient, which may lead to the loss of some detailed information. CCTseg [33] uses the prediction results obtained by DeepLabv3+. However, DeepLabv3+ upsamples a feature map by roughly 4 times and fuses it with the encoder's feature map, which easily causes resolution loss and is not conducive to semantic segmentation. To overcome this problem, DeconvNet proposes a deep convolutional network based on SegNet, where the encoder uses VGG-16 convolutional layers to learn and the decoder uses deconvolution with unpooling to upsample. RefineNet [34] and GCNet [35] combine the feature maps inherent in different stages of multiscale contexts, but they lack a consistent global context.
The second style is the "Backbone style," where DeepLabv2 uses atrous convolutions with different sampling rates on the input features to capture target and contextual semantic information at multiple levels. To solve the multiscale segmentation problem, DeepLabv3 developed cascaded and parallel atrous convolutions and an expanded ASPP, achieving good results without requiring dense CRF postprocessing. To extract multiscale contextual semantic information, PSPNet incorporates information at different scales into the PSP module. OCNet [36] uses a self-attentive method to learn pixel-to-pixel similarities; then, all features are aggregated using a similarity attention map to approximate an object's context. OCRNet enriches pixel-region representations by computing the relationships between pixels and regions. DANet's spatial attention module and channel attention module capture contextual information to improve pixel representations, which improves segmentation performance.
The two methods mentioned above have two disadvantages. First, the affinity matrix is calculated by comparing pixels with other pixels, but the contextual information about a single pixel is minimal; thus, the affinity matrix obtained is unsatisfactory. Second, these methods focus on developing complex attention modules, which inevitably involve more computation and cannot effectively establish long-distance dependencies. To reduce the complexity of the matrix computation and effectively fuse global and local information, we consider improving the correlation between affinity matrices and long-distance pixel dependencies. In this study, we propose two modules: RDF, for obtaining discriminative feature maps, and CTrans, for integrating global and local information and establishing long-distance interpixel relationships.

Combination of Convolution and Transformer.
As CNNs share weights and use local perceptual fields, they can effectively reduce the number of computational parameters while extracting spatial details. The translation invariance of a CNN also enhances its generalization ability. Nonetheless, the perceptual field of CNNs is limited, so they cannot capture global information or interdependencies between distant pixels.
On the other hand, the transformer has a strong ability to extract global information and expand the perceptual field, but it has two drawbacks: first, it is challenging to train, and second, it is not sensitive to fine details. To solve these problems, subsequent researchers combined the advantages of CNNs with those of the transformer. UniFormer uses both CNNs and the transformer to effectively address the redundancy of network learning as well as long-distance interdependencies between pixels. AMACF [37] combines a self-attention mechanism and a convolutional network to extract global and local information of the feature maps, respectively. TransUNet [38] and TransBTS [39] combine the transformer and UNet [40] and apply them to medical image segmentation, achieving very satisfactory results; their disadvantage is that they require a large amount of training data and computing resources to train and optimize, so they cannot be used in resource-constrained environments. TranSiam [41] combines depthwise separable convolution and the transformer, which merges the advantages of both and reduces the amount of computation; its shortcoming is that it uses multihead attention, which makes the model more complex and requires more computing resources and time. CoAtNet [42] combines convolution and attention, which can adapt to images of various scales and improve segmentation accuracy; its disadvantage is that multiple convolution kernels of different sizes are used, resulting in more model parameters, which easily leads to overfitting.
In addition, AMACF can adaptively distinguish the importance of feature maps according to the weights generated by the self-attention mechanism and the convolutional network, which makes the matrix operation simple.
The Conformer [43] uses a CNN to extract local features and a transformer to establish global relationships. Mobile-Former [44] extracts local features at the pixel level with efficient depthwise and pointwise convolution; by combining convolution with the transformer, it improves global interaction and reduces the number of randomly generated tokens. Using several convolution kernels frequently leads to heavy computation and random image splits, so patch encoding becomes unstable. Echt [45] suggested using a few convolutional kernels and transformers to encode patches, solving the problem of unstable network training and improving the segmentation results.
The CTrans module in this study builds on the above methods, and it comprises multilayer convolution, cross-attention, and self-attention. Within the CTrans module, convolution extracts local features in the early stage, cross-attention improves the pixel representation in the middle stage, and self-attention constructs global contextual semantic information about the pixels in the later stage; finally, the local features are combined with the global features.
Through convolution, local features are further extracted and the number of channels is reduced, resulting in dense segmentation with rich contextual semantic information for each pixel. CTrans combines the advantages of convolution and the transformer to capture global information and establish long-distance interdependencies among pixels.

Method
This study presents a segmentation model, RFTNet, that combines the RDF and CTrans modules. First, we describe RFTNet in Section 3.1; then, we introduce the RDF and CTrans modules in Sections 3.2 and 3.3, respectively.

RFTNet. Figure 2 gives a brief overview of refined division features based on transformer for semantic image segmentation (RFT).
First, we use ResNet101 to generate the feature maps of the fifth stage. Next, we use the RDF module to identify the importance of the feature maps: it integrates the spatial and channel information of the feature maps into grouped feature maps to obtain better information interaction between global and local channel attention and adaptively distinguishes channels according to their importance. Then, we use the CTrans module to extract the spatial details of the feature maps, improve the pixel representations using the similarity between pixels, and establish global information relationships. Finally, we use the FCN head [1] upsampling method to obtain feature maps of the same resolution from the multiscale feature maps.
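For clarity, the following is a minimal PyTorch-style sketch of this forward pipeline; the backbone, RDF, and CTrans components are simplified placeholders reflecting our reading of Figure 2, not the exact implementation.

```python
# Minimal sketch (assumed structure) of the RFTNet forward pass described above;
# backbone, rdf, and ctrans are simplified placeholder modules, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFTNetSketch(nn.Module):
    def __init__(self, backbone, rdf, ctrans, in_channels, num_classes):
        super().__init__()
        self.backbone = backbone   # e.g., a ResNet101 trunk returning stage-5 features
        self.rdf = rdf             # refines and re-weights channel groups
        self.ctrans = ctrans       # convolution + cross-/self-attention
        self.head = nn.Conv2d(in_channels, num_classes, kernel_size=1)  # stand-in for the FCN head

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = self.backbone(x)   # low-resolution, high-level feature maps
        feats = self.rdf(feats)    # distinguish channel importance
        feats = self.ctrans(feats) # spatial detail + global context
        logits = self.head(feats)
        # upsample the prediction back to the input resolution
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```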

Refined Division Features Module.
According to previous research [1-14, 46, 47], fusing multiscale features can improve semantic segmentation in images containing objects of different sizes. ResNet101 [48] produces multiscale feature maps of varying resolutions at each stage. Low-resolution feature maps contain more semantic information than high-resolution feature maps, but high-resolution feature maps contain more detailed spatial information.
In addition, large-scale objects contain weak semantic information after multiple downsampling because they have a limited perceptual field, whereas small objects contain clearer location information.
As shown in Figure 3, group convolution with different kernel sizes extracts multiscale feature map information over multiple branches without increasing the computational cost. Therefore, feature maps of different resolutions and depths can be obtained. For each branch, group convolution can learn independent multiscale spatial information, and the RDF approach incorporates adaptive global average pooling (GAP) and global max pooling (GMP) within the RDF module. The primary objective is to effectively encode the pertinent information from the feature map into the channel attention map. This enables the RDF module to discriminate features across varying scales or dimensions, enhancing its ability to accommodate the multiscale characteristics present in the data. A channel attention mechanism can assign different weights to each feature in a feature map, thereby generating more discriminative information. According to SA-FFNet, each pixel in a feature map lacks sufficient contextual semantic information; thus, SA-FFNet uses irregular convolutions for channel compression, obtaining pixels with robust contextual semantics. RELAXNet suggests that GMP can extract salient features from the feature maps.
As shown in Figure 3, in our approach, we extracted the spatial information from the input feature map using the multigroup convolution.
A feature map of the RDF module is given as f ∈ R^(C×H×W), where f represents the feature map output by the backbone network and C, H, and W represent the number of channels, height, and width of the feature map, respectively. After the group convolution operation, the channel dimension C is divided into C/4 in the RDF module. Through the group convolution method, input feature maps are processed simultaneously and the resolutions at different depths are compressed, resulting in richer feature map information and effective extraction of spatial information from each channel of the feature maps. The spatial information of these channels can also be linked via group convolution for information interaction across channels.
We combine adaptive maximum pooling and adaptive global average pooling to extract feature map information. Adaptive global average pooling encodes the spatial information of a feature map into a channel attention map, whereas maximum pooling removes redundant data to reduce computational cost and alleviate overfitting of the network. Following EPSANet [49], we add two fully connected layers, a nonlinear activation function (ReLU), and a softmax function to the pooled feature map.
Through these two fully connected layers, we can better combine the linear information between channels and improve the interaction between the channels. Next, the softmax function converts the fractional values in each channel into probability values, and those probability values are multiplied by the feature maps in each channel so that information can be extracted efficiently. The group feature maps and the attention weights of the RDF module channels are calculated as

G_i = φ(f, k_v, g_u),  B_i = σ(F_2(δ(F_1(GAP(G_i) + GMP(G_i))))),

where G_i ∈ R^(C/4×H×W), B_i ∈ R^(D_i×1×1), GAP stands for adaptive global average pooling, GMP stands for adaptive global max pooling, G_i represents the feature maps, and φ stands for the i-th group convolution, whose group size is g_u and convolution kernel size is k_v. In this block, F_1 represents the first fully connected layer, F_2 the second fully connected layer, δ the ReLU activation function, σ the softmax function, and B_i the channel attention map. The sigmoid and softmax activation functions are

sigmoid(D_i) = 1 / (1 + e^(−D_i)),  softmax(D_i) = e^(D_i) / Σ_j e^(D_j),

where D_i denotes the i-th channel weight value. Finally, the output feature map is computed as

E_i = B_i ⊙ G_i,

where E_i ∈ R^(C/4×H×W) denotes the feature map output of the RDF module at the i-th branch (i = 1, 2, 3, 4).
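For illustration, a hedged PyTorch sketch of one way to realize this computation is shown below; the summation of the GAP and GMP outputs, the bottleneck width of the two fully connected layers, and the per-branch kernel and group sizes (borrowed from the settings reported later in the ablation study) are assumptions rather than the exact configuration.

```python
# Hedged sketch of the RDF computation: group-convolution branches, GAP + GMP
# pooling (assumed to be summed), two FC layers, softmax weights, and re-weighting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RDFBranch(nn.Module):
    def __init__(self, in_ch, branch_ch, kernel_size, groups):
        super().__init__()
        # group convolution extracting one scale of spatial information (G_i)
        self.conv = nn.Conv2d(in_ch, branch_ch, kernel_size,
                              padding=kernel_size // 2, groups=groups)
        # two fully connected layers producing the channel attention map B_i
        self.fc1 = nn.Linear(branch_ch, branch_ch // 4)
        self.fc2 = nn.Linear(branch_ch // 4, branch_ch)

    def forward(self, f):
        g = self.conv(f)                                   # G_i
        pooled = (F.adaptive_avg_pool2d(g, 1) +            # GAP
                  F.adaptive_max_pool2d(g, 1)).flatten(1)  # + GMP (assumed fusion)
        b = F.softmax(self.fc2(F.relu(self.fc1(pooled))), dim=1)  # B_i
        return g * b.unsqueeze(-1).unsqueeze(-1)           # E_i = B_i ⊙ G_i

class RDFSketch(nn.Module):
    def __init__(self, in_ch=2048, kernels=(1, 3, 7, 9), groups=(1, 4, 8, 16)):
        super().__init__()
        branch_ch = in_ch // 4
        self.branches = nn.ModuleList(
            RDFBranch(in_ch, branch_ch, k, g) for k, g in zip(kernels, groups))

    def forward(self, f):
        # concatenate the four refined branches back to the original channel count
        return torch.cat([branch(f) for branch in self.branches], dim=1)
```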

Combining Convolution and Transformer Modules.
This section describes the combination of convolution and transformer (CTrans) modules in detail. Figure 4 illustrates the structure of the CTrans block, which comprises depthwise convolution, cross-attention [50], and self-attention. The depth-separable convolution is used to extract spatial information from a feature map and to improve the interaction between the feature map information on each channel. Through the depth-separable convolution, the convolutional properties of the transformer are improved while the computational cost is reduced. The CTrans module captures both the global and spatial details of a feature map. By replacing the transformer's MLP with depth-separable convolution, following CMT, we can reduce the computational cost of encoding features while maintaining high accuracy. Therefore, the CTrans module can extract multiscale features and reduce the independence of pixels on different channels by associating pixels with each other. CTrans divides the feature extraction process into two phases: local feature extraction and long-range interdependency creation, i.e., a phase for establishing global information relationships. Through multilayer convolution, the local feature extraction phase not only extracts more pixel spatial information but also reduces information loss. As shown in Figure 4, the input feature map of the CTrans module is F* ∈ R^(C*×H*×W*). In the CTrans module, considering that the feature maps of the Q_1 and K_1 branches are highly correlated, we can improve the similarity of V_1 by establishing relationships between them. We use cross-attention to calculate the similarity between Q_1 and K_1 to produce the spatial attention map; then, we apply the spatial attention map to weight V_1.
The cross-attention is calculated as

CAttn = softmax(Q_1 K_1^T / √d_{k_1}) V_1,

where CAttn ∈ R^(C*×H*×W*) denotes the cross-attentive output feature map, K_1, Q_1 ∈ R^(N×d_{k_1}), and N = H* × W* denotes the number of pixels in the feature map.
In the second step, we channel-compress the feature maps output by the depth-separable convolution and extract the spatial details. The depth-separable convolution, which includes point-by-point and depth-by-depth convolutions, produces a compressed feature map I ∈ R^(C**×H**×W**), where C** = C*/n, H** ≪ H*, and W** ≪ W*. The input feature map is filtered by the depth-by-depth convolution, and the input feature map channels are integrated by the point-by-point convolution.
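The depth-by-depth plus point-by-point structure can be sketched as follows; the compression factor n and the stride used to shrink H* and W* are illustrative assumptions.

```python
# Sketch of the depth-separable convolution that compresses the feature map;
# the compression factor n and the stride are assumed values for illustration.
import torch.nn as nn

class DepthSeparableConv(nn.Module):
    def __init__(self, in_ch, n=2, kernel_size=3, stride=2):
        super().__init__()
        # depth-by-depth convolution: one filter per channel (spatial filtering only)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=kernel_size // 2, groups=in_ch)
        # point-by-point convolution: 1x1 channel mixing that compresses C* to C*/n
        self.pointwise = nn.Conv2d(in_ch, in_ch // n, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```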
Finally, we use self-attention to establish the correlation between global pixels. CTrans combines the feature maps of the high-level and low-level stages. To generate segmentation results, we feed the fused features into a classifier. We compute the inner product of Q_2 and K_2 through self-attention and apply it to the V_2 branch to establish global pixel relationships. We then fuse the output feature map with the feature map Q_2 of the lower stage of the CTrans block to refine the segmentation.
The self-attention is calculated as

SAttn = softmax(Q_2 K_2^T / √d_{k_2}) V_2,

where SAttn ∈ R^(C**×H**×W**) denotes the self-attentive output feature map, K_2, Q_2 ∈ R^(N×d_{k_2}), and N = H** × W** denotes the number of pixels in the feature map. We stack the self-attentive output feature maps and the shallow feature maps Q_2 to produce robust feature maps that contain both spatial information and contextual semantics. We then stack the feature maps output by the CTrans module and input them to the category classifier:

Y = fcn_class(T),

where T represents the feature map output by the CTrans module at each stage and fcn_class represents the classifier. Y ∈ R^(K×H×W) represents the output result map, and K, H, and W represent the number of feature map channels, the height, and the width, respectively.
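Both attention steps reduce to scaled dot-product attention over the pixels of the feature maps. The sketch below illustrates this shared computation; the 1 × 1 convolutions producing Q, K, and V are assumptions about how the projections are formed, with cross-attention corresponding to x_q and x_kv drawn from different branches and self-attention to x_q = x_kv.

```python
# Sketch of the scaled dot-product attention underlying both CAttn and SAttn;
# the 1x1 projections producing Q, K, and V are assumed, not the authors' exact design.
import torch
import torch.nn as nn

def scaled_dot_product(q, k, v):
    # q: (N, HW_q, d), k, v: (N, HW_kv, d); softmax(Q K^T / sqrt(d)) V
    attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v

class MapAttention(nn.Module):
    """Attention over the pixels of 2D feature maps; cross-attention when x_q and
    x_kv come from different branches, self-attention when they are the same map."""
    def __init__(self, ch_q, ch_kv, dim):
        super().__init__()
        self.to_q = nn.Conv2d(ch_q, dim, kernel_size=1)
        self.to_k = nn.Conv2d(ch_kv, dim, kernel_size=1)
        self.to_v = nn.Conv2d(ch_kv, dim, kernel_size=1)

    def forward(self, x_q, x_kv):
        n, _, h, w = x_q.shape
        q = self.to_q(x_q).flatten(2).transpose(1, 2)    # (N, HW_q, dim)
        k = self.to_k(x_kv).flatten(2).transpose(1, 2)   # (N, HW_kv, dim)
        v = self.to_v(x_kv).flatten(2).transpose(1, 2)
        out = scaled_dot_product(q, k, v)                # (N, HW_q, dim)
        return out.transpose(1, 2).reshape(n, -1, h, w)  # back to a feature map
```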

The Cityscapes dataset contains 5,000 high-quality pixel-level annotated images of urban driving scenes, categorized into 30 categories. Of the 5,000 images, 2,975 were used for training, 500 for validation, and 1,525 for testing. The images were taken in 50 different cities. The dataset also contains 19,998 coarsely annotated images; here, we only used the finely labeled images of 19 categories.

Implementation Details.
To train the model on the Cityscapes dataset, we used stochastic gradient descent (SGD) [53] with a poly learning-rate decay strategy, in which the initial learning rate is multiplied by (1 − iter/max_iter)^power. For training and validation on the Cityscapes dataset, we used a learning rate of 0.0025, a momentum of 0.9, and a weight decay of 0.0005.
During the training and validation phases, we cropped the original images to 1024 × 512 for Cityscapes and 512 × 512 for PASCAL VOC 2012. The input image is randomly scaled by a factor of 0.5 to 2 and flipped horizontally during training for data augmentation. Our backbone network is ResNet101, pretrained on the ImageNet dataset [54]. For the PASCAL VOC 2012 and Cityscapes datasets, the batch sizes were 8 and 4 and the training epochs were 350 and 400, respectively. Following PSPNet, our models are optimized using two cross-entropy losses. The first loss function is applied to the output of the fourth stage of ResNet101 and the second to the model's output. Therefore, the total loss function is

l_total = l_model + λ · l_backbone_stage4,

where l_backbone_stage4 indicates the loss function at the output of the fourth stage of the backbone and l_model represents the loss function at the output of our model. λ is set to 0.4.
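A short sketch of the poly schedule and the two-term loss is given below; the decay power of 0.9 and the ignore_index of 255 for unlabeled Cityscapes pixels are assumed values that the paper does not state explicitly.

```python
# Sketch of the poly learning-rate decay and the two-term cross-entropy loss;
# power=0.9 and ignore_index=255 are assumptions, and the names are illustrative.
import torch.nn as nn

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    return base_lr * (1.0 - cur_iter / max_iter) ** power

criterion = nn.CrossEntropyLoss(ignore_index=255)

def total_loss(model_logits, aux_logits, target, lam=0.4):
    # l_model on the final prediction, l_backbone_stage4 on the stage-4 auxiliary output
    return criterion(model_logits, target) + lam * criterion(aux_logits, target)
```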

Evaluation Metrics.
In this paper, we use pixel accuracy (PA), intersection over union (IoU), and mean IoU (mIoU) as our evaluation metrics. They are calculated as

PA = Σ_i p_ii / Σ_i Σ_j p_ij,  IoU_i = p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii),  mIoU = (1 / (k + 1)) Σ_i IoU_i,

where PA represents the ratio of correctly identified pixels to the total number of pixels, IoU for each class is the intersection over the union of the true and predicted regions, and mIoU is obtained by first calculating the IoU for each category and then averaging. If there are k + 1 classes, p_ij represents the number of pixels that belong to class i but are predicted to be in class j. Therefore, p_ii is the number of correctly predicted pixels, and p_ij and p_ji denote false positives and false negatives, respectively.
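These metrics can be computed from a confusion matrix, as in the following NumPy sketch, which mirrors the definitions above; the function names are illustrative.

```python
# NumPy sketch of PA, per-class IoU, and mIoU computed from a confusion matrix.
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(conf):
    conf = conf.astype(float)
    tp = np.diag(conf)                                       # p_ii
    pa = tp.sum() / conf.sum()                               # pixel accuracy
    iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)    # per-class IoU
    return pa, iou, np.nanmean(iou)                          # mIoU = mean of per-class IoU
```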

Ablation Study.
In this section, we conduct ablation experiments to verify the effectiveness of our method. We validate the effectiveness of the RDF and CTrans modules on the PASCAL VOC 2012 and Cityscapes datasets through ablation experiments.

RDF Module. According to Figure 3, the RDF module can adaptively select features based on the various channels. In the group convolution module, the feature map channels are compressed to produce compressed feature maps at various spatial information scales. Adaptive maximum pooling can be used to extract salient feature information, whereas GAP can reduce the domain-size constraint to preserve more information. Adaptive maximum pooling can highlight the distinctive responses of some features, whereas average pooling can conserve more effective characteristics. As a result, combining adaptive global average pooling and adaptive maximum pooling can provide a rich set of information. The sigmoid and softmax functions generate the channel attention values, and the softmax function establishes long-range channel dependencies and calibrates the channel attention weights. To demonstrate the influence of adaptive average pooling, adaptive maximum pooling, the sigmoid function, and the softmax function on the experimental results, we successively add them to the RDF module. Table 1 shows the experimental results. In the first row, GAP combined with the sigmoid function achieves an mIoU of 78.11%, whereas in the second row, GAP combined with the softmax function achieves an mIoU of 78.72%, an improvement of 0.61% over the sigmoid combination. In the sixth row, after we include GMP, the corresponding metric reaches 78.88%, an improvement of 0.16% over the 78.72% in the second row. Accordingly, GMP improves performance by 0.16%. The Cityscapes dataset has 19 categories, and each category may appear in different scenes, so we embed the information extracted by adaptive maximum pooling and adaptive average pooling into the channel attention map. The channel attention map assigns different weights to each feature map on the channel, and the improved feature differentiation improves segmentation accuracy.
CTrans Block. We verify the performance of the CTrans module in the network using varying convolution kernel group sizes and convolution kernel sizes. We initially set the convolution kernel group sizes and the convolution kernel sizes to 1, 2, 4, and 8 and 1, 3, 5, and 7, respectively, achieving an mIoU of 77.95%. Next, we set the convolution kernel group sizes and convolution kernel sizes to 1, 4, 8, and 16 and 1, 3, 7, and 9, respectively, which achieves an mIoU of up to 78.88%, an improvement of 0.93% over 77.95%. Then, we set the convolution kernel group sizes to 1, 2, 2, and 8 and 1, 4, 4, and 16, respectively, with the corresponding convolution kernel sizes set to 1, 4, 4, and 16 and 1, 3, 3, and 9, respectively; the obtained mIoU is 77.76% and 77.62%, respectively. Considering the above data, we selected 1, 4, 8, and 16 and 1, 3, 7, and 9 as the parameters of the model.
The metric values decrease as the group size and convolution kernel size increase, as shown in Table 2. Using multiple convolutional groups can increase the speed of model training by allowing the model to be trained in parallel. However, training models in parallel and optimizing them with SGD can lead to slow convergence and poor accuracy, depending on the input image batch size. To fully exploit the multiscale information of the feature map, the group size and convolution kernel size must be appropriately increased. Table 2 shows that the model is optimal when we set the group sizes and convolution kernel sizes to 1, 4, 8, and 16 and 1, 3, 7, and 9, respectively.
In the CTrans block, we verify that the combination of convolution and transformer improves the model performance. Different feature maps on different channels carry interrelated information, so we compute the similarity of the feature maps on two branches to weight the feature maps on the third branch. In addition, we embed convolution in the transformer, enabling the CTrans module to extract spatial information as well as build global information.
We conduct ablation experiments to accurately verify the effect of embedding convolution. Table 3 shows that the result with depthwise convolution embedded in the transformer is 78.88%, whereas the result with the original transformer is 78.66%, a performance improvement of 0.22%. The results obtained when depthwise convolution replaces the MLP linear projection layer are therefore superior. Thus, convolution can extract local information and spatial location.

The multilayer perceptron (MLP) only performs a linear mapping and has no feature extraction function, so it is not sensitive to spatial details. According to the experimental results, depthwise convolution significantly improves the model results compared with the linear projection of the MLP. To evaluate the effectiveness of the cross-attention and self-attention modules, we used ResNet50 as the backbone network and observed the effect of the two modules on the network gain. Table 4 shows that using only cross-attention and only self-attention obtains mIoUs of 78.62% and 78.74%, respectively, indicating that self-attention is 0.12% more effective than cross-attention. Combining the two attention mechanisms achieves an mIoU of 78.88%, which is 0.14% higher than self-attention alone. Multiscale and horizontal-flip testing are further applied to the Cityscapes dataset, achieving mIoUs of 79.31% and 79.56%, respectively. Therefore, the more data there are, the more effective the transformer is.
To fully demonstrate the effectiveness of the CTrans module, we compared it with OCRNet's object contextual representation (OCR) module [23]. As shown in Table 5, CTrans segmentation is 0.12% higher than OCR. We visualize the segmentation maps of the CTrans and OCR modules in Figure 5. In the third column of the first row, the edge of the wine bottle is missing, whereas our segmentation result is complete. The human leg in the second row and third column is segmented as a horse, the middle of the cat in the third row and third column is segmented into different classes, and the cow in the fourth row and third column is segmented as a horse because of a lack of contextual semantics. This problem is called intraclass inconsistency. To address this issue, we designed the RDF module to effectively handle intraclass inconsistency, and the CTrans module further mitigates it.
In contrast, this issue does not arise in the CTrans segmentation. As shown in Table 5, despite having 6.8 M more parameters than OCR, CTrans achieves a 0.12% higher mIoU.
(1) Combining RDF and CTrans Modules. We cascaded the CTrans and RDF modules into the RFT network to obtain superior segmentation results. The CTrans module is composed of a depth-separable convolution module, a cross-attention module, a self-attention module, and an FCN head [1]. In addition, we replaced the transformer's MLP with depth-separable convolution so that the transformer can construct global information and convolutional features. To demonstrate the effects of the two modules, different experimental settings (Table 6) were used to show that adding the RDF and CTrans modules improves semantic segmentation. Compared with the dilated FCN, the RDF module improves mIoU by 6.36% and the CTrans module improves mIoU by 6.67%. When both modules are used, the mIoU reaches 78.88%. The results indicate that our method improves semantic segmentation very effectively. For further verification, we visualized the segmentation maps of the dilated FCN, RDF, and CTrans modules. As shown in Figure 6, the pole in the fourth column of the first row has a limited number of texture features, so the RDF module is introduced to improve feature differentiation; then, the CTrans module is used to improve the pixel-region representations. Pavement and grass sections in the second and third rows of the fourth column are divided into other categories by the RDF block, and the CTrans module exhibits the same problem as the RDF module. This is known as intraclass inconsistency. This paper combines the RDF and CTrans modules to solve this problem: the RDF block can extract important information and create information interactions between channels, whereas the CTrans module improves the pixel representation and creates global pixel relationships based on similar feature maps, which avoids the problem of intraclass inconsistency. The val column indicates whether the finely annotated Cityscapes validation set was used to train the model.

Comparison with Classical Semantic Segmentation Networks.
In this paper, we propose the RDF module, which adaptively distinguishes the importance of the feature maps, and the CTrans module, which builds long-distance pixel dependencies to improve pixel representation and construct global information relations.
To improve the segmentation results, we combined the RDF and CTrans modules into RFTNet. As shown in Figure 1, we compared RFTNet with some classical networks, including ANNet, PSPNet, GCNet, CCNet [55], DANet, and OCRNet. Table 7 shows that our network's segmentation mIoU is 81.9%, which is significantly better than that of the other methods.
As shown in Figure 1, we visualize the data and find that our mIoU is 81.9%, which is 0.1% higher than that of OCRNet. In addition, our method has more network parameters than GCNet, but its mIoU is 5% higher than GCNet's. Therefore, reducing the parameters of our network will be a topic for future studies.
In Figure 7, we visualize the segmentation result maps of the above networks: the other methods segment the bicycle wheels into different classes, whereas our network segments the bicycle completely. Similarly, the wall, truck, and car in rows 2, 3, and 4 are partially segmented into other classes. We combine the RDF and CTrans modules to first enhance feature differentiation and then enhance the pixel-region representation, which greatly alleviates these problems. In the fourth and fifth rows, the bars and bicycle front ends have few texture features, making them difficult to segment. With the help of our method, however, we can segment out the subtle bars and bicycle front ends.

Conclusion
This paper enhances the category region representation and the pixel representation from macro and micro aspects, respectively. Image-level context information is easily affected by the outside world, and context information from other categories can be introduced into the pixel representation, resulting in network misclassification. Inspired by this problem, from a macro point of view, this paper proposes the RDF module to enhance the representation of channel category regions in the feature map. To further enhance semantic segmentation performance, we design the CTrans module from a micro point of view. First, it compacts and enriches the feature map to reduce the computational load of the CTrans module; then, the similarity between pixels is used to enhance the pixel representation; finally, the global relationships between the pixels are established. The method in this paper can accurately segment object categories under conditions of illumination changes, similar colors, cluttered backgrounds, and so on. Compared with other methods, our segmentation metrics and segmentation results are optimal. However, the method in this paper has some limitations. On the one hand, the model's ability to segment the boundaries of small objects with fuzzy edges still needs to be improved. On the other hand, our model was trained, validated, and tested on the commonly used Cityscapes and PASCAL VOC 2012 datasets, both of which are finely annotated. The generalization ability of the model is not strong enough when faced with images that differ greatly from these two datasets. Therefore, we need to test and modify our model on more datasets to maximize its generalization ability.

Data Availability
The data that support the findings of this study are available from the author Tianping Li, but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. Data are, however, available from the author upon reasonable request and with the permission of the author Tianping Li.