J-Net: Asymmetric Encoder-Decoder for Medical Semantic Segmentation

With the development of deep learning, breakthroughs have been made in the field of semantic segmentation. However, it is difficult to generate a fine mask on medical images because they have low contrast, high resolution, and insufficient semantic information. Most existing approaches use pooling layers to reduce the resolution of feature maps. Therefore, it is difficult for them to consider the features of the whole image, resulting in information loss and performance degradation. In this paper, a multiscale asymmetric encoder-decoder semantic segmentation network is proposed. The network consists of two parts, which perform feature extraction and image restoration on the input, respectively. The encoder network obtains multiscale feature information by connecting multiple ASPP modules to form a feature pyramid. Meanwhile, the upsampling layer of each decoder is connected to the feature map generated by the corresponding ASPP module. Finally, the classification information of each pixel is obtained through the sigmoid function. The performance of the proposed method is verified on publicly available datasets. The experimental evidence shows that the proposed method can take full advantage of multiscale feature information and achieve superior performance with less inference computational cost.


Introduction
Since AlexNet [1] was proposed in 2012, computer vision has made breakthroughs in image classification, target detection [2,3], and semantic segmentation [4]. Meanwhile, computer-aided diagnosis systems based on convolutional neural networks have been widely used in many medical image analysis tasks.
Segmentation is one of the most important and popular tasks in medical image analysis. It plays a vital role in disease diagnosis, surgical planning, and prognostic evaluation, and improving the efficiency of doctors' diagnoses is urgently needed to save patients' lives. There are many medical image segmentation methods and theories, including those based on edges, thresholds, region growing, statistics, graph theory, active contours, information theory, fuzzy set theory, and neural networks. Owing to the large gains in computing power, neural networks have gradually become the preferred technology for semantic segmentation tasks and have achieved excellent results in various competitions.
From [5,6], it can be seen that the features of the lower layers (such as the outputs of layers 1, 2, and 3) are biased toward the basic units of the image, such as points, lines, and edge contours, while the high-level semantic features come from layers 4 and 5. These are more abstract and closer to the semantic information of the image, resembling regions. Based on this understanding, the focus of a semantic segmentation network is how to better combine high-level semantic information with low-level feature information. In medical image segmentation tasks, FCN [6] and U-Net [7] are the mainstream network models. The other architectures [8][9][10][11][12] use different mechanisms (long skip connections, pyramid pooling, and so on) as part of the decoding mechanism. The difference between these architectures lies mainly in the decoder network. The task of the decoder is to project the discriminative features (lower resolution) learned by the encoder onto the pixel space (higher resolution) to obtain a dense classification. Semantic segmentation therefore requires not only discrimination at the pixel level but also the features learned at the different stages of the encoder.
In some recent work, RSANet [13] employs a residual semantic-guided attention mechanism (RSAM) to fuse the multiscale features from LCNet and efficiently improve detection performance. In ALNet [14], the encoder adopts a novel residual module to abstract feature representations. Swin Transformer [15] introduced the transformer into semantic segmentation to increase the model's ability to capture long-distance information.
To reduce the computational complexity and improve the training speed, the traditional encoder-decoder uses many downsampling steps, which causes some small features to be lost and makes it impossible to classify each pixel accurately. To solve the problem of information loss, we propose a new multiscale fusion asymmetric encoder-decoder network (J-Net) that enhances the use of multiscale feature information while shortening the information flow path. The contributions of our work are threefold: (i) An asymmetric encoder-decoder network is proposed to further reduce the loss of information in the downsampling process. (ii) A new connection method between the encoder network and the decoder network is proposed to reduce unnecessary information flow. (iii) Experimental results show that our framework better integrates features of different scales and achieves excellent performance with less computational cost.
The rest of this article is organized as follows. In Section 2, several backbone architectures used in modern semantic segmentation are reviewed. In Section 3, the J-Net structure and its concept are proposed. Section 4 compares J-Net with FCN [4], DeepLab [16][17][18][19], and U-Net [7] and verifies the effectiveness of J-Net. Section 5 summarizes the advantages and disadvantages of J-Net and its possible extension to other fields.

Related Work
Since FCN was proposed in 2015 [4], convolutional neural networks have made considerable progress in the field of image segmentation. It has been shown that the main factors affecting image segmentation are how to expand the local receptive field and how to limit the loss of features during downsampling. Some mainstream techniques for obtaining global information are as follows (see Figure 1 for illustration).

Upsampling.
The biggest change in FCN [4] compared with classification neural networks is that convolutional layers are added at the end of the classification network so that a two-dimensional feature map is obtained, followed by softmax to obtain the classification information of each pixel. This was the beginning of using convolutional neural networks to solve semantic segmentation problems. However, direct upsampling of this kind uses only high-level semantic information and ignores low-level features, which degrades the segmentation result.

Encoder-Decoder.
The encoder-decoder is a concept from the NLP field. It is not a specific algorithm but a framework for solving problems. The model consists of two parts: the encoder network (feature extractor) and the decoder network (generator). The image features are extracted by stacking multiple feature extraction blocks (Conv + BN + ReLU), and the local receptive field is expanded by repeated downsampling to obtain larger global features. The function of the decoder network (generator) is to transform the high-level feature vector into the target vector. This method also has defects. For input with a large amount of information, the encoding process leads to a loss of information.
To solve this problem, many researchers have made various attempts. For example, U-Net [7] uses skip connections, so that the feature map of each convolution layer is concatenated to the corresponding upsampling layer. In SegNet [20], the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform nonlinear upsampling. In addition to these two methods, there are other variants, such as upsampling with a fixed (sparse) index array or upsampling by replication. However, these methods consume more memory, require a longer convergence time, and do not perform well.
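The index-based upsampling attributed to SegNet above can be illustrated with a small one-dimensional sketch. It is not code from SegNet or from this paper. The function names `max_pool_with_indices` and `max_unpool` are hypothetical, the window is assumed to be non-overlapping, and positions that were not a maximum are assumed to be filled with zeros.

```python
def max_pool_with_indices(values, window=2):
    """Non-overlapping 1-D max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for start in range(0, len(values) - window + 1, window):
        block = values[start:start + window]
        # Offset of the largest element inside the current window (first one on ties).
        offset = max(range(window), key=lambda j: block[j])
        pooled.append(block[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, output_length):
    """Place each pooled value back at its recorded position; all other positions stay 0."""
    restored = [0] * output_length
    for value, index in zip(pooled, indices):
        restored[index] = value
    return restored


if __name__ == "__main__":
    pooled, indices = max_pool_with_indices([1, 3, 2, 0, 5, 4])
    print(pooled, indices)
    print(max_unpool(pooled, indices, 6))
```

The decoder only needs the stored indices rather than the full encoder feature map, which is the design trade-off against concatenating whole feature maps as U-Net does.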

Atrous Convolution.
Repeated downsampling leads to information loss, which makes the network miss smaller targets when performing detection tasks and affects its final results. In DeepLab [16][17][18][19], atrous convolution was proposed. Atrous convolution has two functions: one is to control the receptive field and the other is to adjust the resolution. Firstly, by adjusting the hole convolution rate, the receptive field of the convolution kernel increases. Secondly, by setting the stride size, the hole convolution can increase the receptive field and reduce the resolution.
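The two effects described above can be checked with standard convolution arithmetic. The following is a minimal sketch, not code from this paper. The helper names `effective_kernel_size` and `conv_output_size` are hypothetical, and the output-size formula is assumed to follow the usual floor-division convention of common deep learning frameworks.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a kernel once (dilation - 1) holes are inserted between its taps."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_output_size(input_size: int, kernel_size: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output length along one axis of a (possibly dilated) convolution."""
    k_eff = effective_kernel_size(kernel_size, dilation)
    return (input_size + 2 * padding - k_eff) // stride + 1


if __name__ == "__main__":
    # A 3-tap kernel covers a wider span as the rate grows, with the same 3 weights.
    for rate in (1, 2, 4):
        print("rate", rate, "span", effective_kernel_size(3, rate))
    # Stride 1 with matching padding keeps a 512-wide map at 512; stride 2 halves it.
    print(conv_output_size(512, 3, stride=1, padding=2, dilation=2))
    print(conv_output_size(512, 3, stride=2, padding=2, dilation=2))
```

With stride 1 and padding equal to the dilation rate, a 3 × 3 atrous convolution enlarges the receptive field while leaving the resolution unchanged. Raising the stride lowers the resolution in the same way as for an ordinary convolution.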

Method
To solve the problem of information loss caused by repeated downsampling, we propose an asymmetric multiscale encoder-decoder model. The proposed network is a modified FPN [21]. We change the way global features are obtained and reduce the proportion of encoders.

Network Architecture.
The network architecture is shown in Figure 2. The multiscale feature fusion asymmetric encoder-decoder network consists of two parts: a larger encoder and a smaller decoder. This asymmetric structure, which uses ResNet as the encoder, reduces the redundancy of the network. At the same time, ASPP is performed on feature maps of different scales to obtain multiscale information, followed by softmax to obtain the classification information of each pixel. The feature extraction module consists of a 3 × 3 convolutional layer, a BN layer, and a ReLU activation layer. In the feature recovery module, the input is the feature pyramid generated by applying ASPP to two feature maps, which are merged through concatenation and passed as input to the next feature recovery module. The ASPP operation uses hole convolution to obtain feature pyramids on the same feature map and at the same time uses the stride to keep the size of the output feature map unchanged, which is conducive to the fusion of multiscale features. The encoder network uses hole convolution to obtain a feature map of a specific scale and sets different hole convolution rates to obtain a larger local receptive field without losing feature information. The decoding stage does not need to be the same size as the feature extraction stage because the encoder network no longer downsamples multiple times.

Asymmetric Encoder-Decoder.
U-Net is one of the earliest algorithms to use a fully convolutional network for semantic segmentation. The symmetrical encoder-decoder structure used in that paper, including the compressed path and the extended path, was innovative at that time, and it affected the design of subsequent segmentation networks to a certain extent. The purpose of the symmetrical structure is to fully integrate feature information of different scales. The network architecture is illustrated in Figure 3(b).
In conventional computer vision tasks, the image resolution is mainly compressed through a pooling layer or a convolutional layer (stride=2). Similarly, in semantic segmentation or target detection, the main purpose of the compressed path is to expand the local receptive field of the convolution kernel to obtain global information. Other methods can also be used to obtain global features, such as a larger convolution kernel or hole convolution.
In this paper, we choose to stack multiple hole convolutions to reduce the use of downsampling. Using this method, the network can obtain a larger local receptive field without losing information, reduce the size of the decoder, and improve the training speed. The network architecture is illustrated in Figure 3(a). The encoder is used as the feature extractor. Different from the use of VGG [22] as the backbone in FPN [21], here we use ResNet [23] as the backbone network. The use of a large number of skip connections greatly improves the utilization rate of low-level features and allows the encoder network to be deeper. As a mask generator, the main purpose of the decoder is to restore the highly abstract feature vector to a mask image with the same size as the original image. In order to strengthen the high-level semantic features, we redesigned the encoder. The decoder consists of the repeated application of 3 × 3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a batch normalization unit (BN). Following the feature extractor, we use a dropout unit (Dropout) to prevent overfitting. In the final upsampling step, we restore the feature map to the same size as the input.

Multiscale Feature Map.
In the field of object segmentation, one of the most basic principles is that the larger the receptive field of the final predicted pixel, the more contextual information is captured and the more accurate the prediction.
To obtain a larger receptive field, the mainstream methods are to use large convolution kernels, to use hole convolution, and to stack downsampling layers that reduce the feature resolution so that convolution kernels of the same size cover larger local receptive fields.
A large convolution kernel, due to its size, consumes considerable computing resources in network training. It is proposed in AlexNet [1] that multiple small convolution kernels can be connected in series to achieve the same receptive field as a large convolution kernel. However, the effective receptive field obtained using multiple stacked small convolution kernels is different from the theoretical receptive field [17]. Because of this gap between the effective receptive field and the theoretical receptive field, feature information of the detected target is lost.
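The theoretical receptive field referred to above can be computed layer by layer. The sketch below uses the standard recurrence and is not code from this paper. The function name `stacked_receptive_field` and the `(kernel_size, stride, dilation)` tuple layout are assumptions made for illustration. It covers only the theoretical value, since the effective receptive field depends on the trained weights.

```python
def stacked_receptive_field(layers):
    """Theoretical receptive field of a stack of convolutions along one axis.

    layers: iterable of (kernel_size, stride, dilation) tuples, input side first.
    """
    receptive_field = 1  # a single input pixel
    jump = 1             # distance, in input pixels, between neighbouring outputs
    for kernel_size, stride, dilation in layers:
        receptive_field += (kernel_size - 1) * dilation * jump
        jump *= stride
    return receptive_field


if __name__ == "__main__":
    # Two and three stacked 3x3 convolutions match one 5x5 and one 7x7 kernel.
    print(stacked_receptive_field([(3, 1, 1)] * 2))
    print(stacked_receptive_field([(3, 1, 1)] * 3))
    # A stride-2 layer or a dilation rate of 2 widens the field of the next 3x3 layer.
    print(stacked_receptive_field([(3, 2, 1), (3, 1, 1)]))
    print(stacked_receptive_field([(3, 1, 1), (3, 1, 2)]))
```

In this arithmetic, downsampling and dilation enlarge the receptive field by the same factor, but only dilation keeps the feature-map resolution.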
To solve this problem, we connect multiple atrous convolutions in parallel to form a spatial pyramid (ASPP) [16][17][18][19]. ASPP uses atrous convolutions with different dilation rates to perform different convolution operations on the feature map. It obtains a receptive field that exceeds the size of its convolution kernel without increasing the number of parameters. Most importantly, the size of the feature map does not change after the hole convolution operation, so ASPP does not affect the original feature extraction operation of the encoder.
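The size-preserving property of ASPP can be illustrated in one dimension. This is a simplified sketch, not the implementation used in this paper. The names `dilated_conv1d_same` and `aspp_1d` are hypothetical, an odd kernel length, stride 1, and zero padding are assumed, and one fixed kernel is reused for every branch purely for illustration. In a real ASPP module, each branch has its own learned weights.

```python
def dilated_conv1d_same(signal, kernel, dilation):
    """1-D atrous convolution (cross-correlation), stride 1, zero padding, same output length."""
    half = (len(kernel) - 1) // 2  # assumes an odd kernel length
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            pos = i + (j - half) * dilation  # taps are spaced `dilation` samples apart
            if 0 <= pos < len(signal):       # positions outside the signal count as zero
                acc += weight * signal[pos]
        output.append(acc)
    return output


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Apply the kernel at several dilation rates in parallel; one output branch per rate."""
    return [dilated_conv1d_same(signal, kernel, rate) for rate in rates]


if __name__ == "__main__":
    branches = aspp_1d([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1])
    for rate, branch in zip((1, 2, 4), branches):
        print(rate, len(branch), branch)
```

Every branch has the same length as the input, so the branches can be stacked as channels and fused without any resizing.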

Feature Fusion.
The feature pyramid is currently an important part of target detection, semantic segmentation, behavior recognition, and related tasks, where it substantially improves model performance. References [9][10][11][12][24][25][26][27] demonstrated various methods for constructing feature pyramids. The feature pyramid contains feature maps of different scales, so targets of different sizes have appropriate feature representations at the corresponding scales. By fusing multiscale information, targets of different sizes can be predicted at different scales, which improves the performance of the model. Therefore, the most important thing in determining the mask is to obtain both high-level semantic information (position, category, and so on) and low-level features (shape, color, and so on).
The serial use of ASPP allows us to obtain a richer view and combine feature information of different scales. We perform the atrous convolution operation on the advanced edges of the mask. This multiscale feature fusion shortens the information flow path, increases the number of information flow paths between the encoder and decoder, and finally achieves the repeated use of important features.

Datasets.
The liver segmentation dataset has two sets, a training set and a test set, and the size of all images in the dataset is 512 × 512. The training set has 400 liver CT images with the corresponding segmentation templates, and the test set has a total of 20 liver CT images with the corresponding segmentation templates. There are two categories in the segmentation templates (liver and background).
The EM dataset has two sets, a training set and a test set, and the size of all images in the dataset is 512 × 512. The training set has 90 EM images with the corresponding segmentation templates, and the test set has 30 EM images with the corresponding segmentation templates. There are two categories in the segmentation templates (foreground and background).

Implementation Details.
We use the Adam algorithm as the optimizer, with the initial learning rate set to 0.001. When gradient descent is used to optimize the objective function, the learning rate should become smaller as the loss approaches the global minimum so that the model can get as close as possible to this point. Cosine annealing [28] reduces the learning rate by means of the cosine function. The principle of cosine annealing is as follows:
$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right),$$
where $i$ is the number of runs (index value); $\eta_{\max}$ and $\eta_{\min}$, respectively, represent the maximum and minimum values of the learning rate and define the range of the learning rate; $T_{cur}$ indicates how many epochs have currently been executed; and $T_i$ indicates the total number of epochs in the $i$-th run.
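The cosine annealing schedule can be sketched in a few lines. This is a minimal illustration and not the training code of this paper. The function name `cosine_annealing_lr` is hypothetical, the maximum learning rate of 0.001 and the 300 epochs are taken from the text, and the minimum learning rate of 0 is an assumption because the text does not state it.

```python
import math


def cosine_annealing_lr(t_cur, t_i, eta_max=1e-3, eta_min=0.0):
    """Learning rate after t_cur of the t_i epochs in the current run.

    eta_min = 0.0 is an assumed value; the source does not specify it.
    """
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    total_epochs = 300
    for epoch in (0, 75, 150, 225, 300):
        print(epoch, cosine_annealing_lr(epoch, total_epochs))
```

The rate starts at the maximum value, falls slowly at first, drops fastest around the middle of the run, and flattens out again as it approaches the minimum value.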

Experimental Results.
To verify the effectiveness of the proposed method, it was compared with the traditional FCN, U-Net, and DeepLab networks. The same data and parameter settings were used for training, and the trained models were evaluated on the test datasets.
For the comparative experiment [29,30], we chose the Dice coefficient, precision, and recall as the evaluation criteria to measure the quality of the model. These evaluation criteria are defined as follows:
$$\mathrm{Dice} = \frac{2\,|R_{seg} \cap R_{gt}|}{|R_{seg}| + |R_{gt}|},$$
where $R_{seg}$ represents the predicted segmentation result and $R_{gt}$ represents the ground-truth segmentation. When applied to a binary segmentation task, the Dice coefficient evaluates the degree of overlap between the predicted value $R_{seg}$ and the true value $R_{gt}$.
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
where TP (true positive) is the number of positive samples predicted as positive and FP (false positive) is the number of negative samples predicted as positive. Precision indicates how many of the samples predicted as positive are truly positive.
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
where TP (true positive) is the number of positive samples predicted as positive and FN (false negative) is the number of positive samples predicted as negative. Recall indicates how many of the positive samples are predicted correctly.
For a fair comparison, all the baselines are run on the same hardware platform with a single NVIDIA RTX 3080 GPU. The mini-batch size is set to 4 (4 images per GPU). To stabilize the training at the beginning, the number of warmup iterations has been extended from 30 to 50. Dice loss [12] is used as the loss function, Adam is used as the optimizer with an initial learning rate of 0.001, and training runs for 300 epochs. Overfitting must be considered when choosing an encoder network, so dropout [10] is used to improve the generalization of the network.
Tables 1 and 2 compare our J-Net with selected state-of-the-art networks. The results in Table 1 indicate, for the liver dataset, that the encoder-decoder model is substantially more accurate than other segmentation models. We compare our method with FCN, DeepLab-v3+, and U-Net in liver segmentation tasks. The precision and recall of J-Net are 0.8836 and 0.9678, which are better than those of FCN, DeepLab-v3+, and U-Net, while J-Net's Dice coefficient of 0.9129 is slightly lower than U-Net's Dice coefficient of 0.9172. However, the gap is small. On the whole, the performance of J-Net is better than that of the above three algorithms. These results support the effectiveness of our encoder-decoder architecture. The segmentation results of the liver dataset by different networks are shown in Figure 4, which gives a visual comparison on the liver val set. From left to right are input images, ground truth, and segmentation predicted from J-Net (Figure 4(a)), U-Net (Figure 4(b)), FCN (Figure 4(c)), and DeepLab-v3+ (Figure 4(d)). Table 2 shows the results for the EM datasets.
We compare our method with FCN and DeepLab-v3+ in EM segmentation tasks. It can be seen from Table 2 that the Dice coefficient of J-Net is 0.9376, while those of the other two networks are 0.9185 and 0.9364. On the whole, the performance of J-Net is better than that of the above two algorithms. The segmentation results of the EM datasets by different networks are shown in Figure 5, which gives a visual comparison on the EM val set. From left to right are input images, ground truth, and segmentation outputs from DeepLab-v3+ (Figure 5(a)), FCN (Figure 5(b)), and J-Net (Figure 5(c)).
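For reference, the three evaluation criteria used in this section (Dice coefficient, precision, and recall) can be computed for binary masks as sketched below. This is not the evaluation code of this paper. The function names are hypothetical, masks are assumed to be flat sequences of 0 and 1, and the values returned for empty masks are a convention chosen here to avoid division by zero.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives for flat 0/1 masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def dice_coefficient(pred, truth):
    """2|A and B| / (|A| + |B|), which equals 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 1.0  # both masks empty (assumed convention)


def precision(pred, truth):
    """TP / (TP + FP): share of predicted positives that are truly positive."""
    tp, fp, _ = confusion_counts(pred, truth)
    return tp / (tp + fp) if (tp + fp) else 0.0  # no predicted positives (assumed convention)


def recall(pred, truth):
    """TP / (TP + FN): share of true positives that are recovered."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn) if (tp + fn) else 0.0  # no true positives (assumed convention)


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1, 0]
    ground_truth = [1, 0, 0, 1, 1, 0]
    print(dice_coefficient(predicted, ground_truth))
    print(precision(predicted, ground_truth), recall(predicted, ground_truth))
```

Because the Dice coefficient is the harmonic mean of precision and recall, a model can lead on one of those two measures and still trail slightly on Dice, so the three values are best read together.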

Conclusions
This article analyzed previous works on medical image segmentation, proposed a new architecture (J-Net), and discussed the effect of the information flow path on feature extraction. By connecting ASPP modules in series, changing the encoder network, and reducing the size of the decoder network, an asymmetric encoder-decoder network is designed.
When segmentation involves a complex boundary, training can become unstable (frequent loss fluctuations), mainly because J-Net pays too much attention to high-level semantics and low-level features and neglects to reuse other features. Nevertheless, a large number of experiments on the challenging liver datasets and EM datasets have demonstrated the effectiveness of our method. The strategies proposed in this work may be extended to other medical imaging applications and even routine computer vision tasks.

Data Availability
The data used to support the findings of this study are available online.