BA-GCA Net: Boundary-Aware Grid Contextual Attention Net in Osteosarcoma MRI Image Segmentation

Osteosarcoma is one of the most common bone tumors occurring in adolescents. Doctors often use magnetic resonance imaging (MRI) through biosensors to diagnose and predict osteosarcoma. However, in a number of osteosarcoma MRI images the tumor boundary is vague, complex, or irregular, which makes diagnosis difficult for doctors and also causes some deep learning methods to lose segmentation details and fail to locate the osteosarcoma region. In this article, we propose a novel boundary-aware grid contextual attention net (BA-GCA Net) to solve the problem of insufficient accuracy in osteosarcoma MRI image segmentation. First, a novel grid contextual attention (GCA) is designed to better capture the texture details of the tumor area. Then the statistical texture learning block (STLB) and the spatial transformer block (STB) are integrated into the network to improve its ability to extract statistical texture features and locate tumor areas. Over 80,000 MRI images of osteosarcoma from the Second Xiangya Hospital are adopted as a dataset for training, testing, and ablation studies. Results show that our proposed method achieves higher segmentation accuracy than existing methods with only a slight increase in the number of parameters and computational complexity.


Introduction
Osteosarcoma is the most common malignant bone tumor, occurring most frequently in children and adolescents between the ages of 10 and 30 years, with the highest incidence during the adolescent growth spurt [1]. In the past few years, neoadjuvant chemotherapy and biosensors have developed greatly, making the treatment of osteosarcoma easier. Nevertheless, without an early diagnosis, patients with advanced osteosarcoma will develop metastatic and recurrent disease, for which the 5-year survival rate remains below 20% [2,3]. Therefore, how to clearly and accurately diagnose osteosarcoma has become the key to prevention and treatment.
Magnetic resonance imaging (MRI) can display the structure of soft tissue clearly and has higher contrast as well as resolution than other imaging methods [4], which makes the tumor area easier to distinguish. It is considered to be the best imaging method to evaluate the relationship between the primary osteosarcoma lesion and its surrounding areas [5]. Traditionally, the diagnosis of osteosarcoma is based on manual histopathological analysis with biosensors on MRI images by doctors. However, this has great disadvantages. In developing countries where the medical level is relatively backward, the doctor-patient ratio remains low, with each doctor handling the diagnosis and treatment of about 60 patients per day on average [6][7][8]. In addition, one patient will produce more than 600 MRI images during one diagnosis with biosensors, making analysis laborious and time-consuming [9][10][11][12][13][14]. To make matters worse, doctors' high-intensity work makes their manual judgments susceptible to inter- and intra-observer variations and results in inaccurate segmentation of osteosarcoma areas [15][16][17]. Furthermore, due to the heterogeneity of osteosarcoma [1], sophisticated diagnoses using MRI biosensors often require experienced radiologists, which is extraordinarily challenging for some developing countries with backward allocation of medical resources [18][19][20].
In order to solve the problems of manual lesion segmentation by doctors, researchers have designed a variety of automatic segmentation models and applied them to medical image segmentation to help doctors diagnose diseases and predict lesion areas, thus reducing the pressure on medical resources in developing countries and enabling high-accuracy diagnosis at low computational cost. A fully convolutional network (FCN) [21] uses skip layers to achieve end-to-end, joint learning of semantics and location. It is the most classic model in medical image segmentation. U-Net [22] crops the output feature maps from shallow layers and concatenates them to the ones from deep layers to fuse and exploit the low-level and high-level features, improving the network's performance on neural structure and cell segmentation. In the specific field of osteosarcoma segmentation, [23] uses a recurrent convolutional neural network (RCNN) combining CNN and GRU and achieves better performance with a small number of histopathological osteosarcoma images. MSFCN [24] and MSRN [25] add multiple supervised structures to the network to promote learning and improve the overall osteosarcoma segmentation accuracy.
Considerable progress has been made in the research on osteosarcoma segmentation models. However, a few segmentation problems in MRI images using biosensors have unfortunately been overlooked: (i) The boundaries between osteosarcoma and normal tissues in some images are not clear enough and the lesion area is indistinguishable from other soft tissues. Therefore, the low-contrast boundaries may be blurred during convolution operations, resulting in segmentation failure. (ii) In transverse section images, the osteosarcoma area is often small, and the model is prone to spatial shift in the process of down-sampling and up-sampling, which leads to difficulty in localization and a decrease in accuracy. (iii) Some osteosarcoma images have complicated and irregular shape boundaries, and the model cannot identify small gaps between osteosarcoma and normal tissues, causing the identification of the entire region as a lesion area and the loss of segmentation details. These problems contribute to poor performance on a number of difficult tasks with vague foreground-background boundaries or small and complex osteosarcoma regions, which have become a significant factor that affects the accuracy of segmentation.
In order to solve the problems mentioned above, we propose a novel boundary-aware grid contextual attention network (BA-GCA Net), which effectively improves the performance of the network on MRI images with blurred osteosarcoma boundaries and complex foreground structure. First, we propose a plug-and-play grid contextual attention structure. The structure splits the input feature map into patches, exploits the local contextual information to learn the positional features inside the image patches, and enhances the network's capability to capture the details of osteosarcoma boundaries and texture. For the problem that certain osteosarcoma MRI images using biosensors have intricate shape boundaries or fuzzy tumor texture, a statistical texture learning block (STLB) is integrated into the network. STLB learns the low-level features and applies them to the task. The texture enhancement module (TEM) in the STLB first enhances the texture in the low-level feature map and produces a clearer texture map, which is conducive to more accurate segmentation of the osteosarcoma region.
Then the pyramid texture feature extraction module (PTFEM) is used in the STLB to further extract and utilize the enhanced texture. Due to the small tumor areas in some osteosarcoma MRI images, a subtle spatial shift of the prediction map will lead to poor segmentation performance. To this end, we use a spatial transformer block (STB) in the network to make it invariant to spatial shifts. STB localizes and regresses the input feature map, learns an affine transformation matrix, and applies the transformation to the feature map. It spatially adjusts the prediction map so that the positioning of osteosarcoma areas is more accurate.
In general, our contributions can be summarized as follows: (1) We propose a novel BA-GCA Net, which can better learn detailed features in the input image and fully exploit texture features in a spatially invariant way to improve the segmentation accuracy of osteosarcoma MRI images. (2) In order to pay more attention to local details in the input image, we propose a plug-and-play grid contextual attention (GCA) structure, which reshapes the image into patches and applies local and global contextual attention to them to enhance the perception of local details in osteosarcoma areas. (3) Inspired by U-Net, low-level features in the input image have rich texture details and therefore a statistical texture learning block (STLB) is used to learn texture features in the low-level feature map and utilize them in deeper layers of the network, improving the segmentation accuracy in tasks where the tumor area has blurred boundaries or small gaps. (4) To improve the model's ability in locating the osteosarcoma lesion area, we use an STB in the network to learn an affine transformation matrix and adjust the prediction map. Moreover, a boundary loss is designed in the loss function to facilitate STB to learn positional boundary information, which promotes the segmentation performance on images where the region of osteosarcoma is small. (5) Over 80,000 MRI images of osteosarcoma output by biosensors from the Second Xiangya Hospital are adopted as a dataset for experiments. Results show that our proposed method outperforms other models on accuracy with only a slight increase in the number of parameters and computational complexity compared with the backbone network and achieves a balance in terms of accuracy and computational efficiency, which is helpful to doctors in judging the osteosarcoma lesion area and reducing workload.

Related Works
2.1. Osteosarcoma Image Segmentation. Accurate diagnosis and prediction of osteosarcoma are the keys to increasing the survival rate of the patients and making precise follow-up treatment plans. Numerous researchers have previously studied osteosarcoma image segmentation. Reference [26] uses similarity mapping and slope value to analyze the time-intensity curves of regions of interest (ROI) and fuses the anatomic information of traditional MRI sequences with the numerical information of dynamic MRI sequences to obtain a better description of osteosarcoma regions. Reference [27] proposes a dynamic clustering algorithm DCHS based on Harmony Search (HS) and Fuzzy C-means (FCM) to automatically segment osteosarcoma MRI images, using a subset of Haralick texture features and pixel intensity values as a feature space for DCHS to delineate tumor volume, achieving an average Dice of 0.72. With the rapid development of deep learning and computer vision [28], a great number of deep learning-based models have been designed by researchers for the segmentation of osteosarcoma images as auxiliary diagnosis methods. Multiple supervised fully convolutional network (MSFCN) [24] adds supervision layers to the output layers of different sizes in the VGG model and uses the output information of the multiple supervision layers to produce the prediction map. Multiple supervised residual network (MSRN) [25] integrates residual structure into the network on the basis of multiple supervised structure, which improves the performance of a deep neural network on osteosarcoma image segmentation tasks. Wnet++ [29] uses two cascaded U-Nets and dense skip connections to realize automatic segmentation of tumor areas. In addition, Wnet++ adopts multi-scale input to alleviate information loss caused by down-sampling and introduces an attention mechanism to better represent tumor features, which increases the accuracy of segmentation.
In osteosarcoma MRI image segmentation, there are often blurred foreground-background boundaries or small and complex segmentation regions. Therefore, the local semantic details and boundary information are of particular importance. Different from the above methods, we design BA-GCA Net, which embeds modules into the semantic segmentation framework to enhance the model's ability to extract rich local semantic, texture statistics, and boundary information and improves the model's performance on intricate segmentation tasks.

Boundary Prediction Enhancement Methods.
In osteosarcoma MRI image segmentation, the prediction of boundaries is crucial to the model's performance. Previously, a great number of researchers in the field of medical image segmentation have been devoted to solving the problem of segmentation boundaries [23][24][25][26][27][28]. Structure boundary preserving segmentation [30] obtains the structured boundary information of an image through a key point selection algorithm, a boundary preserving block, and a shape boundary-aware evaluator. BFP [31] utilizes a boundary-aware feature propagation module to transfer low-level boundary information. InverseForm [32] enables the boundary loss function to learn spatial transformation distance through a pretrained inverse transformation network. Some other works [33][34][35] have improved the boundary loss function and achieved good results.
Unfortunately, the above methods neither take full advantage of the rich low-level statistical boundary texture features of the input image nor solve the problem of spatial shift of the prediction boundary that may exist in small osteosarcoma lesion areas. Unlike the above methods, we use a statistical texture learning block (STLB) [36] to quantify and count the low-level texture information output by the shallow layers in the network. Because some tumor areas in osteosarcoma MRI images are small, spatial shifts may occur during down-sampling and up-sampling. Therefore, we integrate the STB [37] into the deep layer of the network to enhance the spatial transformation invariance. Combined with the boundary loss function, STB will automatically learn the spatial shift of the prediction map and adjust it adaptively.

Attention Mechanisms.
The attention mechanism has been proven to raise the model's capability of giving more weight to useful features to improve semantic analysis and has achieved good results in a variety of computer vision tasks [38][39][40][41][42].
SENet [38] compresses feature maps in the spatial dimension and generates a channel-wise attention. Based on SENet, CBAM [39] additionally introduces a spatial attention through channel pooling and large-scale convolution and achieves certain improvements in classification and detection. SANet [40] divides the segmentation task into two subtasks, pixel-level prediction and pixel grouping, and combines multi-scale prediction and pixel-grouping spatial attention to improve performance. OCR [41] learns the relationship between pixel and object region features based on coarse segmentation maps and enhances the description of pixel features. Coordinate attention (CA) [42] rethinks the attention mechanism and produces an attention map with positional information by compressing the spatial features of the image into attention weights in the horizontal and vertical directions. The above methods can extract image context in an efficient way, but cannot pay extra attention to the local details of the image that are indispensable for pixel-to-pixel osteosarcoma MRI image segmentation. Different from the above methods, our proposed grid contextual attention (GCA) combines local and global contextual attention, which can exploit the global contextual features of the feature map and learn local contextual features in the meantime.
Computational Intelligence and Neuroscience 3

Methods
As mentioned above, the shortage of medical resources and the backward medical level in some developing countries make diagnosis of osteosarcoma more formidable. Moreover, the blurred and low-contrast tumor areas as well as small and intricate structural boundaries may lead to fuzzy or even wrong prediction of the lesion region by automatic segmentation models and influence clinical diagnoses. To this end, we add GCA, STLB, and STB to the network to reduce segmentation errors in osteosarcoma MRI images using biosensors while keeping the number of parameters and the computational complexity at a low level to ensure low diagnostic cost. The overall structure of the network is shown in Figure 1.
For an osteosarcoma image generated by MRI biosensors, it is first fed into the backbone to extract high-level and low-level features. We integrate GCA at the top of the ResNet building blocks after the first two layers to improve the feature extraction ability of the model. The low-level features produced by the first two layers are fed into STLB to enhance and analyze texture statistics. The output of STLB is concatenated in the channel dimension with the high-level features from the GCA-augmented backbone. Then the fused output is fed into STB to perform an affine transformation on the prediction map and produce the final output. The Canny [43] operator is applied to the prediction result and the ground truth to extract the boundaries. Segmentation loss and boundary loss are calculated using the segmentation masks and boundaries, respectively, to form the compound loss function.
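The boundary-extraction step above uses the Canny operator. As a minimal, hedged stand-in for binary segmentation masks (this is our own illustrative sketch, not the paper's implementation), the boundary can also be recovered by subtracting a one-pixel erosion from the mask:

```python
import numpy as np

def mask_boundary(mask):
    """Illustrative boundary extraction for a binary mask: erode the
    mask by one pixel with a 3x3 structuring element, then keep only
    the pixels removed by the erosion (the mask's inner boundary)."""
    padded = np.pad(mask, 1)
    eroded = np.ones_like(mask)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # a pixel survives erosion only if all 8 neighbors are set
            eroded &= padded[1 + dy:1 + dy + mask.shape[0],
                             1 + dx:1 + dx + mask.shape[1]]
    return mask & ~eroded
```

A real pipeline would apply Canny to the soft prediction map instead; this sketch only illustrates how a boundary target can be derived from a mask for the boundary loss.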

Grid Contextual Attention.
In pixel-wise osteosarcoma MRI image segmentation, understanding the local details in the image often helps in more accurate segmentation as well as less uncertain prediction. Therefore, we design a grid contextual attention (GCA) structure based on both local and global contextual features. The structure is shown in Figure 2.
For an input feature map X ∈ R^(C×H×W), its global contextual information is obtained by A^h = Excit(Avg^h(X)) and A^w = Excit(Avg^w(X)), where A^h ∈ R^(C×H×1), A^w ∈ R^(C×1×W), and Avg^h and Avg^w denote average pooling of the feature map in the height and width directions, respectively, and Excit denotes the activation transformation of the input (a convolution followed by a sigmoid function). In the part of local attention, GCA splits the feature map into patches, each of which is denoted as P_(i,j) ∈ R^(C×P_h×P_w), where P_h and P_w represent the patch size in the height and width directions and i ∈ {1, 2, ..., H/P_h}, j ∈ {1, 2, ..., W/P_w}.
Each patch P_(i,j) is passed through average pooling in the width and height directions, respectively, to get the local attentions A^h_(i,j) ∈ R^(C×P_h×1) and A^w_(i,j) ∈ R^(C×1×P_w). The two are then combined as A^m_(i,j) = A^h_(i,j) × A^w_(i,j) ∈ R^(C×P_h×P_w), where × denotes matrix multiplication in the spatial dimension. Finally, A^m_(i,j) and P_(i,j) are multiplied element-wise to get the reweighted patches, and the patches are concatenated, obtaining the output feature map X_out ∈ R^(C×H×W).
Compared with coordinate attention (CA) [42], GCA can better learn the local detailed features in the osteosarcoma MRI images while maintaining the global semantic features, which improves the segmentation accuracy in some blurred tumor images. In addition, to flexibly adjust the patch size and compensate for the loss of information between patches when using different patch sizes, a padding-crop operation is designed in GCA. Through adaptive padding, the input feature map can be divided into patches of any size and be cropped back to the original input size after the attention operation.
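The local branch of GCA with the padding-crop operation can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (the sigmoid excitation, zero padding, and loop-based patching are illustrative; the released implementation may differ):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grid_contextual_attention(x, ph, pw):
    """Sketch of the GCA local branch: pad the C x H x W map so it
    tiles into (ph, pw) patches, reweight each patch by the product of
    its height- and width-direction pooled attentions, then crop back."""
    c, h, w = x.shape
    pad_h = (-h) % ph                              # adaptive padding
    pad_w = (-w) % pw
    xp = np.pad(x, ((0, 0), (0, pad_h), (0, pad_w)))
    _, hp, wp = xp.shape
    out = np.empty_like(xp)
    for i in range(hp // ph):
        for j in range(wp // pw):
            patch = xp[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
            a_h = sigmoid(patch.mean(axis=2, keepdims=True))  # C x ph x 1
            a_w = sigmoid(patch.mean(axis=1, keepdims=True))  # C x 1 x pw
            a_m = a_h * a_w        # broadcasted outer product, C x ph x pw
            out[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw] = a_m * patch
    return out[:, :h, :w]          # crop back to the original size
```

Because each attention weight lies in (0, 1), the operation only rescales patch activations; the output keeps the input's shape regardless of whether H and W divide evenly by the patch size.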

Statistical Texture Learning Block.
In osteosarcoma segmentation tasks, the rich contextual information contained in low-level features plays a crucial role in segmentation performance. To solve the problem of blurred boundaries as well as complex and irregular tumor shapes in osteosarcoma MRI images using biosensors, we use the statistical texture learning block (STLB) [36] to fully exploit and utilize the texture features and combine the rich low-level features with the high-level features in the deeper layers of the network. SFNet can combine low-level and high-level features with semantic flow. At the same time, STLB can explore the statistical features of osteosarcoma image texture [44,45]. It not only learns the structural texture information but also learns the global statistical information of the image, which is helpful for osteosarcoma segmentation. In this section, the 1d and 2d quantization and counting operators (QCO) are first introduced for the statistical description of the texture information. Then the two modules in STLB are introduced: the texture enhancement module (TEM), based on 1d-QCO, to enhance the osteosarcoma texture features, and the pyramid texture feature extraction module (PTFEM), based on 2d-QCO, to further exploit the texture features.
For an input feature map X ∈ R^(C×H×W), 1d-QCO applies global average pooling to X to get the average feature a ∈ R^(C×1×1). Then the cosine similarity between each pixel X_(i,j) in X and a is calculated to get S ∈ R^(1×H×W), where i ∈ {1, 2, ..., H} and j ∈ {1, 2, ..., W}. Each position S_(i,j) is denoted as S_(i,j) = (X_(i,j) · a) / (‖X_(i,j)‖ ‖a‖). Then S is reshaped to S ∈ R^(HW) and quantized to obtain the N levels L = [L_1, L_2, ..., L_N]. The nth level L_n is written as L_n = min(S) + (n/N) · (max(S) − min(S)), where N is a hyperparameter and n ∈ {1, 2, ..., N}. The encoding map E ∈ R^(N×HW) consists of each pixel's encoded value. Compared with one-hot encoding or an argmax operation, the quantization encoding is smoother and more robust to gradient vanishing.
1d-QCO then applies a counting operation to the encoding map E to get the counting map M ∈ R^(N×2). Concretely, M is calculated by M = Concat(L, C), where C_n = (1/HW) · Σ_i E_(n,i) counts the normalized frequency of the nth level and Concat denotes the concatenate operation in the channel dimension. Thereafter, the average feature a is up-sampled to a ∈ R^(N×C) and concatenated to the up-sampled M to produce P ∈ R^(N×C_1). The output of 1d-QCO includes the encoding map E as well as the statistical texture information P of osteosarcoma.
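The quantization-and-counting idea can be sketched in a few lines of NumPy. This is a hedged illustration: the triangular soft-encoding kernel and the level spacing below are our own assumptions, chosen only to show the smooth-binning-plus-counting pattern:

```python
import numpy as np

def qco_1d(s, n_levels):
    """Illustrative 1d quantization-and-counting: quantize the
    similarity map s into n_levels soft bins, then count how much mass
    falls into each bin, returning the encoding map E (N x HW) and the
    counting map M (N x 2: level value, normalized count)."""
    s = s.ravel()
    lo, hi = s.min(), s.max()
    levels = lo + (np.arange(1, n_levels + 1) / n_levels) * (hi - lo)
    # soft one-hot: a pixel contributes to a level when it is close to it,
    # which keeps the encoding differentiable (unlike a hard argmax)
    dist = np.abs(s[None, :] - levels[:, None])          # N x HW
    width = (hi - lo) / n_levels + 1e-8
    e = np.clip(1.0 - dist / width, 0.0, None)           # encoding map E
    counts = e.sum(axis=1) / s.size                      # normalized counting
    m = np.stack([levels, counts], axis=1)               # counting map M
    return e, m
```

In the actual block the counting map is further fused with the up-sampled average feature; the sketch stops at the statistics themselves.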

2d-QCO.
1d-QCO contains the statistical texture features of the osteosarcoma images. However, it cannot learn positional relationships between pixels. Therefore, a 2d-QCO is proposed.
Similar to 1d-QCO, 2d-QCO calculates cosine similarity and level encoding of the input feature map X ∈ R C×H×W to get the encoding map E ∈ R N×HW and quantization levels L.
Then E is reshaped to E ∈ R^(N×1×H×W). For the encodings of each adjacent pixel pair E_(i,j) ∈ R^(N×1) and E_(i,j+1) ∈ R^(N×1), the encoded value Ê_(i,j) ∈ R^(N×N) that carries adjacent information is calculated by Ê_(i,j) = E_(i,j) × E_(i,j+1)^T, where T and × denote matrix transpose and multiplication, respectively. Then we get the encoding map Ê ∈ R^(N×N×H×W) that contains the adjacent features of the input.
In the counting process, the counting map M ∈ R^(N×N×3) is produced by concatenating each pairwise level combination with its normalized count, i.e., M_(m,n) = [L_m, L_n, (1/HW) · Σ_(i,j) Ê_(m,n,i,j)], where L̂ ∈ R^(N×N×2) represents the pairwise combination of all the quantization levels and L̂_(m,n) = [L_m, L_n].
In 2d-QCO, the average feature is written as a ∈ R^(N×N×C), and the final output P ∈ R^(N×N×C_1) is obtained, analogously to the 1d case, from the up-sampled counting map M and the average feature a.
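The pairing step of 2d-QCO, which turns per-pixel level encodings into co-occurrence statistics for horizontally adjacent pixels, can be sketched with an outer product over the level dimension (a NumPy illustration under our own assumptions, not the released code):

```python
import numpy as np

def cooccurrence_encoding(e):
    """Illustrative 2d-QCO pairing: given a level-encoding map e of
    shape N x H x W, take the outer product of the encodings of each
    horizontally adjacent pixel pair so the result carries adjacency
    information, then average over positions to get N x N statistics."""
    n, h, w = e.shape
    left = e[:, :, :-1]                              # N x H x (W-1)
    right = e[:, :, 1:]
    # outer product over the level dimension for every pixel pair
    pair = np.einsum('mhw,nhw->mnhw', left, right)   # N x N x H x (W-1)
    counts = pair.reshape(n, n, -1).mean(axis=2)     # co-occurrence stats
    return pair, counts
```

The N × N count matrix plays the role of a soft gray-level co-occurrence matrix, which is exactly the classical structure PTFEM later draws on.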

Texture Enhancement Module.
The low-level textures such as structural boundaries in osteosarcoma images are often blurred and of low contrast. To this end, a texture enhancement module (TEM) is employed to sharpen the structural texture and make the low-level features easier to learn. The structure of TEM is shown in Figure 4.
Inspired by the histogram quantization method in traditional image processing algorithms, the statistical information in each quantization level is treated as a node in a graph adjacency matrix. Unlike the traditional method of artificially defining a diagonal matrix, TEM uses graph reasoning to construct the adjacency matrix and reconstruct the quantization levels L to get L′. Finally, the output O ∈ R^(C_2×H×W) is obtained from the reconstructed quantization levels L′ and the encoding map E as O = Reshape(L′^T × E), where Reshape represents reshaping the output to O ∈ R^(C_2×H×W).

Pyramid Texture Feature Extraction Module.
The features of statistical texture in osteosarcoma MRI images are effectively enhanced through TEM. Then a pyramid texture feature extraction module (PTFEM) is proposed to extract and exploit the rich texture features of the boundaries. The structure of PTFEM is shown in Figure 5. Inspired by the conventional gray-level co-occurrence matrix algorithm, the input feature map of osteosarcoma is first passed through 2d-QCO to get the statistical co-occurrent features P ∈ R^(C×N×N), and then the texture features T ∈ R^(C′) are calculated using an MLP and a level-wise average operation. Some previous works such as FPN [46] and DeepLabV3+ [47] found that the employment of a multi-scale structure can improve the model's performance. Inspired by these works, PTFEM integrates 2d-QCO at different scales into the structure to better extract and utilize the osteosarcoma texture features.

Spatial Transformer Block.
Due to the small and complicated osteosarcoma lesion areas in some MRI images produced by biosensors, even a slight spatial shift of the prediction map can produce poor results, which in turn leads to wrong diagnoses. To this end, we use an STB [37] to make the backbone invariant to spatial transformation and more robust to osteosarcoma images with small and intricate tumor regions.
For the high-resolution segmentation map of osteosarcoma, we assume that the error of the map with respect to the ground truth can be reduced by a homography transformation. Therefore, we use STB to learn this spatial transformation. The structure of STB is shown in Figure 6.
For an input feature map X ∈ R^(C×H×W), STB uses a set of down-sampling convolutions F_downsample and fully connected layers F_regression to produce an affine transformation matrix M_affine ∈ R^(2×3). Concretely, M_affine is denoted as M_affine = F_regression(F_downsample(X)). Simultaneously, in another branch, the input feature map is passed through a 1 × 1 convolution as well as a softmax activation to get the initial prediction map pred ∈ R^(2×H×W), which can be described as pred = Softmax(Conv(X)).
We denote the affine matrix as M_affine = [[a_11, a_12, a_13], [a_21, a_22, a_23]].
For the coordinates (x^s_n, y^s_n) of each pixel in the initial prediction map pred and the coordinates (x^t_n, y^t_n) of each pixel in the final prediction map pred′ ∈ R^(2×H×W), where n ∈ {1, 2, ..., HW}, the affine transformation is defined as (x^s_n, y^s_n)^T = M_affine · (x^t_n, y^t_n, 1)^T. In order to apply the spatial transformation to the initial prediction map, STB samples each (x^s_n, y^s_n) to obtain the final output pred′. Specifically, the process can be represented as pred′_(n,c) = I(pred_c, x^s_n, y^s_n; Φ_x, Φ_y),

where I denotes bilinear interpolation, Φ_x and Φ_y denote the sampling parameters, and c ∈ {1, 2, ..., C} represents the channel index of the feature map.
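The warp-and-sample step of STB can be sketched in NumPy as follows. This is an illustrative sketch of the standard spatial-transformer sampling (coordinates normalized to [−1, 1], bilinear kernel); the learned regression network that produces M_affine is omitted:

```python
import numpy as np

def affine_sample(pred, m_affine):
    """Map each target coordinate through the 2x3 affine matrix and
    bilinearly sample the initial prediction map at the source
    coordinates, as a spatial transformer does."""
    c, h, w = pred.shape
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w),
                         indexing='ij')
    ones = np.ones_like(xs)
    tgt = np.stack([xs.ravel(), ys.ravel(), ones.ravel()])   # 3 x HW
    src = m_affine @ tgt                                     # 2 x HW
    # convert normalized source coordinates back to pixel indices
    sx = (src[0] + 1) * (w - 1) / 2
    sy = (src[1] + 1) * (h - 1) / 2
    x0 = np.clip(np.floor(sx).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(sy).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    wx, wy = sx - x0, sy - y0
    out = np.empty_like(pred)
    for ch in range(c):
        p = pred[ch]
        val = (p[y0, x0] * (1 - wx) * (1 - wy) + p[y0, x1] * wx * (1 - wy)
               + p[y1, x0] * (1 - wx) * wy + p[y1, x1] * wx * wy)
        out[ch] = val.reshape(h, w)
    return out
```

With the identity matrix [[1, 0, 0], [0, 1, 0]] the output reproduces the input, which is the usual initialization so that STB starts from a no-op transformation.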

Compound Loss Function.
To improve both the prediction accuracy and the boundary perception ability of the model for osteosarcoma, we design a compound loss function based on focal loss [48]. The loss function consists of a segmentation loss as well as a boundary loss.

Weighted Segmentation Focal Loss.
For the prediction map y_pred and the ground truth y_gt, the weighted segmentation focal loss is defined as L_seg = −α(1 − y_pred)^γ · y_gt · log(y_pred) − (1 − α) · y_pred^γ · (1 − y_gt) · log(1 − y_pred), where α and γ represent the balance weight and the exponential hyperparameter, respectively.

Weighted Boundary Focal Loss.
In osteosarcoma MRI image segmentation, the boundary plays an essential role in improving segmentation performance. Therefore, we introduce a weighted boundary focal loss to facilitate the model to learn boundary information. First, the Canny [43] operator is applied to the segmentation head and the ground truth to produce the prediction and ground truth boundaries b_pred and b_gt. Then we perform edge sharpening on the normalized b_gt with a threshold of 0.5 and obtain a clear ground truth boundary b′_gt. The weighted boundary focal loss L_boundary is calculated from b_pred and b′_gt in the same form as the segmentation focal loss. The compound loss function is defined as L = L_seg + β · L_boundary, where β is the weight hyperparameter. After experiments, one suitable value of β is 0.2, which is used in this article. The compound loss function makes the model aware of the boundaries of osteosarcoma and also promotes STB to learn the spatial transformation between the prediction output and the ground truth, which in turn makes the model more robust in tumor segmentation. BA-GCA Net is trained with the compound loss function, enhancing the model's ability in boundary localization.
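The compound objective can be sketched as follows. We use the standard binary focal loss form (Lin et al.); the exact weighting in the paper may differ, and the default α and γ below are common illustrative values, not the paper's settings:

```python
import numpy as np

def focal_loss(pred, gt, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-pixel binary focal loss, averaged over the map."""
    pred = np.clip(pred, eps, 1 - eps)
    pos = -alpha * (1 - pred) ** gamma * gt * np.log(pred)
    neg = -(1 - alpha) * pred ** gamma * (1 - gt) * np.log(1 - pred)
    return (pos + neg).mean()

def compound_loss(y_pred, y_gt, b_pred, b_gt, beta=0.2):
    """Sketch of the compound objective: segmentation focal loss plus a
    beta-weighted boundary focal loss (beta = 0.2 as in the text)."""
    return focal_loss(y_pred, y_gt) + beta * focal_loss(b_pred, b_gt)
```

A perfect prediction drives both terms toward zero, while the boundary term keeps a gradient signal focused on the thin boundary band even when the region overlap is already high.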

Experiments
In this section, we employ over 80,000 MRI images of osteosarcoma output by biosensors from 204 cases as the dataset for experiments to evaluate the model and perform ablation studies, which are provided by the Ministry of Education Mobile Health Information-China Mobile Joint Laboratory and the Second Xiangya Hospital of Central South University [49].

Dataset.
We have collected statistics about the patients and the results are shown in Table 1. We randomly select 80% of the images for training and the remaining 20% for evaluation. To be specific, there are a total of 204 case samples, of which 164 are in the training set and 40 are in the test set. Due to the confidentiality of data between hospitals and the privacy of patients, the dataset is relatively hard to obtain, which can lead to overfitting of the model. To promote the robustness of the model to new data, we perform data augmentation on the training set. We rotate the images at three angles (0°, 90°, and 180°), flip the images on different axes (no flip, up-down, and left-right), perform Gaussian blurring, add Gaussian noise (with different variances), and apply salt-and-pepper noise (with different proportions) to augment the training set.
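The augmentation pipeline above can be sketched as a single function (an illustrative sketch: the noise variances, proportions, and branch probabilities below are our own placeholder values, and Gaussian blurring is omitted for brevity):

```python
import numpy as np

def augment(img, rng):
    """Random rotation (0/90/180 degrees), random flip, and either
    additive Gaussian noise or salt-and-pepper noise, matching the
    augmentation steps described in the text."""
    img = np.rot90(img, k=rng.choice([0, 1, 2]))
    flip = rng.choice(['none', 'ud', 'lr'])
    if flip == 'ud':
        img = np.flipud(img)
    elif flip == 'lr':
        img = np.fliplr(img)
    if rng.random() < 0.5:                       # Gaussian noise
        img = img + rng.normal(0, 0.01, img.shape)
    else:                                        # salt-and-pepper noise
        mask = rng.random(img.shape)
        img = np.where(mask < 0.005, 0.0,
                       np.where(mask > 0.995, 1.0, img))
    return np.clip(img, 0.0, 1.0)
```

When an image is augmented this way, the same geometric transform must of course be applied to its segmentation mask, while the noise is applied to the image only.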

Evaluation Metrics.
In order to evaluate the performance of the model on osteosarcoma MRI image segmentation, in this section, we introduce accuracy, precision, recall, F1-score, Dice similarity coefficient (DSC) [50], and Intersection over Union (IOU) as the evaluation metrics and the confusion matrix with true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) to explain the performance of the model [51]. The evaluation metrics are defined as follows. Accuracy (Acc) is used to evaluate the proportion of the model's correct predictions and is denoted as [52] Acc = (TP + TN) / (TP + TN + FP + FN). Precision (Pre) calculates the percentage of true osteosarcoma areas among the predicted areas [53] and can be written as Pre = TP / (TP + FP). Recall (Rec) evaluates the percentage of predicted osteosarcoma areas among the true areas [54]; concretely, Rec = TP / (TP + FN). F1 indicates the robustness of segmentation and is defined as [55] F1 = 2 · Pre · Rec / (Pre + Rec). For a simpler description of DSC as well as IOU, we denote the prediction of the tumor area as y_pred and the ground truth as y_gt.
DSC represents the similarity between y_pred and y_gt and can be written as DSC = 2|y_pred ∩ y_gt| / (|y_pred| + |y_gt|). IOU measures the degree of overlap between the model's prediction map and the ground truth and is denoted as [56] IOU = |y_pred ∩ y_gt| / |y_pred ∪ y_gt|. Furthermore, we adopt #params as the number of parameters of the model and use floating point operations (FLOPs) to evaluate the computational complexity [57,58].
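The confusion-matrix metrics above can be computed for a pair of binary masks as follows (a straightforward sketch; the small epsilon only guards against division by zero):

```python
import numpy as np

def segmentation_metrics(y_pred, y_gt):
    """Accuracy, precision, recall, F1, DSC, and IOU from the
    confusion matrix of two binary masks."""
    tp = np.sum((y_pred == 1) & (y_gt == 1))
    tn = np.sum((y_pred == 0) & (y_gt == 0))
    fp = np.sum((y_pred == 1) & (y_gt == 0))
    fn = np.sum((y_pred == 0) & (y_gt == 1))
    eps = 1e-8
    acc = (tp + tn) / (tp + tn + fp + fn + eps)
    pre = tp / (tp + fp + eps)
    rec = tp / (tp + fn + eps)
    f1 = 2 * pre * rec / (pre + rec + eps)
    dsc = 2 * tp / (2 * tp + fp + fn + eps)   # Dice similarity coefficient
    iou = tp / (tp + fp + fn + eps)           # Intersection over Union
    return dict(acc=acc, pre=pre, rec=rec, f1=f1, dsc=dsc, iou=iou)
```

Note that for binary masks DSC equals F1, and DSC and IOU are monotonically related (IOU = DSC / (2 − DSC)), which is why papers often report both only for readability.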

Training Details.
The Dilated ResNet-D-22 (DRN-D-22) [52] is chosen as the backbone, and BA-GCA Net is designed based on it. Note that osteosarcoma MRI images often have large individual differences, and the relationship between pixels in one image plays a more crucial role than that among images. To this end, we replace batch norm in the backbone with layer norm. Other hyperparameters of the model are shown in Table 2. Figure 7 shows the performance of each model on osteosarcoma MRI image segmentation. Column (A) represents the original image, columns (B)-(K) are the prediction outputs of each model, and column (L) is the ground truth. Note that BA-GCA Net outperforms other models on some difficult tasks, such as the second image, which has low contrast and blurred boundaries, the fourth image, a transverse section with a small and complicated tumor area, and the last image, which contains small gaps in the region of osteosarcoma. Results show that our proposed BA-GCA Net has better performance than some of the latest methods such as DeepLabV3 and UNet++ in capturing boundary details and recognizing blurred osteosarcoma lesion areas. Furthermore, BA-GCA Net is more robust in segmentation, and the DSC of its prediction remains above 0.93 in difficult segmentation tasks. Compared with other models, BA-GCA Net shows an advantage in processing the low-contrast and complex osteosarcoma images and localizing the small tumor regions, which is helpful for clinical diagnosis. The quantitative evaluation results of BA-GCA Net and other comparative models on the test set are shown in Table 3.

Comparison with Other Methods.
From the results we can see that BA-GCA Net achieves higher precision, F1-score, DSC, and IOU than recent methods, which means our proposed method performs better overall on the test set. By integrating GCA, STLB, and STB into the backbone, the DSC of the model increases by 0.004, 0.003, and 0.011, and the IOU increases by 0.023, 0.007, and 0.007, respectively. This demonstrates that the three blocks effectively strengthen the backbone, with a total increase of 0.018 in DSC and 0.037 in IOU. A more detailed analysis is given in the Ablation Study section. The comparison of FLOPs and DSC between models is shown in Figure 9. The results show that the computational cost of our proposed method is 72.49 GFLOPs higher than DRN with CA and 10.46 GFLOPs lower than U-Net, which achieves a good balance between accuracy and computational complexity.
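As a reminder of how FLOPs figures like these are usually obtained, the sketch below estimates the cost and parameter count of a single convolution layer under the common convention of counting one multiply-add as two FLOPs (the exact convention used in the paper's tooling is not stated, so this is an assumption; function names are illustrative).

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """Approximate FLOPs of a k x k 2D convolution on a c_in-channel input
    producing a c_out x h_out x w_out output (1 multiply-add = 2 FLOPs)."""
    macs = c_in * k * k * c_out * h_out * w_out  # multiply-accumulates
    return 2 * macs

def conv2d_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters of a 2D convolution layer."""
    return c_out * (c_in * k * k + (1 if bias else 0))
```

Summing such per-layer counts over the network gives the #params and GFLOPs totals compared in Figure 9 and Table 3.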

Ablation Study
In this section, ablation studies on the three blocks are presented in order to better analyze the role and the necessity of each block in BA-GCA Net. By comparing the performance of the model with and without each block, and by visualizing the outputs of the intermediate layers, we can verify whether a block plays its expected role.

Ablation of GCA.
The attention mechanism can recalibrate the input feature map so that the model focuses on the regions of interest. To illustrate that GCA learns this calibration more precisely, in this section we employ Seg-Grad-CAM [61] to visualize GCA as well as CA [42]. By summing the partial derivatives of the output target region with respect to the feature maps before and after the last GCA block (or CA block), respectively, and taking the mean of the derivatives over each channel as the weight of that feature map, we can visualize the influence of the attention block. The visualization results of GCA and CA are shown in Figure 10. Yellow and red indicate that the model assigns a higher weight to the region, while green and blue indicate the opposite. The visualization only computes the partial derivatives of the osteosarcoma region in the ground truth. The images show that GCA is more sensitive to the osteosarcoma areas and locates the tumor regions more precisely.
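The weighting step described above can be sketched in a few lines of numpy. Given the feature maps (activations) and the gradients of the target-region score with respect to them, the per-channel gradient mean acts as the channel weight; the function name is illustrative and this is a simplified stand-in for the full Seg-Grad-CAM pipeline.

```python
import numpy as np

def gradcam_heatmap(activations, gradients):
    """Seg-Grad-CAM-style heatmap for arrays of shape (C, H, W): channel
    weights are the spatial means of the gradients of the target-region
    score w.r.t. each feature map."""
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0)                          # keep positive influence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1] for display
    return cam
```

Mapping the resulting [0, 1] heatmap through a blue-to-red colormap yields visualizations like those in Figure 10.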

Compared with CA, our proposed structure can better capture the tumor details and recalibrate the feature map. The DSC, IOU, and number of parameters of DRN, DRN with CA, and DRN with GCA are shown in Table 4. Compared with CA, GCA has only 0.05 M more parameters, yet its DSC and IOU are higher by 0.003 and 0.015, respectively.

Ablation of STLB.
STLB provides an effective method for extracting and exploiting the low-level texture features of the input images. A key component of STLB is the texture enhancement module (TEM). To analyze what TEM learns during training, we visualize the input and output feature maps of the TEM, mapping the grayscale values to colors from blue to red, as shown in Figure 11. Note that through quantization, counting, and texture enhancement, the boundaries and some textures of tissues such as bones, muscles, and osteosarcomas are exploited and sharpened to produce clearer feature maps, which helps PTFEM extract the spatial correlation features between pixels. Table 5 shows the changes in DSC, IOU, and number of parameters before and after adding STLB. Compared with DRN-D-22, the DRN with STLB achieves an increase in DSC and IOU of 0.005 and 0.019, respectively, while the number of parameters increases by only 0.35 M. STLB enhances and exploits the texture details and statistical features of osteosarcoma images, thereby improving performance on low-contrast and blurred tumor segmentation tasks.
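The quantization and counting steps mentioned above can be illustrated with a minimal numpy sketch: intensities are binned into a small number of levels and the occupancy of each level is counted, giving the kind of low-level statistical texture description that TEM then enhances. This is a simplified illustration under our own assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_and_count(gray, n_levels=8):
    """Quantize a grayscale image into n_levels intensity bins and count
    the pixels in each bin (a simplified view of TEM's quantization and
    counting steps)."""
    gray = gray.astype(np.float64)
    lo, hi = gray.min(), gray.max()
    # Map intensities to integer levels 0 .. n_levels-1.
    levels = np.floor((gray - lo) / (hi - lo + 1e-8) * n_levels).astype(int)
    levels = np.clip(levels, 0, n_levels - 1)
    counts = np.bincount(levels.ravel(), minlength=n_levels)  # per-level statistics
    return levels, counts
```

Quantization discards small intensity fluctuations while keeping level transitions, which is why tissue boundaries stand out more sharply in the post-TEM feature maps of Figure 11.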

Ablation of STB.
As mentioned above, STB learns to perform an affine transformation on the prediction map to better locate the tumor area. In this section, we examine the segmentation results before and after the affine transformation, as shown in Figure 12. Column (D) shows the comparison of boundaries before and after STB; a Canny filter is used to extract the boundaries. The red contour is the output boundary before STB, and the blue contour is the boundary after STB. From the changes in DSC, we can conclude that STB performs a beneficial spatial transformation on the input feature map, making the predicted area more accurate. By integrating STB into the model, the spatial shift and deformation caused by down-sampling and up-sampling are corrected, and the DSC and IOU increase by 0.011 and 0.007, respectively.

(Figure 11: Visualization of feature maps before and after TEM.)
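The effect of applying a learned 2x3 affine matrix to a prediction map can be sketched with a tiny nearest-neighbor warp in numpy (spatial transformer layers typically use bilinear sampling for differentiability; nearest-neighbor is used here only to keep the illustration short, and the function name is our own).

```python
import numpy as np

def affine_warp(mask, theta):
    """Warp a 2D prediction map with a 2x3 affine matrix `theta`, using
    nearest-neighbor sampling (a simplified stand-in for STB's sampler).
    For each output pixel (x, y), the source location is theta @ [x, y, 1]."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = theta[0, 0] * xs + theta[0, 1] * ys + theta[0, 2]
    src_y = theta[1, 0] * xs + theta[1, 1] * ys + theta[1, 2]
    sx = np.round(src_x).astype(int)
    sy = np.round(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out[ys[valid], xs[valid]] = mask[sy[valid], sx[valid]]
    return out
```

Correcting the small shifts introduced by down- and up-sampling corresponds to theta being close to the identity matrix with a small learned translation term.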

Conclusion
In this article, we use a novel GCA, STLB, and STB to improve the model's segmentation performance on difficult cases such as complex osteosarcoma boundaries, small tumor areas, and low-contrast images produced by MRI biosensors. We propose BA-GCA Net with these three blocks and employ over 80,000 MRI images of osteosarcoma from the Second Xiangya Hospital in China to train and test the model. To verify the function of each block, we conduct ablation studies; the visual analysis of the results helps us understand how each block works and confirms its effectiveness. The test results show that our proposed BA-GCA Net achieves a DSC of 0.927 and an IOU of 0.880, which is better than other existing models. The number of parameters and the computational cost are only 19.88 M and 149.70 GFLOPs, respectively, which means the model strikes a balance between accuracy and computational consumption. The model can assist doctors in delineating the osteosarcoma area at a relatively low cost, reduce the workload of doctors, and improve the efficiency of diagnosis.
In the future, in view of the difficulty of obtaining clinical data on osteosarcoma, we will introduce few-shot learning into our method, so that the model can achieve similar results with fewer samples. This will help alleviate the insufficient generalization of hospital self-trained models caused by data incompatibility between hospitals, improve the robustness of the model, and reduce training costs.

Data Availability
The data used to support the findings of this study are currently under embargo while the research findings are commercialized. Requests for data, 12 months after publication of this article, will be considered by the corresponding author.

(Figure 12: Visualization of prediction boundaries before and after STB.)