A Multiscale CNN-CRF Framework for Environmental Microorganism Image Segmentation

To assist researchers to identify Environmental Microorganisms (EMs) effectively, a Multiscale CNN-CRF (MSCC) framework for the EM image segmentation is proposed in this paper. There are two parts in this framework: The first is a novel pixel-level segmentation approach, using a newly introduced Convolutional Neural Network (CNN), namely, “mU-Net-B3”, with a dense Conditional Random Field (CRF) postprocessing. The second is a VGG-16 based patch-level segmentation method with a novel “buffer” strategy, which further improves the segmentation quality of the details of the EMs. In the experiment, compared with the state-of-the-art methods on 420 EM images, the proposed MSCC method reduces the memory requirement from 355 MB to 103 MB, improves the overall evaluation indexes (Dice, Jaccard, Recall, Accuracy) from 85.24%, 77.42%, 82.27%, and 96.76% to 87.13%, 79.74%, 87.12%, and 96.91%, respectively, and reduces the volume overlap error from 22.58% to 20.26%. Therefore, the MSCC method shows great potential in the EM segmentation field.


Introduction
Environmental pollution is an extremely serious problem in many countries, and methods to deal with it are constantly being put forward. These methods can be divided into three major categories: chemical, physical, and biological. The biological method is less harmful and highly efficient [1]. Environmental Microorganisms (EMs) are microscopic organisms living in the environment that act as natural decomposers and indicators [2]. For example, Actinophrys can digest the organic waste in sludge and improve the quality of freshwater. Therefore, research on EMs plays a significant role in the management of pollution [3], and the identification of EMs is the basic step for related research.
Generally, there are four traditional types of EM identification strategies. The first is the chemical method, which is highly accurate but often results in secondary pollution from chemical reagents [4]. The second is the physical method, which also has high accuracy but requires expensive equipment [4]. The third is the molecular biological method, which distinguishes EMs by genome sequence analysis [5]. This strategy needs expensive equipment, plenty of time, and professional researchers. The fourth is morphological observation, which needs an experienced operator to observe EMs under a microscope and identify them by their shape characteristics [1]. Hence, each of these traditional methods has its own disadvantages in practical work.
The morphological method has the lowest cost of the above methods, but it is laborious and tedious. Considering that deep learning achieves good performance in many fields of image processing, it can be used to make up for the drawbacks of the traditional morphological method. Thus, we propose a fully automatic system for the EM image segmentation task, which can obtain the EM shape characteristics to help researchers detect and identify EMs effectively. The proposed system has two parts: The first part is a pixel-level segmentation approach based on a novel deep Convolutional Neural Network (CNN), namely, "mU-Net-B3", with a Conditional Random Field (CRF); the second part is a VGG-16 network [6] based patch-level segmentation method. The pixel-level part obtains high-quality segmentation results on most EM images but loses some details and under-segments some images. Therefore, we propose the patch-level part to help the system obtain more details of EMs. Hence, our Multiscale CNN-CRF (MSCC) segmentation system can solve the EM image segmentation task effectively.
In the pixel-level part, mU-Net-B3 with denseCRF is used as the core step for the segmentation task, where mU-Net-B3 is an improved U-Net. Compared with U-Net, it effectively improves the segmentation performance and reduces the memory requirement. Because denseCRF [7] can obtain global information between pixels in an image, it is used as the postprocessing after mU-Net-B3, which further improves the segmentation results. In the patch-level part, the segmentation task is actually a binary classification task. Because of the outstanding classification ability of VGG-16 on ImageNet [6] and the significant performance of transfer learning with a limited training data set, we use the limited EM training data to fine-tune the VGG-16 model pretrained on ImageNet, which provides hundreds of object categories and millions of images [6], in our patch-level part. This approach effectively generates good classification results, from which we reconstruct the patch-level segmentation results. The EM segmentation framework is shown in Figure 1.
In Figure 1, (a) denotes the "Training Images": The training set contains 21 categories of EM images and their corresponding ground truth (GT) images. We unify the image size to 256 × 256 pixels. Considering that colour information is inefficient in EM segmentation [8], these images are converted into grayscale; (b) shows the "Patch-level Training": Images and their corresponding GT images are meshed into patches (8 × 8 pixels). Then, data augmentation is used to balance the patch data. After that, the balanced data are used to fine-tune the pretrained VGG-16 to obtain the classification model; (c) is the "Pixel-level Training": Data augmentation is applied to make up for the lack of data. Then, the data are fed to mU-Net-B3 to obtain the segmentation model; (d) is "Testing Images": The test set only has original images. We convert them into grayscale images for the pixel-level test and into patches for the patch-level test; (e) denotes the "Pixel-level Post-processing": DenseCRF is used to further improve the pixel-level segmentation results; (f) shows "Patch-level Post-processing": The predicted labels of patches are used to reconstruct the patch-level segmentation results. For further optimization, the denseCRF results are used to create buffers that help denoise the patch-level results; (g) is the "Final Results": The denseCRF results and buffer results are combined and plotted in different colours on the original images.
Figure 5: The variety of the object sizes in EM images (panels: Actinophrys, Noctiluca, Rotifera, Colpoda).
The main contributions of this paper are as follows:
(i) We propose a novel automatic approach that segments EM images at both the pixel level and the patch level to assist EM analysis work.
(ii) We propose three different strategies to optimize the original U-Net from the perspective of the receptive field, which improve the segmentation performance.
(iii) The proposed mU-Net-B3 not only improves the segmentation performance but also reduces the memory requirement to less than a third of that of U-Net.
[10] shows a comparison between threshold-based segmentation methods for biofilms. The final result shows that the iterative selection method is superior; in [11], different algorithms based on Otsu thresholding are applied to segment flocs and filaments to enhance the monitoring of activated sludge in wastewater treatment plants. Edge-based methods: A segmentation and classification work is introduced in [12] to identify individual microorganisms from a group of overlapping (touching) bacteria, with Canny used as the basic step of the segmentation part; in [13], to segment large images of zooplankton, a segmentation (based on Active Contour) and preclassification algorithm is used after the acquisition of images. Region-based methods: In [14], the segmentation is performed on gray-level images using the marker-controlled watershed method; in [15], after converting the colour mode and using morphological operations to denoise, a seeded region-growing watershed algorithm is applied for segmentation.
Figure 9: The architecture of mU-Net.

Unsupervised methods: [16] evaluates clustering and threshold segmentation techniques on tissue images containing TB Bacilli. The final result shows that k-means clustering (k = 3) is outstanding; in [17], a comparison between conditional random fields and region-based segmentation methods is presented. The final result shows that these two kinds of methods for microorganism segmentation have an average recognition rate above 80%. Supervised methods: In [18], a segmentation system is designed to monitor the algae in water bodies. Its main idea is to first apply image enhancement (sharpening) using the Retinex filtering technique and then perform segmentation using a support vector machine; in [19], a network for segmentation of Rift Valley virus is proposed. Because the data set is insufficient, data augmentation is used to assist U-Net, which performs the segmentation.
2.2.1. U-Net. U-Net is a convolutional neural network that was initially used for medical image segmentation. The architecture of U-Net is symmetrical. It consists of a contracting path and an expansive path [20]. There are two important contributions of U-Net. The first is the strong use of data augmentation to solve the problem of insufficient training data. The second is its end-to-end structure, which can help the network to retrieve the information from the shallow layers. Owing to its outstanding performance, U-Net is widely used in semantic segmentation. The network structure of U-Net is shown in Figure 2.
2.2.2. Inception. The original Inception, which uses filters of different sizes (1 × 1, 3 × 3, 5 × 5), is proposed in GoogLeNet [22]. Because of these filters, Inception can adapt to objects of various sizes in images. However, using different filters also has disadvantages, for instance, more parameters, overfitting, and vanishing gradients. To reduce these negative effects, Inception-V2 replaces one 5 × 5 convolution filter with two stacked 3 × 3 convolution filters [21]. For further optimization, Inception-V3 replaces an N × N convolution filter with a sequence of a 1 × N convolution filter and an N × 1 convolution filter [21]. Figure 3 shows the 3 × 3 convolution filter replaced by 1 × 3 and 3 × 1 convolution filters. This strategy reduces the parameter count further. Furthermore, because more convolution filters with ReLU are used, the expressiveness is improved.
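To make the parameter savings concrete, the following minimal Python sketch counts the weights of a single convolution and of its factorized replacements. The helper `conv_params` and the channel count `c` are illustrative assumptions (equal input and output channels, biases ignored); they are not values taken from the paper.

```python
def conv_params(kh, kw, c_in, c_out):
    """Number of weights in a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out

c = 64  # hypothetical channel count, chosen only for illustration

# One 5x5 filter versus two stacked 3x3 filters (the Inception-V2 idea).
p_5x5 = conv_params(5, 5, c, c)
p_two_3x3 = 2 * conv_params(3, 3, c, c)

# One 3x3 filter versus a 1x3 filter followed by a 3x1 filter (the Inception-V3 idea).
p_3x3 = conv_params(3, 3, c, c)
p_1x3_3x1 = conv_params(1, 3, c, c) + conv_params(3, 1, c, c)

print(p_two_3x3 / p_5x5)   # ratio 18/25
print(p_1x3_3x1 / p_3x3)   # ratio 6/9
```

Under these assumptions the ratios do not depend on the channel count: two 3 × 3 filters need 18/25 of the weights of one 5 × 5 filter, and the 1 × 3 plus 3 × 1 pair needs 6/9 of the weights of one 3 × 3 filter.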

DenseCRF.
Although CNNs can perform well on pixel-level segmentation, some details are still not segmented perfectly. The main reason is that it is difficult for CNNs to consider the spatial relationships between different pixels during pixel-level segmentation. However, [23] shows that using denseCRF as postprocessing after CNNs can capture these spatial relationships and thereby improve the segmentation results. In [7], the energy function of the denseCRF model is the sum of the unary potential and the pairwise potential, as shown in Eq. (1).
In Eq. (1), x is the label assignment of the pixels. U(x_i) represents the unary potential, which measures the inverse likelihood of pixel i taking the label x_i, and P(x_i, x_j) is the pairwise potential, which measures the cost of assigning labels x_i, x_j to pixels i, j simultaneously [24]. We use Eq. (2) as the unary potential, where L(x_i) is the label assignment probability at pixel i.

BioMed Research International
The pairwise potential is defined in Eq. (3), where ∅(x_i, x_j) is a penalty term on the labelling [25]. As explained in [7], ∅(x_i, x_j) is given by the Potts model. If pixel i and pixel j have the same label, the penalty term is equal to zero, and if not, it is equal to one.
As Eq. (3) shows, each k^(m) is a Gaussian kernel, which depends on the feature vectors f_i, f_j of pixels i, j and is weighted by ω^(m). [7] uses contrast-sensitive two-kernel potentials, defined in terms of the colour vectors I_i and I_j and positions p_i and p_j, as shown in Eq. (4).
The first (appearance) kernel depends on both pixel positions (denoted as p) and pixel colour intensities (denoted as I). The second (smoothness) kernel depends only on pixel positions. The parameters σ_α, σ_β, and σ_ω control the scale of the Gaussian kernels. The first kernel forces pixels with similar colour and position to have similar labels, while the second kernel only considers spatial proximity when enforcing smoothness [23].
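For reference, Eqs. (1) to (4) can be written out as below. This is a sketch that assumes the standard fully connected CRF formulation of [7] and reuses the symbols defined in the surrounding text (U, P, L, ∅, k^(m), ω^(m), σ_α, σ_β, σ_ω); the number of kernels M and the exact typesetting are assumptions.

```latex
\begin{align}
E(x) &= \sum_{i} U(x_i) + \sum_{i<j} P(x_i, x_j), \tag{1}\\
U(x_i) &= -\log L(x_i), \tag{2}\\
P(x_i, x_j) &= \emptyset(x_i, x_j) \sum_{m=1}^{M} \omega^{(m)} k^{(m)}(f_i, f_j), \tag{3}\\
\sum_{m=1}^{2} \omega^{(m)} k^{(m)}(f_i, f_j)
  &= \omega^{(1)} \exp\!\left(-\frac{\lVert p_i - p_j\rVert^{2}}{2\sigma_\alpha^{2}}
                              -\frac{\lVert I_i - I_j\rVert^{2}}{2\sigma_\beta^{2}}\right)
   + \omega^{(2)} \exp\!\left(-\frac{\lVert p_i - p_j\rVert^{2}}{2\sigma_\omega^{2}}\right). \tag{4}
\end{align}
```

With the Potts model, ∅(x_i, x_j) = 0 when x_i = x_j and 1 otherwise, so only pairs of pixels with different labels contribute to the pairwise term.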

VGG-16.
Simonyan and Zisserman propose VGG-16, which not only achieves state-of-the-art accuracy on the ILSVRC 2014 classification and localisation tasks but is also applicable to other image recognition data sets, where it achieves excellent performance even when used as part of relatively simple pipelines [6]. The architecture of VGG-16 is shown in Figure 4.

Multiscale CNN-CRF Model
3.1. Pixel-Level Training. In pixel-level training, our novel multilevel CNN-CRF framework is introduced. Our data set contains objects of many different sizes. As Figure 5 shows, the EM shapes in different categories are completely different. Because the original U-Net has difficulty adapting to this variety, we propose novel methods to improve the adaptability of U-Net.
As the U-Net structure in Figure 2 shows, the receptive field of U-Net is limited. The direct way to improve the adaptability of U-Net is to use convolution filters of different sizes, just as Inception does. We propose BLOCK-I, which incorporates 1 × 1, 3 × 3, 5 × 5, and 7 × 7 convolution filters in parallel, as shown in Figure 6. Although this approach helps the network improve its adaptability, it also increases the number of parameters.
Following Inception-V2 [21], a 5 × 5 convolution filter resembles a sequence of two 3 × 3 convolution filters. Likewise, a 7 × 7 convolution filter can be replaced by a sequence of three 3 × 3 convolution filters. In [26], within a sequence of three 3 × 3 convolution operations, the outputs of the first and second convolution operations are concatenated with the output of the third, and the result resembles the combined result of 3 × 3, 5 × 5, and 7 × 7 convolution operations. Therefore, we apply this concept to optimize BLOCK-I and obtain a novel architecture called BLOCK-II, which is shown in Figure 7. Compared with BLOCK-I, this architecture reduces the number of parameters.
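The equivalence that BLOCK-II relies on can be checked with a short receptive-field calculation. The sketch below assumes stride-1 convolutions without dilation; `stacked_receptive_field` is a hypothetical helper introduced only for this illustration.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field after each layer in a chain of stride-1 convolutions."""
    rf = 1
    fields = []
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by k - 1 pixels
        fields.append(rf)
    return fields

fields = stacked_receptive_field([3, 3, 3])
print(fields)  # [3, 5, 7]
```

The outputs of the first, second, and third 3 × 3 convolutions therefore see 3 × 3, 5 × 5, and 7 × 7 neighbourhoods, which is why concatenating them resembles running the three filter sizes in parallel.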
Although BLOCK-II has far fewer parameters than BLOCK-I, there is still room for improvement in this architecture. As mentioned for Inception-V3, a 3 × 3 convolution filter can also be replaced by a sequence of 1 × 3 and 3 × 1 convolution filters. We apply this concept in BLOCK-III, which is shown in Figure 8. The experiments show that this approach effectively reduces the memory requirement while achieving good segmentation results.
Finally, we provide the whole architecture of our network mU-Net in Figure 9. Because BLOCK-III has the lowest memory requirement, we deploy BLOCK-III in the mU-Net architecture in our final method. Besides, we add a batch normalization layer [27] after each convolution layer and convolution transpose layer. For short, mU-Net with BLOCK-X is abbreviated as "mU-Net-BX". The details of mU-Net-BXs are provided in Table 1. The details of hyperparameters used in the pixel-level training process are provided in the following subsection: Pixel-level Implementation Details.
3.2. Patch-Level Training. In our patch-level training, we use our data set to fine-tune the VGG-16 [6], which is pretrained on a large-scale image data set ImageNet [28,29].

Fine-Tune Pretrained VGG-16.
It has been shown in [30] that VGG-16 pretrained on ImageNet can be useful for classification tasks through transfer learning and fine-tuning. In our framework, the patch-level segmentation is actually a classification task.
To fine-tune the pretrained model, we mesh the training EM images into patches of 8 × 8 pixels. Examples are shown in Figure 10. There are two reasons for using patches of 8 × 8 pixels. First, all the EM images are resized to 256 × 256 pixels, and 256 is evenly divisible only by 2, 4, 8, 16, 32, 64, 128, and 256, so the patch side must be one of these values. Second, patches that are too large or too small are not useful for the patch-level segmentation, because small patches cannot capture the details of EMs and large patches produce coarse segmentation results. We provide some examples of patches of different sizes in the original EM images in Figure 11. As we can see, patches of 2 × 2 and 4 × 4 pixels are too small to cover the details of EMs, and patches of 16 × 16 pixels are too large for the images.
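The meshing step can be sketched with a NumPy reshape. This is a minimal illustration, assuming a single-channel image whose side lengths are multiples of the patch size; `mesh_into_patches` is a hypothetical helper, and the synthetic array only stands in for a grayscale EM image.

```python
import numpy as np

def mesh_into_patches(image, patch=8):
    """Split an H x W image into non-overlapping patch x patch tiles (row-major order)."""
    h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows * cols, patch, patch)
    tiles = image.reshape(rows, patch, cols, patch).swapaxes(1, 2)
    return tiles.reshape(rows * cols, patch, patch)

# Synthetic stand-in for a 256 x 256 grayscale EM image.
image = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
patches = mesh_into_patches(image)
print(patches.shape)  # (1024, 8, 8)
```

A 256 × 256 image yields 32 × 32 = 1024 patches of 8 × 8 pixels, and patch index r * 32 + c corresponds to the tile in row r and column c of the mesh.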
After that, we divide these patches into two categories: (With Object) and (Without Object). The criterion for dividing is the area of the object in each patch. If the area is more than half of the patch, we will give the label of (With Object) to the patch. If not, the label will be (Without Object).
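The labelling rule can be sketched as follows. The sketch assumes that the GT patches are binary arrays with 1 for foreground; `label_patches` and its `threshold` argument are hypothetical names, and the strict inequality reflects the wording "more than half".

```python
import numpy as np

def label_patches(gt_patches, threshold=0.5):
    """Return 1 (With Object) where the foreground covers more than `threshold` of a patch, else 0."""
    # gt_patches: array of shape (N, h, w) with 1 = foreground, 0 = background
    fraction = gt_patches.reshape(len(gt_patches), -1).mean(axis=1)
    return (fraction > threshold).astype(np.int64)

gt = np.zeros((3, 8, 8), dtype=np.uint8)
gt[0, :5, :] = 1  # 40 of 64 pixels are foreground -> With Object
gt[1, :4, :] = 1  # exactly half of the patch -> Without Object
# gt[2] stays empty -> Without Object
labels = label_patches(gt)
print(labels)  # [1 0 0]
```

Passing `threshold=0.25` or `threshold=0.75` gives the alternative criteria that the evaluation section compares against the half-area rule.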
Finally, we apply data augmentation to make the number of patches in two categories balanced, and use balanced data to train a classification model through fine-tuning the pretrained VGG-16. As we can see from Figure 4, the VGG-16 is pretrained by ImageNet. The pretrained model can be downloaded from Keras [31] directly. Before fine-tuning the pretrained VGG-16, we freeze the parameters of the pretrained model. After that, we use the balanced patch-level data to fine-tune the dense layers of VGG-16. The details of hyperparameters used in the patch-level training process are provided in the following subsection: Patch-level Implementation Details.
3.3. Pixel-Level Postprocessing. In our pixel-level segmentation, after getting the segmentation results from mU-Net-B3, we convert the results into binary images, where the foreground is marked as 1 (white) and the background is marked as 0 (black), and use these binary images as the initial matrices of denseCRF. It can effectively obtain the global information of images to optimize the segmentation results.
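One way to turn such a binary image into unary potentials of the form -log L(x_i) is sketched below. The paper does not state how the binary image is converted, so the soft `confidence` value of 0.9 and the helper `unary_from_binary` are assumptions made purely for illustration.

```python
import numpy as np

def unary_from_binary(mask, confidence=0.9):
    """Unary energies -log(p) for the labels (background, foreground) from a binary mask."""
    mask = mask.astype(bool)
    # Assumed soft probabilities: a pixel keeps its binary label with probability `confidence`.
    p_fg = np.where(mask, confidence, 1.0 - confidence)
    probs = np.stack([1.0 - p_fg, p_fg], axis=0)  # shape (2, H, W)
    return -np.log(probs)

mask = np.array([[1, 0], [0, 1]])
unary = unary_from_binary(mask)
print(unary.shape)  # (2, 2, 2)
```

A pixel marked as foreground receives a low energy for the foreground label and a high energy for the background label, so the CRF inference starts from the mU-Net-B3 result and changes a label only where the pairwise term outweighs this cost.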

Patch-Level Postprocessing.
In our patch-level segmentation, we use the predicted labels generated by VGG-16 to reconstruct the segmentation results. To remove the useless portions of the patch-level segmentation results, we build buffers from the pixel-level postprocessing (denseCRF) results. The process is shown in Figure 12. The buffers are made by applying a dilation operation to the denseCRF results. After that, we use these images as weight matrices and apply them to the patch-level results. Only the patch-level segmentation results inside the buffers are retained, and the segmentation results outside the buffers are erased. This approach effectively helps to denoise.
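The buffer strategy can be sketched with a plain NumPy dilation. The square structuring element and the helper names `dilate` and `apply_buffer` are assumptions for illustration; the paper specifies only that the denseCRF result is dilated and then used to mask the patch-level result.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2 * radius + 1) x (2 * radius + 1) square structuring element."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]  # OR together all shifted copies
    return out

def apply_buffer(patch_mask, crf_mask, buffer_size):
    """Keep only the patch-level foreground that lies inside the dilated denseCRF result."""
    buffer = dilate(crf_mask, buffer_size)
    return patch_mask.astype(bool) & buffer

crf_mask = np.zeros((16, 16), dtype=bool)
crf_mask[8, 8] = True            # pixel-level (denseCRF) foreground
patch_mask = np.zeros((16, 16), dtype=bool)
patch_mask[8, 10] = True         # close to the object: inside a 2-pixel buffer
patch_mask[0, 0] = True          # far from the object: treated as noise
kept = apply_buffer(patch_mask, crf_mask, buffer_size=2)
print(kept[8, 10], kept[0, 0])   # True False
```

In this toy case the patch-level pixel next to the object survives, while the isolated pixel in the corner is erased.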

Segmentation Results Fusion and Presentation.
After obtaining the pixel-level and patch-level segmentation results, the final segmentation results are generated by combining these two kinds of results. For the convenience of observation, the pixel-level and patch-level segmentation results are plotted on the original images as masks of different colours. The pixel-level masks are red, the patch-level masks are fluorescent green, and the parts where the pixel-level and patch-level segmentation results overlap are yellow. Examples are shown in Figure 13.
4.1.4. Pixel-Level Implementation Details. In our pixel-level segmentation, the task is to predict whether each individual pixel belongs to the foreground or the background. This task can be seen as a pixel-level binary classification problem. Hence, we simply take the binary cross-entropy function as the loss function of the network and minimize it [26]. We use the Adam optimizer with a learning rate of 1.5 × 10⁻⁴ and train the models for 50 epochs. As the average training loss and accuracy curves in Figure 15 show, the loss and accuracy curves of training and validation tend to level off after 30-35 epochs. Therefore, considering the computing performance of the workstation, we finally set 50 epochs for training.
4.1.5. Patch-Level Implementation Details. In our patch-level training process, we employ the pretrained VGG-16 as the core and fine-tune the dense layers of VGG-16. As Figure 4 shows, the last layer is softmax. The categorical cross-entropy function is the loss function of choice for softmax output units. Besides, the Adam optimizer with a learning rate of 1.0 × 10⁻⁴ is used in VGG-16
[3], Recall and Accuracy are used to measure the segmentation results.
Besides that, we employ Dice, Jaccard, and VOE (volumetric overlap error) to evaluate the segmentation results in this paper [34]. The definitions of these evaluation metrics are provided in Table 2. V_pred represents the foreground that is predicted by the model. V_gt represents the foreground in a ground truth image. For the first four metrics (Dice, Jaccard, Recall, and Accuracy), higher values indicate better segmentation results. In contrast, for the final metric (VOE), a lower value indicates a better segmentation result.
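The five metrics can be computed from two binary masks as sketched below. The sketch assumes the common set-based definitions, with VOE equal to 1 minus Jaccard, which agrees with the Jaccard and VOE values reported in this paper summing to 100%. `segmentation_metrics` is a hypothetical helper, and it assumes that neither mask is empty.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Dice, Jaccard, Recall, Accuracy, and VOE for two binary masks of equal shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # foreground predicted as foreground
    fp = np.sum(pred & ~gt)     # background predicted as foreground
    fn = np.sum(~pred & gt)     # foreground predicted as background
    tn = np.sum(~pred & ~gt)    # background predicted as background
    jaccard = tp / (tp + fp + fn)
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),
        "jaccard": jaccard,
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "voe": 1 - jaccard,
    }

gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True        # 4 foreground pixels (V_gt)
pred = np.zeros((4, 4), dtype=bool)
pred[1:3, 2:4] = True      # 4 predicted pixels (V_pred), 2 of them correct
m = segmentation_metrics(pred, gt)
print(m)
```

In this toy case the prediction overlaps half of the object, which gives Dice 0.5, Jaccard 1/3, Recall 0.5, Accuracy 0.75, and VOE 2/3.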

Evaluation of Pixel-Level Segmentation.
Because the pixel-level segmentation methods are discussed above, we mainly introduce comparisons between U-Net [20], the models we proposed, the existing segmentation methods mentioned in Related Works, and the segmentation result of our previous work [3] in this section.

Evaluation of Different BLOCKs.
In this part, we make comparisons between different mU-Net-BXs and U-Net on memory requirement, time requirement, and segmentation performance.
Memory Requirement: The memory requirements of U-Net and mU-Net-BXs are provided in Table 3. As we can see, the memory requirements of U-Net, mU-Net-B1, mU-Net-B2, and mU-Net-B3 are 355 MB, 407 MB, 136 MB, and 103 MB, respectively. Obviously, mU-Net-B3 has the lowest memory requirement.
Time Requirement: For 840 training images and 210 testing images, the time requirements of U-Net and these improved models, which include training and average testing
Figure 1, the evaluation indexes of all improved models are provided with denseCRF as the postprocessing. The overall segmentation performance of U-Net and these improved models is shown in Figure 16. The improved models perform better than U-Net. Compared with U-Net, the average Dice values of all the improved models are increased by more than 1.8%, and in particular, the improvements of mU-Net-B1 and mU-Net-B2 are more than 2%. The average Jaccard values of mU-Net-B1, mU-Net-B2, and mU-Net-B3 improve by 2.89%, 2.75%, and 2.32%, respectively. Likewise, the improvements of the average Recall values made by these improved models are 4.98%, 4.91%, and 4.85%, respectively, and the improvements of the average Accuracy values are 0.65%, 0.34%, and 0.15%, respectively. The average VOE values of the improved models are reduced by 2.89%, 2.75%, and 2.32%, respectively.
Summary: From the above, we can find that all the improved models achieve better segmentation performance than U-Net. Compared with mU-Net-B1 and mU-Net-B2, mU-Net-B3 has the lowest memory requirement, a relatively low time requirement, and similar performance, so it has great potential in the EM image segmentation work.
After evaluating the overall performance of these methods, we also provide the detailed indexes and segmentation result examples of each category of EM under these methods in Table 5 and Figure 17, respectively.

Comparison with Other Methods.
In this part, we conduct some comparative experiments on the segmentation of EMs. We mainly adopt some representative segmentation methods mentioned in Related Works, including Otsu, Canny, Watershed, MRF, and k-means. Because the raw results of these methods are often insufficient, some postprocessing is needed. To show these methods at their best, we uniformly apply the same postprocessing operations to all of them. To evaluate the overall performance of these methods, we provide their average evaluation indexes in Figure 18.
From Figure 18, we can find that none of these methods performs as well as the proposed methods. However, the Recall values in Figure 18 are higher than the Recall values in Figure 16. This is because some of the segmentation results generated by these methods assign a lot of background to the foreground. From Table 2, we can see that as long as the foreground in the segmentation result contains the entire real foreground in the GT image, the value of Recall is 1, regardless of whether oversegmentation occurs. Therefore, we should not judge the segmentation results by Recall alone.
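A toy example makes this limitation of Recall concrete. The sketch below assumes Recall = |V_pred ∩ V_gt| / |V_gt| and Jaccard = |V_pred ∩ V_gt| / |V_pred ∪ V_gt|; the 10 × 10 masks are synthetic and chosen only for illustration.

```python
import numpy as np

gt = np.zeros((10, 10), dtype=bool)
gt[4:6, 4:6] = True                    # 4 real foreground pixels
pred = np.ones((10, 10), dtype=bool)   # extreme oversegmentation: everything is foreground

tp = np.sum(pred & gt)
recall = tp / np.sum(gt)               # 4 / 4
jaccard = tp / np.sum(pred | gt)       # 4 / 100
print(recall, jaccard)                 # 1.0 0.04
```

Recall is perfect even though 96 background pixels are mislabelled, whereas Jaccard (and hence Dice and VOE) exposes the oversegmentation.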
To better observe the performance of these methods, we provide the detailed indexes of the segmentation results of each category of EM under these methods in Table 6. Besides, we also provide examples of the segmentation results under these methods in Figure 19.

Comparison with our Previous Work.
In our previous work [3], the EMDS-4 data set we used contains only 20 categories. The 17th category (Gymnodinium), which is used in this paper, is excluded from our previous work. Besides, we only use Average Recall and Overall Accuracy to evaluate the segmentation performance in our previous work. Therefore, we provide the evaluation indexes of the segmentation results obtained by mU-Net-B3 with denseCRF without the 17th category. Furthermore, in our previous work, there are [23] (denseCRForg), and fully convolutional network (FCN). We provide the Average Recall and Overall Accuracy values of mU-Net-B3 with denseCRF as postprocessing and of our previous models in Figure 20. Figure 20 shows that, compared with the previous models, the Average Recall is improved by more than 7% and the Overall Accuracy is increased by at least 1%. Therefore, the mU-Net-B3 with denseCRF proposed in this paper performs better than the models in our previous work.

Evaluation of Patch-Level Segmentation.
Although mU-Net-B3 with denseCRF performs well on the segmentation task for most categories of EM, it still has some shortcomings. For example, as the results for Colpoda in Figure 21 show, mU-Net-B3 is not able to segment the whole object, which leads to an undersegmentation result. Therefore, we use patch-level segmentation to make up for this shortcoming.
4.3.1. The Criterion for Assigning the Labels. In this part, we mainly discuss the criterion for assigning the labels to the patches in the training and validation data sets and the determination of the buffer size. As mentioned above, we divide the patches into two categories: (With Object) and (Without Object). The criterion for assigning these two labels is whether the area of the object is more than half of the total area of the patch. There are two reasons for using the half area as the criterion. The first reason is that when we choose 0.25 area and 0.75 area as the criteria, the results do not differ much, because the number of patches in the two categories varies very little across these three criteria. We provide detailed numbers of patches in the two categories under different criteria in Table 7. It means that most patches that contain objects are divided into (With Object). The second reason is that it shows the lowest loss and the highest accuracy on the validation data set when compared with 0.25
Figure 22, we can find that the patch-level segmentation results contain a lot of noise around the objects we need to segment. We only want to retain the useful parts of the patch-level segmentation results and remove the useless parts. The direct way is to establish buffers near the pixel-level segmentation results. The challenge is how to set the size of the buffer. Our solution is to combine the patch-level segmentation results under different buffer size settings with the pixel-level segmentation results and to compare the combined results with the GT images, so that the final buffer size is determined by the evaluation indexes. To this end, we compare buffers of different sizes, starting with a buffer size of 2 pixels and increasing it in steps of 2 pixels up to 40 pixels.
After that, the patch-level segmentation results after different buffer processing are combined with the pixel-level segmentation results. Finally, the combined results are compared with GT images to obtain relevant evaluation indexes, which are shown in Figure 23. We determine the buffer area size corresponding to the intersection point of Accuracy and Recall in

Evaluation of Combined Segmentation Results.
To better observe the advantages of combining patch-level segmentation with pixel-level segmentation, we provide some examples and their corresponding evaluation indexes in Figures 21 and 25, respectively. We can find that patch-level segmentation effectively compensates for the shortcomings of pixel-level segmentation.

Segmentation Result Fusion and Presentation.
Finally, we provide the combined results of the patch-level and pixel-level segmentation in Figure 13. The yellow parts in the images are the overlapping areas of the patch-level segmentation results (fluorescent green parts) and pixel-level segmentation results (red parts). The purple outlines plotted on the images mark the object boundaries in the GT images.
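The colour-coded presentation can be sketched as a simple overlay. The pure red and pure green colours, the blending weight `alpha`, and the helper `overlay_masks` are assumptions for illustration; the GT outline is omitted for brevity.

```python
import numpy as np

def overlay_masks(gray, pixel_mask, patch_mask, alpha=0.5):
    """Blend red (pixel-level) and green (patch-level) masks onto a grayscale image."""
    rgb = np.stack([gray] * 3, axis=-1).astype(np.float32)
    colour = np.zeros_like(rgb)
    colour[..., 0] = np.where(pixel_mask, 255.0, 0.0)  # red channel: pixel-level result
    colour[..., 1] = np.where(patch_mask, 255.0, 0.0)  # green channel: patch-level result
    # Where both masks are set, red + green gives yellow.
    covered = (pixel_mask | patch_mask)[..., None]
    blended = np.where(covered, (1 - alpha) * rgb + alpha * colour, rgb)
    return blended.astype(np.uint8)

gray = np.zeros((4, 4), dtype=np.uint8)
pixel_mask = np.zeros((4, 4), dtype=bool)
patch_mask = np.zeros((4, 4), dtype=bool)
pixel_mask[0, 0:2] = True   # pixel-level foreground
patch_mask[0, 1:3] = True   # patch-level foreground, overlapping at (0, 1)
out = overlay_masks(gray, pixel_mask, patch_mask, alpha=1.0)
print(out[0, 0], out[0, 1], out[0, 2])  # [255   0   0] [255 255   0] [  0 255   0]
```

With `alpha=1.0` the masks are drawn opaquely; smaller values keep the underlying image visible through the masks.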

Conclusion and Future Work
In this paper, we propose a multilevel segmentation method for the EM segmentation task, which includes pixel-level segmentation and patch-level segmentation. In our pixel-level segmentation, we propose mU-Net-B3 with denseCRF for EM segmentation. It mainly uses the idea of Inception and concatenate operations to reduce the memory requirement. Besides, it also uses denseCRF to obtain global information to further optimize the segmentation results. The proposed method not only performs better than U-Net but also reduces the memory requirement from 355 MB to 103 MB. In the evaluation of the segmentation results generated by this proposed method, the values of the evaluation indexes Dice, Jaccard, Recall, Accuracy, and VOE (volume overlap error) are 87.13%, 79.74%, 87.12%, 96.91%, and 20.26%, respectively. Compared with U-Net, the first four indexes are improved by 1.89%, 2.32%, 4.85%, and 0.15%, respectively, and the last index is decreased by 2.32%. Besides, compared with our previous methods in [3], the segmentation performance is significantly improved, and the details of the indexes are shown in Figure 20.
Since the method used in pixel-level segmentation cannot segment some details in the image, we use patch-level segmentation to compensate for it. In the patch-level segmentation, we use transfer learning, that is, we use our data to fine-tune the pretrained VGG-16, to perform the patch-level segmentation task. We can find from Figure 13 that the patch-level segmentation effectively helps the pixel-level segmentation to cover more details.
In our future work, we plan to increase the amount of data in the data set to improve the performance. Meanwhile, we have not optimized the time requirement in pixel-level segmentation yet, but we will adjust the relevant parameters to reduce the time requirement.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.