Automatic Liver Segmentation in CT Images with Enhanced GAN and Mask Region-Based CNN Architectures

Liver image segmentation has been increasingly employed for key medical purposes, including liver functional assessment, disease diagnosis, and treatment. In this work, we introduce a liver image segmentation method based on generative adversarial networks (GANs) and mask region-based convolutional neural networks (Mask R-CNN). Firstly, since most CT images have noisy features, we explored the combination of Mask R-CNN and GANs in order to enhance the pixel-wise classification. Secondly, k-means clustering was used to lock the image aspect ratio, in order to get more essential anchors which can help boost the segmentation performance. Finally, the proposed GAN Mask R-CNN algorithm achieved superior performance in comparison with the conventional Mask R-CNN and the Mask R-CNN with k-means clustering in terms of the Dice similarity coefficient (DSC) and the MICCAI metrics. The proposed algorithm also achieved superior performance in comparison with ten state-of-the-art algorithms in terms of six binary classification indicators. We hope that our work can be effectively used to optimize the segmentation and classification of liver anomalies.


Introduction
Segmentation of computed-tomography (CT) liver images is currently a standard technique for computer-assisted diagnosis and therapy, which enjoys the advantages of high accessibility, acceptable acquisition time, and good spatial resolution [1,2]. The correct estimation of the liver volume (technically referred to as liver volumetry) is an essential prelude to hepatic surgery and must precede any major hepatectomy or liver transplantation [3,4]. Currently, there is a growing interest in performing liver volumetry in order to cope with the recent increase in extended hepatectomies, split-liver transplantations, and liver transplantations from living donors [5]. Liver volumetry has been partially or fully automated so as to improve repeatability and accuracy as well as to reduce processing times [6]. However, liver segmentation suffers from a variety of problems and difficulties [7]. Firstly, the segmentation performance is typically degraded by the influence of complex surrounding blood vessels and organs [8]. Also, the liver shape shows high variability across different sections in the same set of CT images [9,10]. In addition, the density of the liver tissues is highly similar to the densities of many other types of soft tissues in the abdominal cavity [11]. Moreover, medical CT imaging often produces images of low contrast and uneven grey scales, making it difficult to accurately segment liver images [12]. In short, liver segmentation in CT images remains a major challenge, as conventional methods can hardly achieve the desired outcomes. Numerous deep-learning liver segmentation methods have been proposed in order to alleviate or partially solve the above problems [13,14]. Such methods can help the radiology staff to further improve liver disease diagnosis, achieve timely detection and treatment, and reduce the death risk due to liver cancer [15].
Although automated segmentation methods have been frequently proposed, they have not necessarily been adopted in routine clinical use [16]. The medical community's slow adoption of automation is believed to arise from limitations in clinical validation rather than from a lack of technical ingenuity. Most CT images contain fuzzy or noisy features, which can lead to a substantial reduction in segmentation accuracy [17,18]. An effective automated segmentation method should be equipped with a validation framework that encompasses the following components: (a) employment of a valid reference standard; (b) validation datasets that reflect actual clinical practice; (c) clear metrics that assess and measure the segmentation precision, accuracy, efficiency, and error; and (d) a comparison of the aforementioned metrics via agreed-upon effective statistical tools.
We further explored the combination of Mask R-CNN and GANs to enhance the pixel-wise classification performance. Also, k-means clustering was used to lock the image aspect ratio in order to get more essential anchors which can help boost the liver segmentation performance in CT images. Indeed, the segmentation process is quite challenging and crucial due to the fuzziness of the liver pixel boundaries, the highly similar intensity patterns of the liver and its neighbouring organs, the high noise levels, and the large variations in tumor shape and appearance. Therefore, liver segmentation in CT images should be effectively performed before other tasks of target measurement, detection, and recognition. A GAN [19,20] is integrated into the mask region-based convolutional neural network (Mask R-CNN) architecture in order to create a new GAN Mask R-CNN framework that boosts the liver segmentation performance in CT images [21]. Our framework makes four key contributions: (1) we explored pixel-wise classification enhancements through the combination of Mask R-CNN and GANs, augmentation of the Mask R-CNN training data, and exploitation of the generated synthetic data; (2) we used k-means clustering to lock the image aspect ratio in order to get more key anchors which can help get better segmentation results; (3) the performance of the proposed framework was compared against that of the conventional Mask R-CNN algorithm in terms of the Dice similarity coefficient (DSC), volume overlap error (VOE), relative volume difference (RVD), average symmetric surface distance (ASSD), root-mean-square symmetric surface distance (RMSD), and maximum symmetric surface distance (MSSD); and (4) our proposed GAN Mask R-CNN achieved superior performance in comparison with ten state-of-the-art algorithms, based on six other indicators: the overall accuracy, sensitivity, specificity, precision, false discovery rate (FDR), and false omission rate (FOR).

Related Work
Accurate diagnosis is highly required for liver therapy planning, liver size evaluation, and optimal clinical decision-making. Medical CT imaging provides accurate anatomical information for the human abdominal organs, which supports liver segmentation and disease diagnosis [22]. Also, liver anatomy visualization and segmentation from CT scans provide significant guidance for liver surgery planning. For CT-based clinical diagnosis of liver diseases, reliable and accurate liver segmentation and identification of surrounding anatomical structures are crucial for subsequent treatment planning and computer-assisted surgery. However, in the current clinical practice, radiologists still manually delineate the liver on each CT slice in order to achieve the most accurate segmentation results; this manual process is quite time consuming, tedious, and laborious and also leads to significant intraobserver differences. In addition, liver segmentation is a challenging task due to the boundary blurring, low contrast, and uneven intensities in liver CT images. Therefore, over the past decade, numerous studies have demonstrated algorithms for liver image segmentation in clinical practice, with varying degrees of effectiveness, robustness, and accuracy. Depending on whether user interaction is required for liver segmentation, these methods can be broadly divided into two categories: automatic and semiautomatic methods [23]. Moreover, the use of automatic segmentation in clinical applications requires evaluating and comparing the accuracies of different segmentation models. Recently, deep learning approaches have been employed to automatically obtain the most suitable segmentation model from given training data. Remarkable performance outcomes were achieved by these approaches through creating multiple levels of abstraction and descriptive embedding in a hierarchy of increasingly complex features [24]. For example, a semisupervised CNN was designed by Liu et al. to significantly decrease the requirement for labelled training data [25]. Multiple sparse regression models (such as deep ensemble sparse regression networks) were employed by Suk and Shen for clinical decision-making tasks [26]. A parameter-efficient CNN was designed by Spasov et al. for performing 3D separable convolution, combining specific layers and dual learning, and hence predicting the transition from mild cognitive impairment (MCI) to Alzheimer's disease within 3 years [27]. A semisupervised graph convolutional network was trained by Parisot et al. on node subsets labelled with diagnostic outcomes for representing and processing sparse clinical data [28]. For five organ structures, Lustberg et al. compared automated atlas-based contour generation results with those obtained using a commercial deep learning module [29]. In addition, Ahn et al. used a fusion-based U-Net model for medical image segmentation. This model was employed to evaluate the clinical feasibility of an open-source deep-learning framework trained on the data of 70 patients with liver cancer and also to compare the performance of this framework with that of another commercially available atlas-based automatic segmentation framework [30].
In our work, we further explore the combination of Mask R-CNN and GANs to enhance the pixel-wise classification performance. Similar approaches have been proposed in earlier studies. For example, Frid-Adar et al. [31] generated synthetic medical images using generative adversarial networks (GANs) and used the generated images to improve the CNN-based classification performance for medical images.

BioMed Research International
In order to boost the lung nodule detection sensitivity in CT images, Han et al. [32] naturally placed realistic and diverse lung nodules in CT images using a 3D multiconditional GAN (MCGAN). The 3D CNN-based detection outcomes showed higher sensitivity for any nodule size or attenuation at a fixed false-positive rate (FPR). Indeed, the medical data scarcity was overcome by the MCGAN-generated lung nodules. Furthermore, unsupervised learning is also widely used in image segmentation. For example, an intelligent framework was proposed by Rundo et al. [33], where robust tools for validating radiomics biomarkers were provided for seamless integration into clinical research environments. In particular, this framework optimized the segmentation for each individual image while also taking into account prior domain knowledge of the typical densities of candidate subregions. The automation of this approach allows for easy deployment in clinical research environments, without the need for any training data. Anter and Hassenian [34] proposed an improved approach for liver segmentation in CT images based on a fast fuzzy c-means clustering algorithm (FFCM), neutrosophic sets (NS), and a watershed algorithm. In order to increase the CT image contrast and remove high frequencies, histogram equalization and median filtering were used. An unsupervised medical anomaly detection GAN (MAD-GAN) method was proposed by Han et al. [35]. In this novel two-step method, GANs are first used to reconstruct multiple adjacent magnetic resonance imaging (MRI) slices of the brain, and then brain diseases are diagnosed and staged based on multisequence structural MRI data. Also, Nakao [36] proposed an unsupervised anomaly detection method based on variational autoencoders (VAE-GAN) and demonstrated its ability to detect various lesions using a large chest radiograph dataset.
Unlike the widely used supervised methods for computer-aided diagnosis or detection in chest radiographs, the VAE-GAN-based unsupervised method can detect lesions of any type and does not require any abnormal image samples or lesion labels for training.

The Original Mask R-CNN.
The Mask R-CNN architecture is an enhanced variant of the R-CNN, Fast R-CNN [37], and Faster R-CNN [38] architectures. In particular, the Mask R-CNN was immediately preceded by the Faster R-CNN. The Mask R-CNN represents a conceptually simple, general, and flexible framework for object detection and segmentation, where high-quality segmentation masks are simultaneously generated for image instances. While the Faster R-CNN has two branches for classification and bounding box regression, the Mask R-CNN has an additional third branch for segmentation mask prediction on each region of interest (RoI). This mask branch is a small fully convolutional network (FCN) which acts upon each RoI to perform pixel-wise segmentation mask prediction. The Mask R-CNN can be easily trained and incurs only a small computational overhead in comparison to the Faster R-CNN. Figure 1 outlines the conventional Mask R-CNN framework for image segmentation. While the Mask R-CNN and Faster R-CNN have similar workflows, they still have some key differences. On the one hand, the Faster R-CNN suffers from spatial information loss and hence exhibits less accurate feature extraction and RoI detection. On the other hand, the Mask R-CNN uses a region proposal network (RPN) for candidate region generation, followed by tight bounding-box localization and classification. Also, the Faster R-CNN employs the RoIPool method for RoI quantization and the handling of multiscale RoI features through max pooling. The Mask R-CNN replaces the RoIPool layer of the Faster R-CNN with an RoI alignment (RoIAlign) layer, which avoids quantization and preserves spatial correspondence for the mask-labelled object area.
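To make the RoIAlign idea concrete, the following minimal numpy sketch (single-channel, with illustrative function names, not the paper's implementation) pools an RoI into a fixed grid by averaging bilinearly interpolated sample points in each output bin, avoiding the coordinate quantization of RoIPool:

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a single-channel feature map at (y, x)."""
    H, W = feat.shape
    y0 = min(max(int(np.floor(y)), 0), H - 2)
    x0 = min(max(int(np.floor(x)), 0), W - 2)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x0 + 1] * (1 - dy) * dx +
            feat[y0 + 1, x0] * dy * (1 - dx) +
            feat[y0 + 1, x0 + 1] * dy * dx)

def roi_align(feat, box, out_size=2, samples=2):
    """Pool an RoI box = (y0, x0, y1, x1), given in feature-map coordinates,
    into an out_size x out_size grid. Each output bin averages samples**2
    bilinearly interpolated points, so no coordinate rounding occurs."""
    y0, x0, y1, x1 = box
    bin_h = (y1 - y0) / out_size
    bin_w = (x1 - x0) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            acc = 0.0
            for si in range(samples):
                for sj in range(samples):
                    # regularly spaced sample points inside bin (i, j)
                    y = y0 + (i + (si + 0.5) / samples) * bin_h
                    x = x0 + (j + (sj + 0.5) / samples) * bin_w
                    acc += bilinear(feat, y, x)
            out[i, j] = acc / samples ** 2
    return out
```

Because the sample coordinates are never rounded, spatial correspondence between the RoI and the pooled features is preserved, which matters for pixel-wise mask prediction.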
The Mask R-CNN can be employed for multitask learning, with the loss function formulated as

L = L_cls + L_bbox + L_mask,

where L_cls is the target classification loss, L_bbox is the regression loss for the target bounding box, and L_mask is the target segmentation loss, which is the term added for the segmentation task in comparison with the traditional detection networks.
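As a minimal sketch, the multitask loss is simply the sum of the three branch losses; the optional weight parameters are a hypothetical generalization, not part of the formulation in the text:

```python
def mask_rcnn_loss(l_cls, l_bbox, l_mask, w_cls=1.0, w_bbox=1.0, w_mask=1.0):
    """Multitask loss L = L_cls + L_bbox + L_mask. The weights (hypothetical)
    default to the equally weighted sum described in the text."""
    return w_cls * l_cls + w_bbox * l_bbox + w_mask * l_mask
```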

Improved Mask R-CNN.
In this paper, we strive to accelerate the convergence and improve the localization precision for liver detection and segmentation. We achieve these goals by analyzing the distribution of the aspect ratios of the liver images via k-means clustering. In the network training stage, as more training iterations are performed, the network learns the global liver characteristics in the CT images, the prediction box parameters are progressively adjusted, and the ground-truth boxes are finally approached. In order to accelerate the convergence speed and improve the liver localization accuracy, the liver height and width characteristics are analyzed in the CT images, and k-means clustering is applied to the height and width data. The clustering algorithm measures the distance between patterns using the Euclidean distance and uses the resulting cluster centers as bounding-box anchors, with each ground-truth box assigned to its closest anchor. This process is repeated until the anchors reach a prespecified number. Figure 2 shows the framework of RoIAlign with k-means clustering. Figure 3 shows the proposed GAN Mask R-CNN architecture for liver image segmentation.
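The anchor selection step can be sketched as follows, assuming ground-truth box (width, height) pairs as input; this is an illustrative implementation, not the authors' exact code:

```python
import numpy as np

def kmeans_anchors(boxes_wh, k, iters=50, seed=0):
    """Cluster (width, height) pairs with Euclidean k-means to pick k anchors.
    boxes_wh: (N, 2) array of ground-truth bounding-box sizes."""
    rng = np.random.default_rng(seed)
    centers = boxes_wh[rng.choice(len(boxes_wh), k, replace=False)]
    for _ in range(iters):
        # assign each box to its nearest anchor (Euclidean distance)
        d = np.linalg.norm(boxes_wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each anchor to the mean of its assigned boxes
        new_centers = np.array([boxes_wh[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```

The k cluster centers then serve as anchor sizes for the region proposal network.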
When the noise z is sampled from the latent space and fed into the generative model G, a sample x = G(z) is generated. For neural-network generators, the probability distribution P_G(x) of the generated samples can be highly complex and has no closed form. The generative model G is trained to make the probability distributions P_G(x) and P_data(x) as close to each other as possible. The generative model G seeks to generate fake data to confuse the discriminative model D, while the discriminator D seeks to differentiate between the real and fake data samples. This adversarial learning enforces the distribution generated by G to gradually approach the real data distribution.
The adversarial learning scheme can be formulated as the following optimization problem:

G* = arg min_G Div(P_G(x), P_data(x)),

where Div denotes the divergence or dissimilarity between P_data(x) and P_G(x). The function of the discriminator D can be mathematically defined as

D* = arg max_D V(G, D),

where the two models D and G interact through a two-player minimax competitive game with the objective function

V(G, D) = E_{x∼P_data}[log D(x)] + E_{z∼p_g}[log(1 − D(G(z)))].

Thus, the GAN parameter optimization problem can be formulated as

G* = arg min_G max_D V(G, D).

The two models G and D are updated through alternating optimization. As the adversarial learning process evolves, the model G will generate data that gradually resembles the real data.
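The objective V(G, D) can be estimated by Monte Carlo averaging over minibatches; a minimal numpy sketch (illustrative, assuming the discriminator outputs probabilities in (0, 1)):

```python
import numpy as np

def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]
    from the discriminator's outputs on real and generated minibatches."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.log(d_real).mean() + np.log1p(-d_fake).mean()
```

The discriminator ascends this quantity while the generator descends it, matching the alternating updates of the training algorithm.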
The training algorithm workflow is shown in Box 1. The gradient-based updates may use the momentum learning rule or any other standard rule.

Experimental Setup and Data Collection.
In our experiments, we used a CentOS7 system with an Intel Core i7 CPU with 128 GB memory and an NVIDIA RTX8000 48 GB GPU. We used Python 3.6 to implement the corresponding algorithms. Experiments were performed on the Codalab dataset (https://competitions.codalab.org/) [48,49]. We employed 378 CT images for model training and the remaining sequences for testing. Codalab has liver CT images with three-phase ground-truth data. We used the enhanced CT format with a 512 × 512 image resolution.

Performance Evaluation with DSC and MICCAI Metrics.
We use the DSC to assess the performance of the compared methods of liver image segmentation [50]. The DSC metric represents the spatial coincidence degree between the output and ground-truth segmentation results. Also, we consider five other segmentation metrics provided by the Society of Medical Image Computing and Computer-Assisted Intervention (MICCAI) [51,52], namely, volume overlap error (VOE), average symmetric surface distance (ASSD), root-mean-square symmetric surface distance (RMSD), maximum symmetric surface distance (MSSD), and relative volume difference (RVD). The aforementioned segmentation metrics are mathematically defined as follows.
First of all, the DSC metric can be defined as

DSC = 2|U_1 ∩ U_2| / (|U_1| + |U_2|),

where U_1 and U_2 denote the output and ground-truth segmentation results, respectively. A higher DSC value indicates better segmentation performance. The DSC value ranges between zero (indicating total dissimilarity between the output and ground-truth segmentation results) and one (indicating ideal total agreement between the output and ground-truth segmentation results). The other five MICCAI metrics are defined as follows. Once more, the symbols U_1 and U_2 denote the output and ground-truth segmentation results, respectively. Also, S(U_1) and S(U_2) denote the outlines of the segmented liver region and the associated ground-truth region, respectively. In addition, d(v, S(U_1)) denotes the shortest distance between any image pixel v and S(U_1), that is, d(v, S(U_1)) = min_{s ∈ S(U_1)} ||v − s||, where ||•|| is the Euclidean distance operator. In these terms,

VOE = 1 − |U_1 ∩ U_2| / |U_1 ∪ U_2|,
RVD = (|U_1| − |U_2|) / |U_2|,
ASSD = (Σ_{v ∈ S(U_1)} d(v, S(U_2)) + Σ_{v ∈ S(U_2)} d(v, S(U_1))) / (|S(U_1)| + |S(U_2)|),
RMSD = sqrt((Σ_{v ∈ S(U_1)} d(v, S(U_2))² + Σ_{v ∈ S(U_2)} d(v, S(U_1))²) / (|S(U_1)| + |S(U_2)|)),
MSSD = max{max_{v ∈ S(U_1)} d(v, S(U_2)), max_{v ∈ S(U_2)} d(v, S(U_1))}.

Apart from the DSC, lower values of four MICCAI metrics (VOE, ASSD, RMSD, and MSSD) indicate better matching between the output and ground-truth segmentation results. The RVD absolute value should be used in segmentation performance evaluation, since this metric might take negative values in the case of undersegmentation. The smaller the RVD absolute value is, the better the segmentation performance is. In general, the five MICCAI metrics are 0 when the segmentation is perfect.
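For the volumetric metrics, a numpy sketch over binary masks is straightforward (the surface-distance metrics ASSD, RMSD, and MSSD are omitted here, as they require extracting the segmentation outlines):

```python
import numpy as np

def dsc(u1, u2):
    """Dice similarity coefficient between two binary masks."""
    u1, u2 = u1.astype(bool), u2.astype(bool)
    return 2.0 * np.logical_and(u1, u2).sum() / (u1.sum() + u2.sum())

def voe(u1, u2):
    """Volume overlap error: 1 minus intersection over union."""
    u1, u2 = u1.astype(bool), u2.astype(bool)
    return 1.0 - np.logical_and(u1, u2).sum() / np.logical_or(u1, u2).sum()

def rvd(u1, u2):
    """Relative volume difference; negative under undersegmentation."""
    return (u1.sum() - u2.sum()) / float(u2.sum())
```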

Performance Evaluation with Binary Classification Metrics.
Several other metrics were exploited as well for evaluating the performance of liver segmentation algorithms, where image segmentation was cast as a binary classification problem with positive and negative classes corresponding to the liver and nonliver image pixels, respectively [35]. The term TP (true positives) stands for the number of pixels that are claimed by the segmentation method to be positive and are really positive according to the ground-truth labelling. The term FP (false positives) denotes the number of pixels that are suggested to be positive by the segmentation method but are actually negative according to the ground-truth labelling. The term TN (true negatives) expresses the number of pixels claimed to be negative by the segmentation method and actually negative according to the ground-truth labels. The term FN (false negatives) denotes the number of pixels claimed to be negative by the segmentation method but actually positive according to the ground truth. Based on these terms, the key evaluation indicators are defined as follows:
(1) Overall accuracy: the ratio of the correctly labelled pixels to the total pixel count, Accuracy = (TP + TN) / (TP + TN + FP + FN).
(2) Sensitivity: the ratio of the correctly detected liver pixels to all true liver pixels is called the sensitivity or the recall, Sensitivity = TP / (TP + FN).
(3) Specificity: the ratio of the correctly detected background pixels to all true nonliver background pixels, Specificity = TN / (TN + FP).
(4) Precision: the ratio of the correctly detected liver pixels to the total number of pixels labelled as liver pixels, P = TP / (TP + FP).
(5) False omission rate: the FOR of the liver pixels is defined as FOR = FN / (FN + TN).
(6) False discovery rate: based on the precision P, the FDR can be defined as FDR = 1 − P = FP / (FP + TP).

Box 1: GAN training with the minibatch stochastic gradient descent. A learning hyperparameter is the number of discriminator steps i.
for the number of Mask R-CNN training iterations do
    Apply k-means clustering to lock the image aspect ratio, and reduce the anchors.
    for i steps do
        • Pick a minibatch of m noise samples (z(1), ..., z(m)) following the noise prior p_g(z).
        • Use the stochastic gradient ascent to update the discriminator.
    end for
    • Pick a minibatch of m noise samples (z(1), ..., z(m)) following the noise prior p_g(z).
    • Use the stochastic gradient descent to update the generator.
end for
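The six indicators can be computed directly from the confusion counts; a small illustrative sketch:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the six pixel-wise indicators from the confusion counts."""
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall
        "specificity": tn / (tn + fp),
        "precision": precision,
        "FOR": fn / (fn + tn),           # false omission rate
        "FDR": 1.0 - precision,          # false discovery rate
    }
```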

Analysis of the Experimental Results
Experiments were conducted to compare the performance of three schemes: the conventional Mask R-CNN, the Mask R-CNN with k-means clustering (for optimization of the fully connected layer parameters) [36], and the GAN Mask R-CNN, which boosts the segmentation performance with adversarial learning capabilities. Eight slices of liver images (with normal and pathological cases) were collected and used to investigate the influence of the k-means clustering and GAN modules on the Mask R-CNN segmentation outputs. Figure 4 compares the results, which are shown as red contours. As shown in Figure 4, the conventional Mask R-CNN method obviously missed marginal liver regions during liver slice processing and hence produced segmentation errors. The Mask R-CNN with k-means clustering managed to correct some of these segmentation errors by incorporating the aspect ratio information of the liver image sequence, but visible segmentation errors remain in marginal liver regions. Our proposed GAN Mask R-CNN improved, to a certain extent, the segmentation accuracy and robustness for each slice in the sequence. This is an obvious advantage of our proposed scheme over the two other Mask R-CNN variants.
As shown in Table 1, the experimental results of the GAN Mask R-CNN were evaluated in terms of the following indicators: the overall accuracy, sensitivity, specificity, precision, FOR, and FDR. Our proposed GAN-based algorithm has a relatively high evaluation accuracy as well as a very low omission rate. This performance can be ascribed to the training adequacy and the relative robustness of the output boundary, though missegmentation or oversegmentation errors still exist. Table 2 shows that the GAN Mask R-CNN performs significantly better (95.3%) than the other two algorithms, with two exceptions: the VOE result of the conventional Mask R-CNN (21.58%) and the MSSD result of the Mask R-CNN with k-means clustering (21.32%). In addition, it is clear that the GAN module significantly enhances the segmentation performance according to the other indicators.
Further experiments were made to assess the impact of the GANs and k-means clustering modules on improving the segmentation outcomes in comparison to the FCN-8s, U-Net, 2D-FCN2, 2D-FCN1, 2D-dense-FCN, 3D-FCN, H-DenseUNet, 3D U-Net, IU-Net, and GIU-Net algorithms. The performance was evaluated using the metrics of accuracy, recall, specificity, precision, FOR, and FDR. A comparison of the segmentation results of the ten algorithms is shown in Figure 5. The test data for this comparison includes slices with large, medium, and small liver regions.
The results in Figure 5 show that, for the six liver slices considered, some undersegmentation or oversegmentation errors are made by the FCN-8s, U-Net, 2D-FCN2, 2D-FCN1, 2D-dense-FCN, 3D-FCN, H-DenseUNet, 3D U-Net, IU-Net, and GIU-Net algorithms. Also, some of the segmented liver slices do not exhibit complete boundaries, while others have extraneous parts that do not belong to the original liver slices. In contrast, our GAN Mask R-CNN method produces more solid boundaries and returns segmented liver slices with no extra holes. Table 3 indicates that the GAN Mask R-CNN method significantly outperforms the other algorithms, except for the GIU-Net method, which shows a better specificity, and the IU-Net method, which shows a better precision compared to our method. Nevertheless, our algorithm clearly outperforms the IU-Net and GIU-Net algorithms according to all other indicators. In addition, our algorithm shows superior performance on all six indicators in comparison to the FCN-8s, U-Net, 2D-FCN2, 2D-FCN1, 2D-dense-FCN, 3D-FCN, H-DenseUNet, and 3D U-Net algorithms.

Conclusions
This paper introduces a new method for liver image segmentation in CT sequences, where the GAN and Mask R-CNN methods are combined. We found that, during the liver segmentation process, most images exhibited noisy features in one way or another [52,53]. Under the influence of complex surrounding blood vessels and organs, the liver shape varies between different sections in the same set of CT images, and there are many soft tissues in the abdominal cavity with a density similar to that of the liver tissue [39]. Also, medical CT imaging often presents problems of low contrast and uneven grey-scale intensities, making it difficult to segment liver images accurately in the area of interest [18, 54-56].
In this work, we proposed a new CT-based liver image segmentation framework. Firstly, we sought to get more important anchors (and hence improve the segmentation results) through a k-means clustering algorithm, which was used to lock the image aspect ratio and reduce redundant and useless anchors. Secondly, we addressed the problem of noisy features in liver images, which, when no image enhancement is applied, renders a large number of images unusable and reduces the segmentation accuracy. Specifically, we incorporated a GAN architecture into our segmentation framework and demonstrated good performance in terms of six indicators: DSC, VOE, RVD, ASSD, RMSD, and MSSD. Thirdly, we compared our framework with the FCN-8s, U-Net, 2D-FCN2, 2D-FCN1, 2D-dense-FCN, 3D-FCN, H-DenseUNet, 3D U-Net, IU-Net, and GIU-Net algorithms. This comparison was based on six indicators: overall accuracy, sensitivity, specificity, precision, FOR, and FDR. Our improved GAN Mask R-CNN architecture demonstrated the best overall performance. We hope that our work can help radiology practitioners to further improve the diagnosis, timely detection, and treatment of liver diseases and also reduce the risk of death due to liver cancer.
In many studies, open-source deep learning tools may be applied for automatic liver segmentation. The performance outcomes of such tools could be compared with the outcomes of conventional knowledge-based planning tools, which typically yield acceptable accuracy levels as well as good reproducibility for clinical use. Additionally, patientspecific dose prediction improves the efficiency and quality of radiation treatment planning. In particular, this prediction can significantly reduce the treatment planning time.
In the future, we envisage that deep-learning-based automatic segmentation will become clinically useful, especially for dynamic daily treatment plans based on multimodality imaging.

Data Availability
Data access is available on request through the Codalab competition website (https://competitions.codalab.org/). The contact person is Xiaoqin Wei, School of Medical

Conflicts of Interest
The authors declare that there are no competing interests relevant to the publication of this paper.

Authors' Contributions
Xiaoqin Wei contributed to the study design and conducted experimental procedures and analysis in addition to writing the first draft of the manuscript. Yong Du supervised this study and revised the manuscript. Yuanzhong Zhu and Hanfeng Yang contributed to the experimental procedures and data analysis. Xiaowen Chen and Ce Lai designed the study and oversaw the experimental work, data analysis, and writing. Xiaowen Chen and Ce Lai contributed equally to this work. All authors contributed to the article and approved the submitted version. Xiaoqin Wei, Xiaowen Chen, and Ce Lai contributed equally to this work.