URDNet: A Unified Regression Network for GGO Detection in Lung CT Images

We present a 3D deep neural network, URDNet, for detecting ground-glass opacity (GGO) nodules in 3D CT images. Prior work on GGO detection either repurposes classifiers over a large number of windows to perform detection or refines boxes by regression after a preceding window classification step. Instead, we treat GGO detection as a multitarget regression problem that focuses on the location of the GGO. Furthermore, to capture multiscale information, we introduce a backbone network with a contracting-expanding structure similar to the 2D U-net, but we inject the source CT input into each layer of the contracting pathway to prevent the loss of source information at different scales. Finally, we propose a two-stage training method for URDNet. In the first stage, the backbone of the network is trained for feature extraction, and in the second, the overall URDNet is fine-tuned from the pretrained weights. By using this training method in conjunction with data augmentation and hard negative mining, URDNet can be trained effectively even on a small number of annotated CT images. We evaluate the proposed method on the LIDC-IDRI dataset, where it achieves a sensitivity of 90.8% at only 1 false positive per scan. Experimental results show that our method achieves detection performance superior to that of state-of-the-art methods. Owing to its simplicity and effectiveness, URDNet can be applied more easily to medical IoT systems to improve the efficiency of overall health systems.


Introduction
Lung cancer is currently a leading cause of cancer death worldwide and is responsible for more than 1.3 million deaths annually [1]. Detection and treatment of lung cancer at an early stage can improve the survival rate. GGO is a highly important CT imaging sign for the detection of lung cancer at an early stage [2]. It is defined as increased attenuation of the lung parenchyma without obscuration of the pulmonary vascular markings on CT images [3]. The main symptoms of the recent coronavirus disease COVID-19 are also related to GGO. However, because GGO nodules have indistinct boundaries and follow no clear rules for brightness and shape, they are easily overlooked, even by experienced radiologists. A promising solution to this problem is the use of computer-aided detection techniques.
The traditional architecture for computer-aided GGO detection typically consists of two stages: GGO candidate detection and false-positive reduction [4]. A small number of papers have been published on this topic. Bastawrous et al. [5] applied a Gabor filter to choose candidates and used an ANN to reduce false positives. Kim et al. [6] extracted tentative regions using binarization and classified the GGO nodules with a linear discriminant function. Jacobs et al. [7] first used intensity, shape, and context features to describe the appearance of candidates and subsequently applied a linear discriminant classifier and a gentle boost classifier to classify candidate regions. Although the conventional methods have yielded promising results, they still suffer from low sensitivity and poor generalization, especially for very small GGO nodules.
In recent years, nodule detection based on deep neural networks has achieved state-of-the-art detection performance. For example, Ginneken et al. [8] presented promising results for the extraction of nodule features using an off-the-shelf convolutional neural network (CNN) that was pretrained for a natural image classification task. Setio et al. [9] used multiple CNNs to extract discriminative features from the candidates, and these features were used to classify candidates as nodules or background. Superior performance was achieved in the false-positive reduction track. Roth et al. [10] proposed an effective 2.5D representation for lymph node detection to exploit the 3D information of nodules when training a deep network by taking slices of the CT images from a point of interest in 3 orthogonal views. The slices were subsequently combined into a 3-channel image as the network input. Han et al. [11] proposed hybrid resampling in multi-CNN models for 3D GGO nodules to cover a large range of scales, which reduced the risk of missing small or large GGO nodules. In general, these methods rely on classification or a combination of classification and regression for detection. Such methods usually do not pay enough attention to the localization part of detection and often produce missed detections and inaccurate locations. In addition, various types of neural networks have been applied in various applications, e.g., graph neural networks for creative works [12], LVQ neural networks for traffic prediction [13], generative adversarial networks (GANs) for style transfer [14], and 3D GANs for the simulation of creative stage scenes [15]. Most importantly, GGO detection requires a huge amount of computation in 3D CT and therefore needs a more efficient detection method to meet practical needs in medical IoT.
To overcome the above limitations, we propose a unified regression deep neural network for GGO detection. We treat GGO detection as a multitarget regression problem that maps directly from 3D CT to bounding box coordinates. To acquire the information that discriminates between the object and the background, the same pseudotarget (zero) is set for all negative samples, so the pseudotarget is also referred to as the zero target in this paper. Compared with classification-based methods, the learning goal of our whole detection pipeline is a single unified object location regression, which can guide the network to learn better object location information. Using a unified regression objective function therefore lets our approach pay more attention to the localization problem in detection. A multi-input and multioutput backbone convolutional network is also applied in our approach to make it more representative. Furthermore, we design a two-stage transfer training method to train our URDNet on a small amount of annotated GGO data. To evaluate the effectiveness of the proposed URDNet for GGO detection, we conduct GGO detection experiments on LIDC-IDRI [16], currently the largest publicly available and most often used database of lung nodules. The experimental results show that the network is effective and accurate.
Our main contributions are summarized as follows:
(1) We present an end-to-end deep convolutional neural network for GGO detection in 3D CT scans that is unified and uses only regression, which leads to outstanding GGO detection performance on LIDC-IDRI. The resulting detection sensitivity is 90.8% at 1 false positive per scan.
(2) We introduce a multi-input and multioutput structure for the backbone of our network. The backbone not only preserves subtle location information but also represents the discriminative information of GGO nodules.
(3) We propose a unified regression objective function for all samples. For positive samples, the position prediction is treated as a conventional regression task. For negative samples, a zero target is set, and the location of the negative samples (boxes) is regressed to this pseudotarget (zero).
(4) We adopt a two-stage training method to train the complicated 3D detection network given a small number of annotated samples.
The remainder of this paper is organized as follows. Sections 2 and 3 present our URDNet architecture and its training method, respectively. The implementation details and experimental results are discussed in Section 4. We conclude in Section 5.

URDNet Architecture
The network architecture is illustrated in Figure 1. It is composed of a backbone network for feature extraction, which has a multiscale input-output structure, and a detection head, which is a single prediction module based on location regression that directly generates a fixed set of 3D bounding boxes. Because of GPU memory limitations, the input of the method is a CT cube with a fixed size (128 × 128 × 128 in our experiments). The final detection result is the combination of all the GGOs detected in each cube. The details of CT preprocessing are as follows. For a 3D CT scan, voxel size normalization is performed first because voxel sizes vary across subjects. The voxel size is set to 1 × 1 × 1 mm³ by using bilinear interpolation (after voxel normalization, the number of slices along the axial dimension is greater than 128 for all the data in our experiments). Then, we divide the 3D CT volume into several 128 × 128 × 128 cubes in a sliding-window manner. Each cube is fed into our URDNet and processed. More details of URDNet are given in the following subsections.
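The division of a normalized volume into fixed-size cubes can be sketched as follows. This is only an illustration: the stride of 96 voxels (an overlap of 32 voxels between neighboring cubes) and the function names `cube_starts` and `tile_volume` are assumptions, since the text does not specify how much the cubes overlap.

```python
def cube_starts(length, cube=128, stride=96):
    """Start indices of cubes along one axis so that the whole axis is covered.

    The stride is a hypothetical choice; the paper does not state it.
    """
    if length <= cube:
        return [0]  # a shorter axis would need padding up to `cube`
    starts = list(range(0, length - cube, stride))
    starts.append(length - cube)  # the last cube is flush with the end of the axis
    return starts


def tile_volume(shape, cube=128, stride=96):
    """All (z, y, x) cube origins for a volume with the given shape."""
    zs, ys, xs = (cube_starts(n, cube, stride) for n in shape)
    return [(z, y, x) for z in zs for y in ys for x in xs]


# Example: a 300 x 320 x 320 volume after resampling to 1 mm voxels.
origins = tile_volume((300, 320, 320))
```

Detections from overlapping cubes would then be merged in scan coordinates by adding each cube origin to the predicted box centers.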
2.1. Backbone Network. The backbone network includes two main symmetric pathways: a contracting pathway and an expanding pathway. The contracting pathway follows the typical architecture of a convolutional network. To help the network capture spatial information at different scales, we construct a multiscale input structure by downsampling the source input and feeding it into each layer in the contracting pathway, not only the first layer. The feed operation is indicated by the green lines in Figure 1. We refer to these lines, which connect the source input CT to the layers in the contracting pathway, as source connections. The contracting pathway can be divided into five blocks, in which the output feature maps are downsampled by factors of 2, 4, 8, 16, and 32 with respect to the input cube size. Each block is a composite function of four consecutive operations: 3D convolution (Conv), batch normalization (BN) [17], rectified linear units (ReLU), and pooling (Avg: average or Max: maximum).
The expanding pathway is an information-expanding process that elucidates higher resolution features via an upsampling strategy. The upsampling procedure is implemented by a series of layers including unpooling, deconvolution, BN, and ReLU operations to perform a complicated deconvolution, as described in a previous paper [18]. The expanding pathway is semantically stronger because of the feature map from higher levels (the top of the contracting pathway). In contrast, the contracting feature map consists of lower-level semantics, but its activations are more accurately localized because it was subsampled fewer times. To preserve the localized information, the feature map of the expanding pathway is enhanced with the feature map from the contracting pathway via skip connections. Skip connections associate low-level feature maps across resolutions and semantic levels. Moreover, to create a multiscale feature map that has strong semantics and precise spatial information at all scales, we combine low-resolution and semantically strong features with high-resolution and semantically weak features via a top-down pathway and skip connections. Skip connections [19] are connections that can skip one or more layers. Similar architectures adopting a top-down pathway and skip connections are popular in recent research [20][21][22]. However, only a single high-level feature map of fine resolution was applied for prediction in previous networks. In contrast, our backbone leverages multiscale feature maps in which predictions are independently generated on each level for GGO detection.

Detection Head.
In the traditional sliding-window detection methods, the entire detection space is eventually discretized into a series of windows. Our network also discretizes the output space of bounding boxes into a set of anchor boxes with different scales over multiple feature maps. Each anchor box is a predefined box centered at a location of the feature map and is associated with a special initial scale, similar to the anchor box used in Faster R-CNN [23].
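To make the anchor discretization concrete, the sketch below enumerates anchors as (x, y, z, r) tuples over three feature-map grids. The grid sizes and the per-grid anchor counts (1, 3, and 5) follow the anchor configuration reported in the experiments. The assignment of specific sizes to each grid, the half-voxel center offset, and the name `make_anchors` are assumptions made for illustration.

```python
def make_anchors(cube=128, levels=None):
    """Enumerate anchor boxes (x, y, z, r) for one input cube.

    `levels` maps a feature-map grid size to its anchor side lengths in mm.
    The size-to-grid assignment below is a guess, not taken from the paper.
    """
    if levels is None:
        levels = {32: [4], 16: [6, 8, 10], 8: [12, 16, 20, 26, 32]}
    anchors = []
    for grid, sizes in levels.items():
        stride = cube / grid  # spacing of feature-map cells in input voxels
        for i in range(grid):
            for j in range(grid):
                for k in range(grid):
                    center = ((i + 0.5) * stride,
                              (j + 0.5) * stride,
                              (k + 0.5) * stride)
                    for r in sizes:
                        anchors.append(center + (float(r),))
    return anchors


anchors = make_anchors()
print(len(anchors))  # 32**3 * 1 + 16**3 * 3 + 8**3 * 5 = 47616
```

With 1 mm voxels, anchor sizes in millimeters and in voxels coincide, so no unit conversion is needed in this sketch.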
Our regression prediction head predicts bounding boxes based on a fixed set of anchor boxes. It regresses the relative offsets of 3D boxes from the anchor boxes to boxes that better match the GGO shape, using small convolutional filters applied to multiple feature maps. These processes are indicated by the yellow lines and the yellow rectangular box in the bottom area of Figure 1. As described in the previous section, the feature maps of the expanding pathway progressively increase in size. The multiscale feature maps are composed of multiple feature maps of different resolutions, and each feature map can produce a set of detection predictions. To detect GGOs of various sizes, the prediction pathway of our network naturally combines predictions from the multiscale feature maps. Additionally, at each location of the feature map, we simultaneously predict multiple boxes with different scales but the same center. The multiple boxes are parameterized relative to the corresponding anchor boxes. The block shown in Figure 2 is used to regress to the resultant boxes. Since we consider 3 anchor sizes and each result is a 4D vector, the output of this block is 12 = 3 × 4 values for each feature map location. The number of anchors for each feature map location must be carefully set to cover a wider and finer range of scales, and more details are provided in Section 4.1.
After the resultant boxes are obtained from the prediction pathway, we perform nonmaximum suppression (NMS) to rule out overlapping boxes. For each anchor, the prediction finally produces a four-component position map. If the area is background, the corresponding location of the component map is very close to zero. Only nonbackground boxes are retained, and the others are deleted. The retained boxes are then classified as GGOs if their position components are larger than a threshold and as non-GGO (background) otherwise. This threshold is set according to the application.
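The filtering and suppression steps can be illustrated with a greedy 3D NMS over cubes given as (x, y, z, r), where r is the side length. This is a minimal sketch under stated assumptions: the ranking `scores`, the IoU threshold of 0.1, and the `min_side` cutoff used to discard boxes regressed toward the zero target are all hypothetical, because the text does not give these values.

```python
def cube_iou(a, b):
    """IoU of two axis-aligned cubes given as (x, y, z, r), r = side length."""
    inter = 1.0
    for ca, cb in zip(a[:3], b[:3]):
        lo = max(ca - a[3] / 2, cb - b[3] / 2)
        hi = min(ca + a[3] / 2, cb + b[3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    union = a[3] ** 3 + b[3] ** 3 - inter
    return inter / union


def nms(boxes, scores, iou_thresh=0.1, min_side=1.0):
    """Return indices of kept boxes, highest score first.

    Boxes whose side length is below `min_side` are treated as background
    (close to the zero target) and dropped before suppression.
    """
    order = [i for i in sorted(range(len(boxes)), key=lambda i: -scores[i])
             if boxes[i][3] >= min_side]
    keep = []
    for i in order:
        if all(cube_iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 10, 8), (11, 10, 10, 8), (50, 50, 50, 6), (0.1, 0.0, 0.1, 0.05)]
scores = [0.9, 0.8, 0.7, 0.95]
kept = nms(boxes, scores)  # the second box overlaps the first; the fourth is background
```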

Training Method
Training of URDNet is a multitarget regression procedure because it simultaneously regresses the center and the diameter of each GGO. The details of the objective of our network are given in the following subsection. In addition, because only a small amount of CT data with annotated GGO nodules is available, the overall network is difficult to converge. We adopt a two-stage transfer training strategy to solve this problem. Further useful strategies are also given in the subsections below.
3.1. Loss Function. In our method, a true object is expressed by a box. Each box corresponds to a cube and is thus represented by a 4D vector (x, y, z, r), where (x, y, z) is the 3D center point and r is the side length of the cube. We also use a box to express the position of any background region, and the components of this box (x, y, z, r) are set to zero. We define the position of this background as the zero target (pseudotarget), and the target position of a true object as the real target. In this way, both positive samples (target windows) and negative samples (background windows) can be used as position targets. The detection problem can therefore be expressed with a single learning target and requires only position regression.
The overall localization loss includes the conventional regression loss for positive samples and a newly designed regression loss for negative samples.
The regression loss for positives is a modified smooth L1 loss [24] between the predicted 3D bounding box (denoted as l) and the ground-truth 3D bounding box (denoted as g). Similar to Faster R-CNN [23], we regress to the offset terms for the center (xc, yc, and zc) of the anchor 3D bounding box (denoted as d) and its size (denoted as r).
where ĝ_xc , and the weight term β is set to 0.6 through careful experiments in this paper, which means that we focus additional attention on the center point.
At the same time, because the negative samples have a zero target, we design a loss function that regresses to the zero target (denoted as ο), which means that the position components of the negative samples are driven toward zero. To sum up, the full optimization objective combines the regression loss for the positive samples with the regression loss for the negative samples, where λ_n is the weight of the regression loss for the negative samples and is set to 0.5 in this paper.
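One plausible form of this unified objective is sketched below. It is an assumption-laden illustration rather than the exact loss used in the paper: it uses the standard smooth L1 function instead of the modified variant, applies β to the center terms and 1 − β to the size term, regresses all four components of a negative sample toward zero, and normalizes by the number of positives. Predictions and targets are taken to be 4-tuples already expressed in the regression space.

```python
def smooth_l1(x):
    """Standard smooth L1 (the paper uses a modified variant not reproduced here)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def positive_loss(pred, target, beta=0.6):
    """Weighted regression loss for one positive sample (real target)."""
    center = sum(smooth_l1(p - t) for p, t in zip(pred[:3], target[:3]))
    size = smooth_l1(pred[3] - target[3])
    return beta * center + (1.0 - beta) * size  # assumed placement of beta


def negative_loss(pred):
    """Regress all four components of a negative sample to the zero target."""
    return sum(smooth_l1(p) for p in pred)


def total_loss(pos_pairs, neg_preds, beta=0.6, lambda_n=0.5):
    """Positive loss plus lambda_n times negative loss.

    Dividing by the number of positives is an assumption.
    """
    n = max(len(pos_pairs), 1)
    pos = sum(positive_loss(p, t, beta) for p, t in pos_pairs)
    neg = sum(negative_loss(p) for p in neg_preds)
    return (pos + lambda_n * neg) / n


loss = total_loss(
    pos_pairs=[((0.5, 0.0, 0.0, 2.0), (0.0, 0.0, 0.0, 0.0))],
    neg_preds=[(0.5, 0.5, 0.0, 0.0)],
)
```

Setting β above 0.5 makes errors in the center coordinates cost more than errors in the size, which matches the stated intent of focusing on the center point.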

Two-Stage Training.
We add a classifier module with a 3D Avg-pooling (4 × 4 × 4) layer and a two-class softmax layer behind the backbone of our network to construct a standalone GGO classifier. The 64 × 64 × 64 positive cubes are cropped from the lung scans such that they contain only one GGO nodule. More positive cubes are generated by data augmentation. The 64 × 64 × 64 negative cubes without nodules are randomly cropped. It is easier to construct a relatively large-scale dataset for training the nodule classifier than for training the detector network. Moreover, the standalone classifier can be trained much faster than our full network because its input size is 1/8 of the URDNet input size and it solves only a binary classification problem. To avoid slow convergence, the weights are initialized with the Glorot and Bengio [25] initialization. After training the classifier network, we truncate the weights of the classifier network to initialize our backbone and train the URDNet by fine-tuning with a smaller learning rate. We find that the URDNet loss converges more quickly than when training from scratch. Training our network in this way can effectively alleviate the overfitting problem.
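The weight transfer between the two stages can be pictured with plain dictionaries that map parameter names to values. This is a framework-agnostic sketch: the `backbone.` name prefix, the parameter names, and the function `init_backbone` are hypothetical, and a real implementation would operate on the state of whatever deep learning framework is used.

```python
def init_backbone(detector, classifier, prefix="backbone."):
    """Copy backbone weights from the trained classifier into the detector.

    Classifier-only layers (for example the softmax head) are left behind,
    and detector-only layers (for example the regression head) keep their
    initial values.
    """
    copied = []
    for name, value in classifier.items():
        if name.startswith(prefix) and name in detector:
            detector[name] = list(value)  # copy so the two models do not share storage
            copied.append(name)
    return copied


# Toy parameters standing in for real tensors.
classifier = {"backbone.conv1": [0.2, -0.1], "cls.softmax": [0.7]}
detector = {"backbone.conv1": [0.0, 0.0], "head.reg": [0.0]}
copied = init_backbone(detector, classifier)
```

After this step, the whole detector would be fine-tuned with a smaller learning rate, as described above.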

Data Augmentation.
Considering that only a few training samples are available, we tested several data augmentation operations. We explored both lossless and lossy augmentation and found the following two simple but useful augmentations for GGO nodule patches:
(1) Flipping. The patches are randomly flipped with respect to the coronal, sagittal, and axial dimensions.
(2) Resizing. The patches are randomly resized with a ratio between 0.9 and 1.1.
We also tested other augmentation methods, such as axis swapping and rotation. However, they had no significant effect.
Furthermore, to boost nodule classification performance, we adopt an online hard negative mining technique, similar to the online hard example mining (OHEM) method proposed in a previous paper [26]. We use online hard negative mining in the training procedure. First, we process the input patches using the network to conduct a forward pass and obtain tens of thousands of detections, each of which is associated with a box and a classification confidence score. Second, N negative samples are randomly selected from these boxes as a candidate pool. Third, the negative samples in this pool are sorted in descending order of their overall loss, and the top K samples are selected as the hard negatives.
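The three mining steps map directly onto a few lines of code. In the sketch below, the per-sample losses are assumed to come from the forward pass, and the function name `mine_hard_negatives` and its parameters `pool_size` (N) and `top_k` (K) are illustrative names rather than the authors' implementation.

```python
import random


def mine_hard_negatives(neg_losses, pool_size, top_k, seed=None):
    """Select hard negatives: sample a random pool, then keep the highest-loss ones.

    neg_losses: per-sample loss of every negative box from the forward pass.
    Returns the indices of the selected hard negatives.
    """
    rng = random.Random(seed)
    indices = list(range(len(neg_losses)))
    # Step 2: random candidate pool of N negatives.
    pool = rng.sample(indices, min(pool_size, len(indices)))
    # Step 3: sort by loss in descending order and keep the top K.
    pool.sort(key=lambda i: neg_losses[i], reverse=True)
    return pool[:top_k]


losses = [0.1, 0.9, 0.3, 0.7, 0.5]
hard = mine_hard_negatives(losses, pool_size=5, top_k=2)
```

Sampling a random pool before sorting keeps some diversity among the negatives, instead of always training on the same few highest-loss boxes.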

Experiments
In this section, we first describe the experimental setup and the details of our dataset. Because anchor boxes improve the speed and efficiency of the detection portion of a deep neural network, we then give more details about the anchors. Finally, we give the results of our two groups of experiments. One group is used to evaluate our URDNet, and the other is used to compare its performance with that of other systems.

Experimental Setup
4.1.1. Dataset. The dataset is collected from the largest publicly available reference database for lung nodules, the LIDC-IDRI [16]. The LIDC-IDRI database contains a total of 1018 CT scans. Each LIDC-IDRI scan was annotated by experienced thoracic radiologists using a two-phase reading process. Four radiologists annotated scans and marked all suspicious lesions as nodule ≥ 3 mm, nodule < 3 mm, or nonnodule. We only considered the GGO nodules. Simultaneously, we only considered the nodules annotated by the majority of the radiologists (at least 3 out of 4 radiologists) in our standard reference. 302 CT scans from 299 patients were collected from LIDC-IDRI. In these CT scans, 635 GGO nodules (271 nodules annotated by 3 radiologists and 364 nodules annotated by 4 radiologists) were found. The diameter of the nodules varied from 3 to 34 mm with a median of 10.3 mm. We use 250 CT scans as a training set and the remaining 52 as a test set. A slice from one patient with a single nodule location is shown in Figure 3.

Anchor Configuration.
All the scales of information decoded in the expanding pathway of our URDNet, namely 32 × 32 × 32, 16 × 16 × 16, and 8 × 8 × 8, are fed into the detection head. The numbers of anchor sizes are set to 1, 3, and 5 for these three scales, respectively, which leads to 9 anchor sizes in total: 4, 6, 8, 10, 12, 16, 20, 26, and 32 mm. The details are listed in Table 1. To sum up, we generate 47616 (= 32³ × 1 + 16³ × 3 + 8³ × 5) anchors for each CT cube, from which 47616 resultant boxes are computed. We find that the detection scale range in our test set is between 3 mm and 33 mm, which is larger than the scale range of the anchor boxes. Benefitting from the

Experimental Results.
In our experiments, we use a free-response receiver operating characteristic (FROC) analysis [27] on the filtered GGO dataset from LIDC-IDRI [16] for comparison. In the FROC curve, the sensitivity is plotted as a function of the average number of false positives per scan (FPs/scan). In this work, the sensitivity is defined as the number of detected true positives divided by the number of ground-truth GGO nodules. The overall FROC score is defined as the average of the sensitivity at seven predefined false-positive rates: 1/8, 1/4, 1/2, 1, 2, 4, and 8 FPs per scan. This performance metric was introduced in the ANODE09 challenge and is referred to as the competition performance metric (CPM) in a previous paper [28].
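The CPM computation can be sketched as follows. This is a simplified illustration under stated assumptions: each detection is given as a (score, is_true_positive) pair, every true-positive detection is assumed to hit a distinct nodule, and the sensitivity at a given rate is taken as the highest sensitivity reached without exceeding that number of false positives per scan (no interpolation). The function names are hypothetical.

```python
CPM_RATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)  # FPs per scan


def froc_points(detections, n_nodules, n_scans):
    """Sweep the score threshold and return (fps_per_scan, sensitivity) lists."""
    tp = fp = 0
    fps, sens = [], []
    for _score, is_tp in sorted(detections, key=lambda d: -d[0]):
        if is_tp:
            tp += 1
        else:
            fp += 1
        fps.append(fp / n_scans)
        sens.append(tp / n_nodules)
    return fps, sens


def sensitivity_at(fps, sens, rate):
    """Highest sensitivity reached with at most `rate` false positives per scan."""
    best = 0.0
    for f, s in zip(fps, sens):
        if f <= rate:
            best = max(best, s)
    return best


def cpm(fps, sens):
    """Average sensitivity over the seven predefined false-positive rates."""
    return sum(sensitivity_at(fps, sens, r) for r in CPM_RATES) / len(CPM_RATES)


# Toy example: 4 nodules in 2 scans, 6 detections.
dets = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True), (0.4, True)]
fps, sens = froc_points(dets, n_nodules=4, n_scans=2)
score = cpm(fps, sens)
```

In this toy example the sensitivities at the seven rates are 0.25, 0.25, 0.5, 1, 1, 1, and 1, so the CPM is 5/7.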
We first conduct three experiments using different backbones or detection heads to evaluate the effect of our network. We use the same training dataset and data augmentation strategy. Other hyperparameters of the training network are also shared, except for specified changes to components. The SSD method predicts bounding boxes and confidence scores based on a fixed set of anchor boxes, which is directly related to the detection head of our URDNet. In contrast, U-net has an elegant encoder-decoder structure, but it only uses the last feature map for biomedical applications and lacks the multi-input and multioutput structure. Table 2 lists the results in the GGO detection task.
Figure 4: Examples of the detection by URDNet (the white rectangles denote the ground-truth boxes, the green rectangles denote the detection results, and they are zoomed at the top-right area or the left-bottom area, and the red rectangles denote the wrong results).
Wireless Communications and Mobile Computing
Our URDNet achieves the highest sensitivity (93.5%) at the lowest FPs/scan (6.8) among these detection experiments, which demonstrates the superiority of our detection network. Some correctly detected examples are illustrated in Figure 4(a). These results indicate that URDNet can accurately locate the centers of GGO nodules and regress their sizes. Examples of false detections of GGO nodules are also shown in Figure 4(b). Typically, these nodules have notably low contrast (pure or diffuse) or are located close to other tissues (blood vessels or the chest wall) and can be considered notably low-quality GGO nodules. To further improve the detection performance, additional discriminative features must be learned by a new learning method.
Although a few methods have been developed for GGO nodule detection, it is impractical to compare against all of them. In this paper, we choose three GGO detection systems or methods from different categories for comparison. We first select the SubsolidCAD [29] system, which is a state-of-the-art conventional method that uses 4 categories of handcrafted features to describe the appearance or the internal characteristics of the GGO. The system can reach a sensitivity of 80% at an average of 1.0 false positives per scan with a CPM [28] of 0.734. We also compare our performance with the Aidence [30] system based on convolutional networks, which is the strongest competitor and the top performer in the LUNA16 Challenge. According to the report [30], the best score was achieved by Aidence with a CPM of 0.764 in the GGO nodule candidate detection task. The Aidence system uses end-to-end convolutional networks that are trained on a subset of studies from the National Lung Screening Trial [31]. Last, we choose the S4ND [32] and GA-SSD [33] methods for lung nodule detection, which are current state-of-the-art methods. The S4ND method is a single-shot and single-scale method, while GA-SSD is an improved method based on SSD that adds an attention mechanism. Additionally, we compare the performance of our URDNet, SubsolidCAD, Aidence, S4ND, and GA-SSD in Figure 5. We observe that our URDNet attains superior performance. The CPM reaches 0.874, surpassing the SubsolidCAD system (CPM: 0.734) with a relative performance gain of 19.07%, the Aidence system (CPM: 0.767) with 13.95%, S4ND (CPM: 0.866) with 0.92%, and GA-SSD (CPM: 0.855) with 2.26%.

Conclusions
In this paper, we present a 3D convolutional detector network, URDNet, built on a multiscale input-output U-shaped network for GGO detection in CT images. URDNet uses a unified regression objective function that focuses learning on object location and directly regresses a fixed set of 3D boxes for all samples. Furthermore, a two-stage training method is designed to help our complicated 3D detector network converge and to prevent overfitting, even when only a small number of annotated GGO nodule CT images is available. With this training method, combined with data augmentation and hard negative mining, our network can be trained efficiently and effectively in an end-to-end manner for GGO detection.
We believe that URDNet offers a useful tool for GGO nodule detection in the clinical diagnosis of lung cancer. Moreover, our independent GGO detection algorithm can be easily integrated into existing lung nodule CAD systems to boost the overall system performance. Immediate future work will extend our network to the detection of other nodules. Because this method does not require large amounts of labeled data and has a simple, unified regression objective, it can be applied more easily to other nodule processing tasks in medical IoT systems.

Data Availability
The data used to support the findings of this study are from the previously reported LIDC-IDRI dataset, which is currently the largest publicly available database of lung nodules. These data are cited at the relevant places within the text as references.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.