Adaptive Localizing Region-Based Level Set for Segmentation of Maxillary Sinus Based on Convolutional Neural Networks

In this paper, we propose a novel method, an adaptive localizing region-based level set using convolutional neural network, for improving performance of maxillary sinus segmentation. The healthy sinus without lesion inside is easy for conventional algorithms. However, in practice, most of the cases are filled with lesions of great heterogeneity which lead to lower accuracy. Therefore, we provide a strategy to avoid active contour from being trapped into a nontarget area. First, features of lesion and maxillary sinus are studied using a convolutional neural network (CNN) with two convolutional and three fully connected layers in architecture. In addition, outputs of CNN are devised to evaluate possibilities of zero level set location close to lesion or not. Finally, the method estimates stable points on the contour by an interactive process. If it locates in the lesion, the point needs to be paid a certain speed compensation based on the value of possibility via CNN, assisting itself to escape from the local minima. If not, the point preserves current status till convergence. Capabilities of our method have been demonstrated on a dataset of 200 CT images with possible lesions. To illustrate the strength of our method, we evaluated it against state-of-the-art methods, FLS and CRF-FCN. For all cases, our method, as assessed by Dice similarity coefficients, performed significantly better compared with currently available methods and obtained a significant Dice improvement, 0.25 than FLS and 0.12 than CRF-FCN, respectively, on an average.


Introduction
Nasal diseases are growing common for individuals with serious impacts on daily life. In America, there are about 16% adults suffering from this trouble, and there are 100 million in China [1]. Acute sinusitis is simple for management with antibiotic drugs [2]. However, for chronic one, functional endonasal sinus surgery (FESS) may be the only solution of relief. FESS contains some processes: maxillary sinus fenestration, ethmoid sinus resection, etc. [3]. It is found that a lot of risks may appear in surgery due to great variations of nasal anatomy, where optical nerves and carotids have higher possibility to be affected [4]. Segmentation and the subsequent quantitative assessment of lesions in medical images provide valuable information for the analysis of diseases, and quantitative imaging can reveal clues about lesion characteristic and anatomical structure. For example, preoperative computed tomography (CT) affords radiologists the opportunity to prospectively identify anatomic variant that predispose patients to major surgical complications [5]. A real-time surgical navigation has joined in FESS [6]. To make FESS safer, surgeons use navigation systems that register a patient to his/her CT scan and track the position of tools inside the patient [7].
ere are growing evidences that the qualification of sinuses increases insights into functional outcomes and requires accurate segmentation which is a challenging task for a number of reasons. e heterogeneous appearance of lesions including large variability in location, size, shape, and frequency makes it difficult to design effective segmentation rules. Although the most accurate segmentation results can be obtained through manual delineation by an experienced expert, it costs plenty of tedious time, nearly, 8-10 hours per case of one patient [8]. Besides, the expertise decides whether a particular region is a part of the segmentation object. In order to understand the complexity of maxillary sinus, it is necessary to conduct wide studies to gain statistical pattern for drawing conclusions. erefore, a more accurate, automatic segmentation algorithm has been required as soon as possible. Figure 1 illustrates some of the potential challenges when proposing a computational approach for the task of automatic maxillary sinus segmentation. e figure demonstrates intensity statistics of great difference and shows examples of lesions in the cavity of maxillary sinus. Lesions can be found at multiple sites, with different shapes and sizes which misguide curve evolution as a result of high gradient noisy areas. Meanwhile, it is generally difficult to derive statistical prior knowledge to set up a similar description for shape of maxillary sinus due to common anatomical abnormality [9]. Ideally, an acceptable solution is able to adjust itself to foreground by learning from potential features in a few of examples.
Level set is popular as a curve evolution application, especially for medical image segmentation [10]. Compared with active contour methods based on points' motion, it has the ability to handle image noise, intensity heterogeneity, and discontinuous object boundaries. Level set methods include models based on edge [10][11][12] or region [13][14][15]. Edge-based models are sensitive to noise and objects with incomplete boundaries or low contrast texture. Region-based ones study spatial statistics of interesting regions to find global minima of energy functional but lack competence in details. While, in many examples of great heterogeneous foreground and background, a local region model has a better performance than the global one [16][17][18][19][20]. Hybrid models are superior, defining an energy functional with local and global constraints to obtain a more robust segmentation with little sensitive initialization.
In level set works, the choice of functional control parameters is puzzled for a long time. Li et al. [21] give proof that an inappropriate definition may lead to an inferior segmentation regardless of initialization. As Lankton et al. described [16], parameters have close relationships with direction and speed of active contour according to specific texture feature. A slight change in the group of parameters even produces entirely different results. erefore, the fixed parameters cannot survive under different spatial distributions of intensity in maxillary sinus with lesion. Some researches tend to test an optional value over a series of training set for the entire database of images, but usually, new images may require additional experiments to find the best-fitted parameters. As a result, choosing a fixed set of parameters by trial is a time-consuming and laborious process. In addition, most users do not have enough experience to tune a large number of parameters optimally. For this reason, an adaptive strategy is absolutely desirable. If, in the process of segmentation, parameters could be adjusted dynamically based on specific texture, the approach has a higher chance of providing a more accurate segmentation.
Some papers present an algorithm to estimate parameters for energy functional before segmentation. Li et al. [21] introduce a method to adjust parameters of the level set model according to classification of pixels with intensity. Unfortunately, the value is initialized once at the beginning of work and remains constant all the time, which causes a poor segmentation especially for the object with great variant lesions. Oliviera et al. [22] provide a mechanism to evaluate parameters from analysis of the training set which also can be reused in new images. It is obviously inappropriate for sinus cavity with highly diverse lesions. Baillard et al. [23] devise a solution for the problem of spatial heterogeneity and take full considerations for every point position in every iteration. at is, a point on the contour at first determines its status. If it belongs to the object, it should locally extend outwards. And, if not, it should move in the opposite direction. is classification depends on Gaussian and the shifted Rayleigh statistical distribution models via training set by maximizing the posterior probability. However, because of the diversity of lesion, its prediction accuracy of position would be doubted and discussed depending on the conventional machine learning method. Consequently, [21][22][23] are likely to perform unacceptably for highly diverse datasets and need more optimizations.
At the same time, deep learning techniques have appeared as a powerful alternative for supervised learning applications such as classification, detection, and other areas [24][25][26][27]. Convolutional neural network (CNN) [28,29] can learn highly discriminative features and have been widespread with predominance on a variety of problems including medical imaging. Ciresan et al. [30] adopt classical CNN for the segmentation of the neural membrane. Obviously, this strategy has two drawbacks. At first, it costs so much computing time since the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches. In addition, there is a trade-off between localization accuracy and the use of context. For more accurate localization, Ronneberger et al. give a more elegant architecture U-Net derived from fully convolutional network (FCN) [31], which consists of a contracting path to capture context and a symmetric expanding path. Although the algorithm gains a certain achievement, the result of the object specifically in edge is too coarse to be accepted for medical image segmentation. Kamnitsas et al. [32] incorporate conditional random field (CRF) into FCN. Details of segmentation are refined by fuzzy classification based on relations between position and class of pixels in future. It is a pity that this method suffers from noises seriously. Assaf et al. [33] propose to predict the position of active contour with CNN. As this method focuses on entire contour status on average, it cannot deal with the irregular shape of object particularly. ese methods are proved successful in computer vision applications on medical images. e idea of CNN strengthens knowledge of significant features on large scale of the training set, while a big challenge for appearance or texture with great distribution in space has come up with maxillary sinus segmentation. It is impossible to set up a class-balance training set to describe all possible cases with lesion.
In this paper, we present a novel improvement of the level set segmentation using the convolutional neural 2 Computational Intelligence and Neuroscience network. Our method is a multistage process. A convolutional neural network is used to identify the location of a point on the zero level set contour. If the point has high probability close to lesion area, its speed can be compensated to a certain extent based on output probabilities of CNN. If not, we believe it has arrived in the target area. is interactive procedure happens when minimizing the cost functional over iterations of the level set. It is unnecessary for a more accurate initial contour at the beginning of segmentation. Contrary to current hybrid level set frameworks, our method has little dependence on special initialization and does not include any assumptions about sinuses and lesion characteristics. erefore, our proposed algorithm has the strong adaptiveness in diverse datasets of maxillary sinus with heterogeneous lesion that include plenty of noise and abnormal anatomical structure.
To the best of our knowledge, this is the first use of CNN to identify location of points on the zero level set with some magnitude of speed compensation, avoiding energy functional from being trapped into local minima and resulting in a generalized segmentation solution than methods available to date.

Energy Models.
We used two different energy models to extensively evaluate our proposed adaptive localizing region-based segmentation algorithm: uniform modeling energy (UM) and mean separation energy (MS). In this paper, the innovative method is optimized from local version of the level set [16], and energy is given as follows: where x and y denote the position of specific pixel located in the coordinate of current image, B(x, y) represents local regions centering at every point on the contour, such as a rectangular or circle, and the function F is a generic internal energy measure used to describe local adherence to a given model at each point along the contour. In order to keep the curve smooth, we add a regularization term as is commonly done. erefore, the UM or MS model is selected as possible candidates for F to be implemented in our proposed method. By taking the first variation of (1) with respect to ϕ, we obtain the following evolution equation: Computational Intelligence and Neuroscience 3

Uniform Modeling Energy. A well-known example of an energy that uses a constant intensity model is the Chan-
Vese energy [13], which we will refer to as the uniform modeling energy described as e energy model divides region of interest (ROI) into foreground and background. u and v represent their intensity means, respectively. Set Ω as a bounded subset in R 2 and I(y) as the coordinated of a point on image I. Let ϕ(y) be a signed distance map and H ϕ(y) be the foreground region. A local version of the UM model can be used by replacing u and v with their local versions, u x and v x , to represent the local means of a region divided into exterior and interior surrounding each contour point, according to equation (4). λ 1 and λ 2 are parameters that have impactions on speed direction and magnitude which have close relationship with the texture of ROI. e related derivation process could be provided in [13]:

Mean Separation
Model. e mean separation model is first proposed by Yezzi et al. [34] which we refer to as mean separation energy: is energy functional arrives in minima when foreground and background regions have maximally separate mean intensities. ere is a strong assumption that the object and its background have the largest difference of intensity.
ere is no restriction on how well regions are modeled by u and v. In our method, we design the local version of F as By substituting the derivative of F MS into (2), we obtain the local region based on speed flow. is allows the MS model to find image edges effectively without considering the uniformity of global internal or external regions.

e Proposed Method.
e proposed method contains an interactive hybrid of the localizing level set and CNN, illustrated in Figure 2, which estimates the probabilities of points' location and helps the energy functional escape from local minima effectively. From borrowed spirits from previous work [35], we initialize the level set model with a contour surrounding the segmentation object. In general, the contour should move to the target as the required speed as level set functional. However, as the result of noisy disturb, the moving contour may be trapped in the local noises. erefore, we predict the real target possible location and suggest new speed direction and value to accelerate the moving contour moving state evolving more reasonably.

Speed Compensation.
As introduced in Section 1, difficulties in the segmentation of maxillary sinus are lesions with uncertain positions, shapes, and intensities. Certainly, a healthy one can be dealt with readily. Holding higher gray value, lesions are easy to frustrate evolution of the initial contour that hardly arrives in the target. In our optimization, we propose a solution of speed compensation to resolve this problem. at is, if speed of a point on the contour is detected close to zero, a devised convolutional neural network is required to identify features of its local region. For more accurate result, we train it on a large scale of the training set. About details, we will discuss it later. e CNN outputs each of two classes: boundary of sinus (p 1 ) or lesion (p 2 ). In the interactive step, we use the probability value of two classes to give speed compensation, which is calculated using the following equation based on the MS model: where λ 1 impacts the weight of moving the curve outward along its normal and, alternatively, λ 2 tries to move the curve inward. Both relative items interact on each other and decide ultimate magnitude of speed. exp(|λ 1 + λ 2 | (1/2) )(((1 + p 2 )/ (1 + p 1 )) − (1/2)) is the term of speed compensation. Our research relies on an important assumption that any initial contour is given inside sinus cavity, since there are plenty of other sinus components outside bringing with impossibility to automatic segmentation. For a stable point, if p 2 ≫ p 1 , it can be paid for a great of velocity and tend to move far away.
If it locates in the region of the maxillary sinus boundary, p 2 ≪ p 1 , the compensation is nearly zero, maintaining current status till convergence. We drop the confused condition, |p 1 − p 2 | ≤ 0.2, without considerations in order to make sure of stability, which is found in very few cases in validation experiment.

CNN Architecture.
For the research of medical image segmentation, the most unsatisfied results are caused by heterogeneous texture feature that leaves the energy functional in local noisy minima. Our method generalizes this problem into a machine learning process with a common architecture for the CNN. e proposed architecture consists of two convolutional layers followed by three fully connected layers with outputs of two classes (Figure 3). e input of the image is sampled by a point's local region, and its size is fixed on 32 × 32. ere are two convolutional layers at the beginning of our proposed CNN architecture, which includes a 5 × 5 filter and a 2 × 2 max pooling filter, respectively. e difference between two layers is the depth and stride. All hyperparameters we prefer have been tested to perform better than others. A nonliner activation function is applied to the outputs of the convolutional layer. In this paper, we choose Leaky ReLU [36] for network to give a sparse quality decreasing computing complexity and keeping backward propagation running smooth. To avoid overfitting and increase accuracy of outputs, we design a batch normalization (BN) filter [37] before ReLU configuration, which also resolves the problem of different distributions in training set or prediction. Each convolutional block of our CNN is illustrated in Figure 4. We give this light model as considerations of computation speed and the weight of our proposed program. A more complicated CNN should be offered, but it will affect the efficiency of the level set model that accounts for more computation costs.

Training the CNN.
In the project, the cost function of CNN is selected as cross-entropy loss, which evaluates the performance of the network after each batch: where t represents the true label for every training example and f(ω, x) is the prediction function. When the n th example is classified into m th , T m,n equals 1. Otherwise, T m,n equals 0. p m,n is the probability of the n th example being classified into the m th class. In order to prevent overfitting, L2-regularization is introduced to penalize the size of the weights in f(ω, x), where η � 1.0 is the coefficient of regularization. In the model parameters' learning, we adopt the strategy of moving average, including BN filter and stochastic gradient descent (SGD), and decay is set as 0.9997. With the same decay factor, for searching global minima effectively, every 10000 training steps, we adjust the learning rate exponentially based on the number of steps. And, in fully connected layers, we set dropout and keep probability 0.6 to make sure a sparse network for an acceptable result. To the initialization of model parameters, truncated Gaussian weight distribution is used for generation and standard deviation is 0.1. e CNN is trained with minibatch stochastic gradient descent with a batch size of 32 images.
Toolkit of experiments depends on SLIM, a high level encapsulation of Tensorflow.

Without Reinitialization of Level Set and Narrow Band.
Conventional level set has a boring drawback that the necessary signed distance function (SDF) cannot maintain special characteristic after some iterations. Common solution relies on reinitialization recurrently which costs plenty of computation time and produces numerical errors frequently. Li et al. come up with a regularization idea to resolve this problem [38] effectively. By appending a SDF regularization term on objective energy functional, the evolutionary process of zero level set contour is able to keep quality of SDF without reinitialization so that the complexity of algorithm has been reduced significantly and the final result is prominent among the peer. To save more computation time, the proposed method calculates the energy functional only for grid points located within a narrow band of the distance map, since localizing region-based level set [16] focuses on the local interior and exterior of points on the zero level set contour. Besides, through experiments, we found that the window size of these points outperforms nearly less than 10, a really 'narrow' band ROI.

Training Set and Image
Preprocessing. e training set of two classes is divided into lesion or not. As we use CNN to analyze the static point's local region feature, it certainly stays at the edge of sinus cavity or lesion. Consequently, in detail, when constructing the training set, we sampled points on both conditions in average. e size of example is 10 × 10 and amplified to 32 by bilinear interpolation. During experiments, we tried different sizes and this group performed outstandingly. e number of training patches with two  Computational Intelligence and Neuroscience classes is collected as 50 thousands, respectively. For inputs of neural network, normalization of gray values is necessary for speed and accuracy. Common strategies are often influenced by noises greatly and insensitive to significant feature. We apply contrast-limited adaptive histogram equalization (CLAHE) that enhances two classes' different texture feature expressions and reduces interference of trivial noises. Figure 5 shows a subset of the training set and corresponding preprocessed feature maps.

Data Augmentation.
For enlarging the limited training set, the augment dataset was created by applying a combination of elastic and affine distortions. We created elastic distortions by generating stochastic displacement field with values within the range of [− 1, 1], convolving these fields with a range of Gaussian filters, and multiplying the resulting matrices by a range of constant factors, controlling the intensity of the deformation. We also tried to adopt rotation to augment training batches. e result showed this approach is limited in improvement of training quality. erefore, we did not introduce this design at last. In the Section 1, we demonstrated that the size, location, and heterogeneity of lesion have high diversity of characteristics. A wide extent was found in the full set of images. erefore, using a fixed group of parameters in energy functional for all cases of maxillary sinus is not feasible in practice. In other words, a novel solution should be supplied to stop points from falling into local minima. A wide range of lesion's intensity distribution illustrates the importance of and need for speed compensation based on convolutional neural network that evaluates feature of regions and achieves global optimum.

Results and Discussion
For evaluation, two radiologist supplied annotations of manual segmentation as ground truth with more than five years of experience. e final result of our proposed method was quantitatively compared with the average of the two radiologists' marking. For the research on sensitivity of segmentation to varied initializations, in every image, we gave five radii 3, 5, 7, 9, and 11 pixels. All initial contours are located inside cavity of sinus. Some were close to the center and others near the sinus boundary. is broad range of initializations allowed us to evaluate robustness of our algorithm in any case.

Segmentation Performance.
Energy functional of parameters μ 1 � 0.05, λ 1 � 1.0, and λ 2 � 1.0 was used, as we tested them with the best performance for all 660 CT images without speed compensation. Figures 6(c), 6(f ), 6(i), 6(l), and 6(o) show some examples of segmentation for different cases. Segmentation performance was assessed using the  Figure 3: CNN architecture. e input is an 32 × 32 image of a point's local region. Two convolutional components include a 5 × 5 filter and 2 × 2 max pooling. ree fully connected layers contain 1600, 500, and 2 nodes, respectively.  Computational Intelligence and Neuroscience Dice similarity coefficient (Table 1). e Dice coefficient was calculated relative to each radiologist's manual marking, and then, an average Dice score was estimated. Our proposed segmentation method has high agreement with the manual markings for different local energy models. Results were better than the overlap that was measured between the complete manual annotations of both radiologists, thus demonstrating the strength of our proposed method.

Comparison with Fixed Contour Parameter Method.
We compared our method with a state-of-the-art work energy model of localizing region-based level set based on fixed parameters [16] (FLS). In this experiment, we chose μ 1 � 0.05, λ 1 � 1.0, and λ 2 � 1.0 as parameters of energy functional, since they had the best performance in all training set without speed compensation. Figure 6 shows some cases of different maxillary sinus, initial contour, and final segmentation of the object, using both our proposed method and FLS. e left column supplies a series of CT images with lesions that have uncertain sizes, intensities, and locations obviously. erefore, FLS has higher probabilities to induce the initial contour trapped into edge of lesions. e right column proves our method's advantage. For quantitative evaluation two method, we acquired another 200 CT images as testing set. And, the Dice coefficients were averaged on five initial contours. e statistics shows that FLS has average Dice coefficients 0.60 ± 0.12 and 0.63 ± 0.10 for the UM and MS model, respectively. ese Dice coefficients were significantly lower than the proposed method. Figure 7 clearly shows that our method outperforms the state-of-the-art FLS. Figure 8 demonstrates the convergence of FLS and our proposed algorithm on iterations of energy functional. At the beginning of running, both of them had a tendency to increase a little and FLS's extent more great. As expected, energy decreased with increasing iterations, converging on a single value; this implies minimization of the energy functional. For both metrics, substantial convergence was obtained after more than 60 iterations. Obviously, our proposed method had a faster speed of convergence. If adjusted to a proper larger length of the step, both of energy functional needed fewer times of iterations.

Comparison to Other State-of-the-Art Methods.
We compared ours to another state-of-the-art approach proposed by Kamnitsas et al. [32]. e author presented a novel model integrating fully conventional network (FCN) and conditional random field (CRF) for segmentation of the medical image. is paper devised an cost function which joined CRF item to describe more accurate classes of pixels in neighborhood, removing false positive effectively. In addition, they employed a dual pathway architecture that processes the input images at multiple scales simultaneously for accurate localization of the object. eir innovative works improved segmentation skill based on full deep learning and won growing attentions.
We tested Kamnitsas' method (CRF-FCN) of 2D mode on the same training or testing datasets. As CRF-FCN is fully automatic, it did not need any initial contours. On configuration of CRF-FCN hyperparameters, we followed the authors' idea including 50 trees and maximum depth of 30 in random forest baseline. CRF-FCN supplied 90% confidence interval of 0.75 ± 0.12 with a large distribution of Dice coefficients.
ese results enhance the strength of our proposed method, which is significantly better than both FLS and CRF-FCN. Figure 9 illustrates results of the CRF-FCN method, performances of which had completely different accuracies Computational Intelligence and Neuroscience 7     there is another sufficient testing database, it should have a chance of improvement in evaluation. In a word, independent identical distribution (IID) of training or testing plays an important role especially in segmentation based on the neural network model.

Computational Time.
We examined the computational time required by the proposed method to analyze segmentation. All data were trained and predicted using Tensorflow 1.2, and the process of level set happened in MATLAB R2017b 64 bit on Mac OS. Compared with FLS, our method needs an additional procedure of estimation for points till convergence of energy. In addition, the duration of one prediction for stable points on the zero level set by CNN was around 0.2s. In statistics, our proposed method costs 30% longer than FLS on average.
3.6. Sensitivity to Initialization. Different initializations of contours have great impacts on FLS [39,40]. In the experiment, we evaluated FLS and ours on five different initial contours on which total Dice coefficients were averaged (Table 2). Our proposed method showed better agreement with the manual marking and smaller changes in the segmentation performance when it was applied using different energy models, significantly better than FLS. e result revealed that despite with noisy lesions, our model can deal with substantial deviations of the location of the initial contour, holding great robustness.

Conclusions
Segmentation of maxillary sinus with lesion faces a problem that size, location, and heterogeneity are seriously irregular so that the state-of-the-art method cannot survive. Most of them are based on gray gradient which result in the active contour falling in local minima of lesion edge. How to escape and arrive the object is the key of our research.
In this paper, we present a novel method of an adaptive localizing region based of the level set using convolutional neural network. e algorithm automatically analyzes the region feature of a stable point on the active contour in every iteration. If it locates in the lesion, its speed could be compensated based on the probability of outputs by CNN. If not, it stays still till the end of iteration. e proposed mechanism makes sure the trapped point could evolve outward. erefore, our method is adaptive to cases of sinuses with possible lesions. Our proposed method shows high agreement with expert manual marking for a diverse dataset of CT images. e variety of spatial texture characteristics in our datasets emphasized the strength of our adaptive method, which performed well with inhomogeneous lesions and with noisy backgrounds.
We compared our results to uniform modeling (UM) and mean separation (MS) models of local version [16]. With the testing set of 200 images, Figure 6 demonstrates that fixed parameters of the localizing region-based level set cannot resolve problems, and most contours stop in local minima, especially illustrated in the cavities filled with more lesions. Moreover, our method has obvious predominance over FLS and is confirmed by statistics on quantitative evaluation ( Table 2).
Kamnitsas' method (CRF-FCN) [32] is very popular recently and derived from fully convolutional network and conditional random field ideas. It has been proved an efficient and effective algorithm compared with the other stateof-the-art method. With contrast experiment, our proposed method also outperformed significantly and drawbacks of CRF-FCN had been stated. Great variations of appearance in maxillary sinus denote CRF-FCN's frustration furthermore.
Besides, we discussed the convergence of FLS and our method. Figure 8 shows that ours has a faster speed of convergence although it costs 30% more longer computation time on an average. Meanwhile, experiments also gave the confidence of insensitivity to initialization of contours. It implies a great potential of our method in full automatic segmentation. Consequently, our combination of deep learning and level set captures the benefits of both approaches and overcomes their limitations, to achieve significantly better results than either method alone. e presented work has some limitations. First, whether μ 1 , λ 1 , and λ 2 of energy functional are sensitive to our method is not discussed in the work. In addition, 3D segmentation based should be a future direction, as well as incorporation of full automatic segmentation, without any user input. In summary, the method presented shows more advanced skills compared with the state-of-the-art level set methods. It performed better in heterogeneous texture between foreground and background, providing a new area of research.

Data Availability
e experimental data used to support the findings of this study are available from the corresponding author upon request.