A Robust Image Segmentation Framework Based on Nonlocal Total Variation Spectral Transform

School of Mathematics and Statistics, Nanjing University of Information Science and Technology, Nanjing 210044, China; Department of Clinical Immunology, Xijing Hospital, Fourth Military Medical University, No. 127 Changle West Rd., Xi’an 710032, China; Jiangsu Engineering Center of Network Monitoring, School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China


Introduction
Image segmentation refers to partitioning an image into multiple homogeneous parts or objects. It plays a significant role in a broad range of computer vision applications, including scene understanding [1], image compression [2], and image retrieval [3,4]. To date, two categories of segmentation methods have been widely studied: data-driven methods [5][6][7] and model-driven methods.
Among data-driven methods, the common strategy is to extract the semantic features of images using deep convolutional neural networks, based on which each pixel obtains a semantic label to realize segmentation. Popular deep neural networks for semantic segmentation include FCN [5], U-Net [6], and SegNet [7], which can obtain satisfying segmentation results without any postprocessing. However, deep neural networks often suffer from high computational resource consumption and need a large amount of labeled data. Moreover, the interpretability of neural networks remains an Achilles' heel. Therefore, model-driven methods are the focus of our research.
According to their segmentation strategies, model-driven methods can be further categorized as boundary-based methods, region-based methods, hybrid methods, and transform-based methods. Boundary-based methods separate objects from the background by edge or shape. The representative methods include edge detection [8][9][10] and graph-cut methods [11,12]. The former uses intensity discontinuity to segment an object. Common edge detection operators include Prewitt [8], Sobel [9], Roberts [9], and Canny [10]. Compared with the edge detection approaches, graph-cut-based methods can achieve better segmentation accuracy. Nonetheless, the extraction of gradients is sensitive to noise, which makes the boundary-based models produce unsatisfying segmentation results for noisy images.
Region-based approaches recognize similar regions and complete segmentation by means of statistical techniques. The Chan-Vese model [13] and FCM [14] are representative works. The Chan-Vese model drives the contour curve toward the object boundary by minimizing the energy on both sides of the evolving curve [15]. Nevertheless, the Chan-Vese model fails to obtain satisfying results in the presence of intensity inhomogeneity. FCM improves the tolerance to ambiguity and obtains more reasonable segmentation results by introducing a membership matrix. However, FCM is not robust to noise because it considers only gray-level information. To solve the problem, many variants of FCM [16][17][18][19] have been developed, which bring good segmentation performance. Nonetheless, the improved methods are still sensitive to complex backgrounds and intensity inhomogeneity.
Hybrid methods employ boundary information to detect the region of objects and then use region information to preserve the boundary structures. Recently, transition region (TR)-based image thresholding [20][21][22][23] has been proposed as a type of hybrid method. The method first uses edge detectors or statistical techniques to extract a transition region, which is a structure similar to the image edge, and then segments the image by a threshold, which is the mean gray level of the transition region. TR-based image thresholding additionally exploits the spatial information to acquire more satisfying segmentation results. However, it is a global thresholding method, which is not robust to intensity inhomogeneity.
The aforementioned model-driven methods segment the image using spatial features, which results in sensitivity to noise. In contrast, transform-based approaches first transform the image to a specific domain according to mathematical theories, where noise and image details behave differently. Then, denoised images are obtained by filtering and inverse transformation, on which postprocessing is performed to segment the image.
As one of the popular transform approaches, the wavelet transform is widely used in diverse computer vision tasks because of its ease of use and multiresolution processing ability. The common use of the wavelet transform in image processing is to decompose the image into multiscale sub-bands in the wavelet domain with the help of Mallat's pyramid algorithm [24]. Then, the image is filtered by a low-pass, band-pass, or high-pass filter to obtain the required features. Finally, the processed image is obtained by the inverse transform. To get satisfying segmentation results, the wavelet transform is often combined with other segmentation methods, such as watershed segmentation [25], clustering approaches [26], and image thresholding methods [27]. For instance, the method in [25] first decomposes the original image into a multiscale pyramid representation in the wavelet transform domain. Secondly, the watershed algorithm is applied to segment every image of the multiscale pyramid into several regions, including objects and background. Thirdly, the inverse wavelet transform is conducted on the split regions to get the next higher resolution representation. Finally, the size of the split regions gradually becomes the same as that of the regions in the ground truth to achieve the segmentation result. Nonetheless, wavelet transform-based methods are sensitive to contrast, and the segmentation results are influenced by the selection of wavelet basis functions.
Recently, the NLTV spectral theory has been introduced [28] and has attracted attention. The NLTV spectral transform maps the image from the spatial domain to the spectral domain, in which objects with different contrast, size, and detailed structures can be distinguished well. Additionally, the NLTV spectral transform can preserve image structures because of its nonlocal operators [28]. To this end, we further discuss the performance of the NLTV spectral theory and attempt to further enhance the applicability of the NLTV spectral transform. Inspired by the work [29], we demonstrate the sensitivity of the NLTV spectral transform to size, contrast, and detailed structures in images with or without noise. We also show that the spectral transform is invariant to rotation and translation. Besides, we are motivated to put forward a robust image segmentation framework with the NLTV spectral transform. The main process is as follows. Firstly, the NLTV flow is imposed on an image to acquire the NLTV spectral transform, by which the spectral response and a salient time map of the image are calculated. The elements in the salient time map represent the max response time of each pixel of the image. Secondly, we filter the salient time map by a Gaussian filter to remove the isolated points and perform a least-squares regression using a polynomial on the filtered map to fit a separation surface. Thirdly, the image is filtered by the surface in the NLTV spectral domain, followed by the NLTV inverse transform to obtain a rough segmentation result. Finally, we use morphological operators and a binary process to refine the segmentation result.
It should be noted that the total variation (TV) spectral transform-based method [30] follows a similar idea in segmenting images with noise. However, the TV spectral transform used in [30] calculates the horizontal and vertical gradient of every pixel, which means only local information is selected to describe object features. In [30], the TV flow is obtained by iteratively solving the ROF model, and then the TV spectral transform is yielded. Because the edge detail of objects is lost when solving the ROF model, the guided filter is adopted to refine the object edge in [30]. In contrast to the spectral transform strategy in [30], our method pays more attention to the difference between one pixel and all other pixels in the image, termed nonlocal gradients, to achieve the NLTV spectral transform. With the nonlocal information, the edge details can be effectively preserved when segmenting the object under a variety of noises. In addition, our segmentation framework does not introduce the guided filter, which may bring the noise from the original image into the segmentation result. We perform the experiments on synthetic, natural, and medical cell images, which demonstrate that the proposed method can achieve competitive segmentation performance compared with the state-of-the-art methods.
Overall, the contributions of this work are twofold:
(i) We illustrate the properties of the NLTV spectral transform by theoretical proof and experiments. The analysis demonstrates that objects with varying size, contrast, and detailed structures can be distinguished in the NLTV spectral domain. Additionally, the transform is invariant to rotation and translation. These properties indicate the feasibility of segmentation based on the NLTV spectral transform.
(ii) We propose an image segmentation framework using the NLTV spectral transform, which fits a separation surface to filter sub-bands in the NLTV spectral domain and obtains segmentation results by means of postprocessing. Our method can achieve satisfying results for images with diverse noise or complex texture.
The rest of the article is structured as follows: Section 2 gives an overview of the NLTV spectral theory. Section 3 discusses the properties of the NLTV transform and introduces our segmentation framework based on the NLTV spectral transform. Section 4 illustrates the experimental results of the proposed method. Finally, the paper is concluded in Section 5.

Preliminaries
This section introduces the NLTV spectral transform framework [28]. The framework consists of several parts: nonlocal operators, NLTV flow, NLTV spectral transform, and spectral response.

Nonlocal Operators.
According to the continuous definitions of the nonlocal gradient and divergence on graphs [31], three nonlocal operators, namely nonlocal derivatives, nonlocal gradients, and nonlocal divergences, are defined as follows. Let Ω ⊂ R² be a bounded domain and w(X, Y) ≥ 0 be non-negative weights between any two points X, Y ∈ Ω. In the view of graphs, these weights correspond to a certain relationship between these points. For simplicity, we assume that these weights are symmetric, which means w(X, Y) = w(Y, X). Then, Gilboa and Osher [28] extended the local derivative to a nonlocal version by the following definition: where u(X) is a real function, u: Ω ⟶ R, 0 < w(X, Y) < ∞, and ∂_Y u(X) represents the partial derivative of u at the point X in the direction of the point Y. Similar to local gradients derived from local partial derivatives, the nonlocal gradient ∇_w u(X): Ω ⟶ Ω × Ω is defined as the vector composed of all partial derivatives.
Before introducing the nonlocal divergence, the definition of the inner product for vectors is given. Denoting two vectors as v⃗_1 and v⃗_2, their inner product is defined as follows: Then the nonlocal divergence (div_w v⃗)(X): Ω × Ω ⟶ Ω is defined as the adjoint of the nonlocal gradient.
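The operators above can be illustrated on a discrete set of points. The following is a minimal sketch, assuming the standard graph forms (nonlocal gradient (u(Y) − u(X))√w(X, Y) and divergence Σ_Y (v(X, Y) − v(Y, X))√w(X, Y)); these explicit expressions, the function names, and the flattened-image representation are assumptions made for illustration and are not taken from the text of this section.

```python
import numpy as np


def nonlocal_gradient(u, w):
    """Return g with g[x, y] = (u[y] - u[x]) * sqrt(w[x, y]).

    u is a flattened image (one value per point), w a symmetric,
    non-negative weight matrix between all pairs of points.
    """
    return (u[None, :] - u[:, None]) * np.sqrt(w)


def nonlocal_divergence(v, w):
    """Return d with d[x] = sum_y (v[x, y] - v[y, x]) * sqrt(w[x, y])."""
    return np.sum((v - v.T) * np.sqrt(w), axis=1)


def vector_inner_total(v1, v2):
    """Inner product of two nonlocal vectors, summed over all point pairs."""
    return float(np.sum(v1 * v2))
```

With symmetric weights, the adjoint relation ⟨∇_w u, v⟩ = ⟨u, −div_w v⟩ holds exactly in this discrete setting, which is a convenient sanity check for any implementation of the two operators.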

NLTV Flow.
The weight matrix W depends on the patch similarity. For a fixed point X and an arbitrary point Y in the image, W(X, Y) represents the weight between the points X and Y, which is defined as follows: where P(X) and P(Y) represent the patches centered at points X and Y in the image, respectively. σ is a parameter that controls the decay of the exponential function. E(X, Y) describes the similarity between the points X and Y. NLTV is divided into two types: isotropic NLTV and anisotropic NLTV. The former is defined as follows: The latter is defined as follows: In our work, the anisotropic NLTV is applied to calculate the NLTV flow.

NLTV Transform.
The sine and cosine functions are the basis functions of the Fourier transform. The amplitudes of these basis functions form impulses in the Fourier domain. The work [28] generalized this idea to the NLTV domain. By examining the elementary structures (disks) of the NLTV functional, the second time derivative of the NLTV flow is considered to represent the impulse of an elementary structure. Hence, the NLTV transform is defined by the following: where t ∈ (0, ∞) is the time parameter of the NLTV flow equation (7), and u_tt is the second time derivative of the NLTV flow. For the NLTV transform, the inverse transform reconstructs a signal or image from all ϕ(t) elements.
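The patch-based weights and the two NLTV variants can be sketched as follows. This is only an illustration under stated assumptions: the weight is taken as exp(−E(X, Y)/σ²) with E(X, Y) the squared Euclidean distance between patches, the isotropic and anisotropic energies use their usual graph forms, and the names `patch_weights`, `radius`, and `sigma` are hypothetical. The exact expressions used by the authors are not legible in this section.

```python
import numpy as np


def patch_weights(image, radius=1, sigma=0.5):
    """Assumed Gaussian patch-similarity weights between all pixel pairs."""
    padded = np.pad(image, radius, mode="reflect")
    h, w = image.shape
    size = 2 * radius + 1
    # One flattened patch per pixel, in row-major (ravel) order.
    patches = np.array([
        padded[r:r + size, c:c + size].ravel()
        for r in range(h) for c in range(w)
    ])
    # E(X, Y): squared Euclidean patch distance (an assumption).
    diff = patches[:, None, :] - patches[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    weights = np.exp(-dist2 / sigma ** 2)
    np.fill_diagonal(weights, 0.0)  # no self-connections
    return weights


def anisotropic_nltv(u, weights):
    """Sum over all pairs of |u(Y) - u(X)| * sqrt(w(X, Y))."""
    flat = u.ravel()
    return float(np.sum(np.abs(flat[None, :] - flat[:, None]) * np.sqrt(weights)))


def isotropic_nltv(u, weights):
    """Sum over X of sqrt(sum_Y (u(Y) - u(X))^2 * w(X, Y))."""
    flat = u.ravel()
    sq = (flat[None, :] - flat[:, None]) ** 2 * weights
    return float(np.sum(np.sqrt(np.sum(sq, axis=1))))
```

Building all pairwise weights costs memory quadratic in the number of pixels, so the sketch is only practical for small images; restricting Y to a search window around X is the usual remedy. With fixed weights, both energies are one-homogeneous (scaling u by a > 0 scales the energy by a), which is the property behind the contrast behaviour discussed later.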

Wireless Communications and Mobile Computing
where ū = (1/|Ω|) ∫_Ω u(X) dX is the residual part of the NLTV transform, and it is also the mean value of the initial condition.

NLTV Spectral Response.
Corresponding to the amplitude of the response in the Fourier domain, the NLTV spectral response is defined as follows: The NLTV spectral response can roughly measure the importance of image information at different time scales in the NLTV spectral domain [28]. The main features of the image emerge at the time scales corresponding to a high response. Otherwise, the NLTV spectral transform can be considered negligible.
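For orientation, the transform, its inverse, and the spectral response are commonly written in the following form in the spectral TV literature. This is a sketch under the assumption that the nonlocal framework mirrors those definitions; the expressions are not copied from this section, whose equations are not legible here.

```latex
\begin{aligned}
\phi(t, X) &= t \, u_{tt}(t, X), \\
f(X) &= \int_0^{\infty} \phi(t, X)\, dt + \bar{u}, \\
S(t) &= \|\phi(t, \cdot)\|_{L^1(\Omega)} = \int_{\Omega} |\phi(t, X)|\, dX .
\end{aligned}
```

Here u(t, X) denotes the NLTV flow started from the image f, and ū is the mean-value residual described above.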

Proposed Method
This section discusses the properties of the NLTV spectral transform and presents a segmentation method for images with noise using the NLTV spectral transform. Firstly, as motivation, the seminal works [29,30], which demonstrate the properties of the TV spectral transform in images with or without noise, are extended to the NLTV spectral transform. Secondly, a segmentation method using the NLTV spectral transform for images with noise is introduced.

Motivation.
This section investigates the properties of the NLTV spectral transform in images with or without noise. Theory and experiments without noise are shown first. Then, the properties are extended to the noisy condition by experiments. The typical noises in digital images are additive noise, multiplicative noise, and impulse noise. For this reason, we corrupt the images with Gaussian noise, Salt & Pepper noise, and Speckle noise.

Property 1: Sensitivity to Size.
A short proof of the property is provided. For the sake of simplicity, we consider scaling a gray-level image f(X), where X = (x, y) ∈ Ω. Then, the image after scaling can be denoted as f(aX). With the above notations, we explore why the NLTV spectral transforms of the images before and after scaling satisfy the following relationship: where ϕ(t, X) and ϕ̃(t, X) are the NLTV spectral transforms corresponding to the images before and after scaling, respectively. Notice that for the original image f(X), the NLTV flow can be derived from the following partial differential equation: Inspired by the case of TV, we consider the elementary structures called nonlocal disks for the image f(X). A set A can be used as a nonlocal disk when two conditions are satisfied [28]: 1) A is a nonlocal calibrable set. 2) The curvature is constant on the internal boundary of the set A. The characteristic function of A is expressed as follows: where λ_A = Per(A)/|A|, and Per(A) and |A| are the perimeter and area of A, respectively. In the same way, the NLTV flow of the nonlocal disk A′ for the image f(aX) is as follows: The energy of points in the images f(X) and f(aX) decreases with the average speed of λ_A and λ_A′, respectively. It is worth noting that λ_A is equal to λ_A′ because the object patterns before and after scaling are similar. Hence, we have
Figure 1 is an example showing how the NLTV spectral transform separates objects of different sizes. The multiscale NLTV spectral descriptions of the pixels are shown in Figure 1(b), which shows that there is a positive correlation between the size and the time to reach the max spectral response. In addition, we can find that the disappearance order of objects in Figure 1(c) is consistent with the order of reaching the max spectral response time in Figure 1(b). Figure 1(d) shows the visualization of sub-bands in the NLTV spectral domain, and it is a more intuitive interpretation of Figure 1(b). Moreover, Figure 2 shows that the sensitivity of the NLTV spectral transform to size is similar under different noises.

Property 2: Sensitivity to Local Contrast.
Building on the work [29], we attempt to provide a short proof. The image after gray-scale transformation by a factor a is denoted as af(X). Then, we plan to prove that the NLTV spectral signatures of f(X) and af(X) satisfy the following relationship: Note that ϕ(t, X) is still related to the characteristic function χ_A(X) mentioned in Property 1. Following the analysis of Property 1, the NLTV flows of f(X) and af(X) are as follows: where t̃ = at, A = A′, and X̃ = X, so that ũ(t̃, X̃) = au(t, X).
An example is demonstrated on a synthetic image without noise, as shown in Figure 3. The image exhibited in Figure 3(a) contains four squares of different contrast on a black background. The NLTV spectral transform is calculated, and the multiscale NLTV spectral descriptions of different pixels are shown in Figure 3(b). Figures 3(c) and 3(d) give a more intuitive view, which indicates that the low-contrast squares disappear first. In addition, the NLTV spectral transform is applied under different noises to verify its performance. As shown in Figure 4, except for small time scales, the NLTV spectral description behaves similarly, which demonstrates the sensitivity of the NLTV spectral transform to contrast in images with noise.

Property 3: Sensitivity to Detailed Structures.
Figure 5 shows objects with diverse structures and gives an intuitive description. The center square with high contrast is decomposed first. Then, the square ring to which the blue point belongs starts to be decomposed. The black square ring is decomposed last. The experiment indicates the sensitivity of the NLTV spectral transform to detailed structures. The phenomena are caused by the nonlinear property of the NLTV spectral transform. Assuming that images f and g make up the image h, the response of these images satisfies the following:
To observe the decomposition process of the NLTV spectral transform under noise, examples are carried out with different noises. Figure 6 shows the decomposition results of different pixels under diverse noises. It can be seen that, except for small time scales, the NLTV spectral description is similar to the case shown in Figure 5(b). The experiments demonstrate that the NLTV spectral transform is sensitive to detailed structures.

Property 4: Invariance to Rotation and Translation.
Suppose the original image is denoted as f(X), X ∈ Ω. Then, the image after rotation by angle θ about the origin is f(RX), where R is the rotation matrix. Moreover, the image after translation by a spatial shift d is f̃(X) = f(X − d). In essence, the rotation or translation of the image is equal to rotating or translating the coordinate system in the original image. On the other hand, the NLTV spectral transform is invariant to the coordinate system and sensitive to derivatives. Therefore, the NLTV spectral transform is invariant to rotation and translation, i.e.,
There are three groups of objects with different shapes in Figure 7(a). The objects in the same group have the same shape and contrast. Different objects have been translated to different positions and rotated by different angles. As Figure 7(b) shows, the objects in the same group have a similar NLTV spectral description. More intuitive illustrations are displayed in Figures 7(c) and 7(d), which show that the objects within the same group disappear simultaneously. Figure 8 shows the NLTV spectral descriptions of different pixels corrupted with noises. The bottom row of Figure 8 shows that the objects with the same shape have similar descriptions at large time scales, even though they have distinct rotations and translations.
Figure 9 shows the flowchart of the proposed method. The method starts with the decomposition of an original image in the NLTV spectral domain. Then, the available information dimension of every pixel in the image increases from one to the number of time scales. To better select appropriate components, a soft threshold band-pass filter is chosen to replace the traditional hard threshold band-pass filter. After obtaining the separation surface, an inverse transform is used to get an abstract structure. The segmentation result is obtained with the help of the binary process and morphological operations.

NLTV Spectral Decomposition.
In this subsection, the process of image decomposition using the NLTV spectral transform is illustrated in detail. Assuming that the number of decomposition components is N, the NLTV flow {u(i)} can be calculated with the help of formulae (6) and (7). According to the definition of the NLTV spectral transform described in formula (8), the second derivative of the element u(i) with respect to the time scale needs to be computed. To speed up the calculation, the first and second derivatives are combined, as expressed by formula (20), where Δt is the time interval. The NLTV transform is obtained from u_tt by equation (21). The NLTV spectral response can also be calculated using equation (10). The residual can be computed by equation (9). If the forward time difference u_t(i) = (u(i + 1) − u(i))/Δt is used to calculate the first derivatives, the residual part f can be transformed into formula (22).
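The discrete decomposition described above can be sketched as follows. Since formulas (20) to (22) are not legible in this section, the sketch assumes the usual discretization from the spectral TV literature: a central second difference for u_tt, ϕ(i) = u_tt(i)·t_i with t_i = iΔt, and the residual (N + 1)u(N) − N·u(N + 1), which is the form that makes the discrete reconstruction exact. The function names and the layout of the `flow` array are hypothetical.

```python
import numpy as np


def spectral_decomposition(flow, dt):
    """Decompose a precomputed flow into spectral components.

    flow has shape (N + 2, ...) with flow[i] = u(i * dt) and flow[0] = f.
    Returns (phi, response, residual, times) for i = 1..N.
    """
    flow = np.asarray(flow, dtype=float)
    n = flow.shape[0] - 2
    # Central second difference in time (assumed form of formula (20)).
    u_tt = (flow[2:] + flow[:-2] - 2.0 * flow[1:-1]) / dt ** 2
    times = dt * np.arange(1, n + 1)
    # phi(i) = u_tt(i) * t_i (assumed form of formula (21)).
    phi = u_tt * times.reshape((n,) + (1,) * (flow.ndim - 1))
    # Spectral response: L1 norm of each component over the image.
    response = np.abs(phi).reshape(n, -1).sum(axis=1)
    # Residual obtained from forward differences (assumed formula (22)).
    residual = (n + 1) * flow[n] - n * flow[n + 1]
    return phi, response, residual, times


def reconstruct(phi, residual, dt):
    """Inverse transform: integrate all components and add the residual."""
    return phi.sum(axis=0) * dt + residual
```

Under these assumptions the sum telescopes, so `reconstruct` returns `flow[0]` up to rounding for any input stack. This gives a simple check that a flow solver and the decomposition are wired together consistently.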

Object and Background Separation.
After the decomposition of the original image in the NLTV spectral domain, the available information dimension of every pixel in the image increases from one to the number of time scales, i.e., the information used before decomposition is just pixel value. Inspired by the work [29], a separation surface is selected to effectively reduce the interference of noise on segmentation.
To better characterize the features of objects in the image, time parameters t_1 and t_2 are chosen to construct a time range [t_1, t_2]. Based on the above analysis of the four properties of the NLTV spectral transform, the max response time is computed to describe the image. The max response time here is different from the spectral response of equation (10). As equation (10) shows, the spectral response is calculated from the element ϕ(t) of the image in the NLTV spectral domain and reflects the significant part of the image. The NLTV element ϕ(t) on a time scale t with a low response contains unimportant features, which can be discarded. However, formula (10) shows that the spectral response fails to reflect the spatial information of the objects. To better analyze the behavior of pixels in the NLTV spectral domain, the max response time is calculated. Specifically, the NLTV spectral transform first decomposes the image into several spectral components over the time scales, as shown in Figure 9. Then, every pixel in the image corresponds to a set of spectral responses. The time scale of the maximum spectral response is selected to indicate the behavior of the local spatial information in the NLTV spectral domain. The maximum response times of pixels inside the same target tend to be close. Therefore, different objects in the image can be extracted by analyzing the max response time of each pixel. In other words, a salient time map T(X) for each point X is calculated by equation (23).
To extract more meaningful information about the segmentation target, we fit a separation surface, which acts as a band-pass filter, to separate the target from undesired information. Firstly, the filtered max response map T_filter(X) is obtained by performing Gaussian filtering on T(X) to ensure the smoothness of the separation surface. Then, the time scales corresponding to the maximum spectral response are stored as scattered points, on which least-squares regression is performed to fit the surface T_sur(X). The fitted surface can be regarded as a soft threshold in the range of
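The three steps of this subsection (salient time map, Gaussian smoothing, least-squares surface fitting) can be sketched with NumPy alone. The argmax form of the salient time map, the Gaussian width, and the polynomial degree are assumptions chosen for illustration; the section does not state the values used by the authors, and the function names are hypothetical.

```python
import numpy as np


def salient_time_map(phi, times):
    """T(X): time scale at which |phi(t, X)| is maximal for each pixel."""
    return times[np.argmax(np.abs(phi), axis=0)]


def gaussian_smooth(image, sigma=1.0):
    """Separable Gaussian filter with edge padding (removes isolated points)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def fit_polynomial_surface(values, degree=2):
    """Least-squares fit of a bivariate polynomial to a 2-D map."""
    h, w = values.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalize coordinates to [0, 1] to keep the design matrix well scaled.
    x = xx.ravel() / max(w - 1, 1)
    y = yy.ravel() / max(h - 1, 1)
    columns = [x ** i * y ** j
               for i in range(degree + 1)
               for j in range(degree + 1 - i)]
    design = np.stack(columns, axis=1)
    coeffs, *_ = np.linalg.lstsq(design, values.ravel(), rcond=None)
    return (design @ coeffs).reshape(h, w)
```

A typical call chain would be `fit_polynomial_surface(gaussian_smooth(salient_time_map(phi, times)))`. A low polynomial degree keeps the surface smooth, which matches the stated goal of a soft rather than hard threshold.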

Desired Objects Segmentation.
Image reconstruction, which is also called the inverse transform, is implemented after surface fitting. The time scale band represents the integration times of each pixel for the object. The target in the original image is easily obtained by integrating over a specific time scale band using reconstruction formula (25). Binary processing is performed after the inverse transform to obtain the segmentation mask. Finally, morphological operations are used to refine the final mask. By the above operations, the desired segmentation mask f_output is obtained. To exhibit more details of the proposed method, Algorithm 1 shows the specific process of the NLTV spectral transform-based method for robust image segmentation.
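The reconstruction and refinement steps can be sketched as below. Formula (25) is not legible in this section, so the band-pass rule here is generic: it keeps the components whose time lies between per-pixel (or scalar) bounds `lower` and `upper`, without claiming which bound the fitted surface provides. The mean-value threshold and the 3×3 morphological opening are likewise assumptions, and all function names are hypothetical.

```python
import numpy as np


def band_pass_reconstruct(phi, times, lower, upper, dt):
    """Integrate phi over the time scales t with lower <= t <= upper.

    lower and upper may be scalars or per-pixel maps (e.g. a fitted surface).
    """
    t = times.reshape((-1,) + (1,) * (phi.ndim - 1))
    keep = (t >= lower) & (t <= upper)
    return np.sum(np.where(keep, phi, 0.0), axis=0) * dt


def binary_erode(mask):
    """Erosion with a 3x3 square; pixels outside the image count as False."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.ones((h, w), dtype=bool)
    for dr in range(3):
        for dc in range(3):
            out &= padded[dr:dr + h, dc:dc + w]
    return out


def binary_dilate(mask):
    """Dilation with a 3x3 square."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for dr in range(3):
        for dc in range(3):
            out |= padded[dr:dr + h, dc:dc + w]
    return out


def refine_mask(image, threshold=None):
    """Threshold the reconstruction, then apply an opening to drop specks."""
    if threshold is None:
        threshold = image.mean()  # assumed threshold choice
    mask = image > threshold
    return binary_dilate(binary_erode(mask))
```

An opening removes isolated foreground pixels but also erodes thin structures and corners, which is consistent with the edge erosion the authors report for their morphological step in the experiments.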

Data and Settings.
To evaluate the performance of the proposed method, synthetic, natural, and medical images are used for experiments.
1) The first experiment contains 3 groups of synthetic images whose textures are taken from the Brodatz Textures dataset [32]. Speckle, Salt & Pepper, and Gaussian noises are added to each group of synthetic images separately.
2) The second experiment contains 3 groups of natural images taken from the MSRA-1000 dataset [33].
3) The third experiment contains 1 group of cell images, which is taken from the Fluo-N2DH-SIM+ dataset [34].
Three different types of noises are also added to the natural and medical images.
We compare our segmentation method with four classical methods, i.e., the C-V model [13], FCM [14], FRFCM [19], and the wavelet segmentation method (WSM) [27]. The experiments are implemented on the MATLAB R2020b platform on a PC with 16 GB RAM. The parameter settings for the proposed method are as follows. Experiments show that when the image is transformed into the NLTV domain, detailed information is located at small time scales. Large scales, which are close to T, contain less important information. Objects are mostly distributed in the middle scales. Hence, a middle-scale time range [t_1, t_2] is selected. In the following experiments, t_1 is set to T/5 and t_2 is set to 3T/5. The parameters T and Δt are set to 9 and 0.03, respectively.
To measure the difference between segmentation results and ground truths, FPR and FNR are chosen in the subsequent experiments. FPR calculates the number of background pixels classified as object pixels relative to the total background pixels. FNR measures the number of object pixels classified as background pixels relative to the total object pixels. FPR and FNR are defined as follows: where B_R and B_G represent the number of background pixels in the segmentation results and ground truths, respectively. Additionally, O_R and O_G are the numbers of object pixels in the segmentation results and ground truths, respectively. DICE measures segmentation accuracy by calculating the degree of spatial overlap. Specifically, for the result region A and target region B, where ∩ means the intersection of two sets. The value range of DICE is [0, 1]. A higher DICE indicates that the segmentation result is more precise. DICE(A, B) = 1 demonstrates that the segmentation result is the most complete, while DICE(A, B) = 0 shows that the segmentation result is the worst. Another evaluation metric is SA, which assesses the proportion of well-classified pixels in the image. The definition of SA is given as follows: where f_i^truth means the correctly segmented pixel and N denotes the total number of pixels in an image.
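The four metrics can be computed from a pair of binary masks as in the sketch below. It follows the verbal definitions above and the standard DICE form 2|A ∩ B|/(|A| + |B|); the function name and the dictionary layout are hypothetical, and the sketch assumes the ground truth contains both object and background pixels.

```python
import numpy as np


def segmentation_metrics(result, truth):
    """Compute FPR, FNR, DICE, and SA for two binary masks (True = object)."""
    result = np.asarray(result, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    false_pos = np.sum(result & ~truth)   # background labeled as object
    false_neg = np.sum(~result & truth)   # object labeled as background
    overlap = np.sum(result & truth)
    return {
        "FPR": false_pos / np.sum(~truth),
        "FNR": false_neg / np.sum(truth),
        "DICE": 2.0 * overlap / (np.sum(result) + np.sum(truth)),
        "SA": float(np.mean(result == truth)),
    }
```

A high FNR together with a low FPR indicates under-segmentation (parts of the object are missing), while the opposite pattern indicates that background leaks into the mask.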

Parameter Analysis.
This section analyzes the effects of Δt and T on the segmentation results of the proposed method through an experiment. The experiment was carried out on MSRA-1000, and the average SA was used as an indicator to show the influence of the two parameters on the segmentation accuracy. The average SA was calculated by averaging the SA of all images in the dataset. The parameter Δt ranges from 0.01 to 0.1 with a step of 0.01. Additionally, the maximal time scale T ranges from 1 to 10 with a step of 1. Figure 10 shows the results for different Δt and T. The proposed method achieves the best performance when Δt = 0.03 and T = 9.

Synthetic Images.
The first experiment was implemented on three synthetic images, which are shown in Figure 11. The first row shows a synthetic image containing multiple repeating structures and a dark grid-like background. A simple synthetic image, which has an irregular object, is arranged in the middle row. The object in the bottom row is complex and has a texture with inhomogeneous contrast. Moreover, the three images are separately contaminated with Speckle (10% variance), Salt & Pepper (10% density), and Gaussian (10% variance) noise. Table 1 lists the quantitative evaluations of different segmentation methods on the various images. As Figure 11 and Table 1 show, FCM produced incorrect segmentation results because of its sensitivity to noise. FRFCM achieved a good result on the first image and got high DICE and SA values, as shown in Table 1. However, it failed on the second and third images because of the inhomogeneous contrast. WSM, which is based on spectral analysis, can remove the influence of noise. However, as Figure 11 shows, WSM oversmoothed the edges and damaged the edge details. Meanwhile, WSM was unable to segment objects accurately on the second and third images. The reason is that WSM is sensitive to inhomogeneous contrast.
The C-V model obtained more correct segmentation results on all images. One of the reasons was that the C-V model relies on an initial contour, which provides prior information about the approximate position of the object. Nevertheless, the C-V model was sensitive to noise. On the second and third images, the C-V model was unable to accurately segment the targets. The noises slowed down the convergence of the algorithm and made the method fall into a local minimum. By contrast, the proposed method achieved the best results among all methods. The NLTV spectral transform-based method can segment the objects exactly and can reduce the influence of inhomogeneous contrast at the same time. The reason is that our method segments objects by combining object size, contrast, and structures. As shown in Table 1, the proposed method got a high FNR on the second synthetic image, which indicates under-segmentation. The problem was caused by the morphological operators applied to the output of the proposed method, which may erode the edges.

Natural Images.
To further discuss the proposed method's segmentation ability for images with various noises, the second experiment was performed on three natural images, which are shown in Figures 12, 13, and 14. e object that has a similar contrast to the surroundings is shown in Figure 12. Figure 13 displays a complex scene that has lots of tiny structures in the background. e object in Figure 14 is a piece of paper containing words, and the Input: gray image f. Output: segmentation mask f output .
(2) Calculate the number of decomposition components N � T/Δt.
i�0 using equations (6) and (7). (4) Calculate NLTV residual part f using equation (22). (5) for i � 1, 2, . . . , N do (6) Compute the second derivatives in time of flow for each pixel X by equation (20). (7) Achieve NLTV transform by equation (21). (8) Calculate NLTV spectral response using equation (10). (9) end for (10) Select time parameters t 1 and t 2 according to the NLTV spectral response. (11) Compute the salient time map T(X) by equation (23). (12) Obtain T filter (X) by performing Gaussian filtering on T(X). (13) Get the fitted surface T sur (X) by performing least square regression on T filter (X). (14) Reconstruct the result I(X) using equation (25). (15) Get the segmentation mask f bw (X) by thresholding segmentation on I(X). (16) Get the final mask f output (X) by performing morphological operations on f bw (X).    words will interfere with segmentation methods. Moreover, three images are separately contaminated with Speckle (10%, 20%, and 30% variance), Salt & Pepper (10%, 20%, and 30% density), and Gaussian (10%, 20%, and 30% variance) noise. As Figure 12 shows, the object has a similar contrast to the surrounding border. FCM separated the noise while segmenting the object because of its sensitivity to noise. FRFCM had better results than FCM, however, it still had wrong segmentation for noise. WSM can remove the influence of noise. However, WSM failed to remove the impact of inhomogeneous contrast.
The C-V model achieved accurate segmentation results because of its initial contour. Table 2 shows that the C-V model had DICE and SA values similar to those of the proposed method. However, the C-V model had difficulty segmenting corner structures because of the noise, whereas the proposed method better preserved structural information while segmenting.

Figure 13 shows the algorithms' performance on the natural image corrupted by Salt & Pepper noise, and Table 3 lists the corresponding quantitative metrics. In Figure 13, the background contains many small objects whose contrast is similar to that of the object. When these small targets are contaminated with Salt & Pepper noise, they seriously interfere with segmentation methods that rely mainly on contrast. Figure 13 shows that WSM produces a good result; however, it is unable to segment the areas surrounding the object correctly. Table 4 shows that the C-V model has better results than WSM; however, it still segments part of the background incorrectly. Because the NLTV spectral transform is sensitive to contrast, size, and structure, the proposed method can still separate objects when the background contains small structures.

Figure 14 shows the segmentation results on the natural image corrupted with different levels of Gaussian noise, and Table 4 lists the corresponding quantitative metrics. This image is difficult for segmentation methods because it contains complex internal texture, such as words, which affects the integrity of the segmentation results. The C-V model handled the background well; however, it was unable to handle the interference of the object's internal texture. FRFCM and WSM coped with the noise and the internal texture but failed to remove the interference caused by contrast. Moreover, WSM could not obtain accurate edge information: as shown in Figure 14, WSM expanded the object, and the edge details disappeared.
In contrast, our method can handle the interference caused by noise. The NLTV spectral transform is sensitive to local contrast and size; hence, it can separate the low-contrast words on the paper scrap. Because of the differences in contrast and structure between the paper scrap and the background, the proposed method can separate the object from the background and extract the object's edge details correctly. Table 4 shows that the proposed method has high FNR values. In Figure 14, the bottom edge in the results of the proposed method is slightly expanded, and the left edge is visibly eroded. The main reason is that the morphological operations erode the segmentation result.
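The DICE, SA, and FNR values cited above are overlap statistics between a predicted mask and a ground-truth mask. The following Python sketch uses the standard confusion-matrix forms of these metrics, which is an assumption here: the definitions used for the tables are given elsewhere in the paper and may differ in detail. The function names and the toy masks are hypothetical.

```python
def confusion_counts(pred, truth):
    """Count (TP, FP, FN, TN) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def dice(pred, truth):
    """Dice coefficient: 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def segmentation_accuracy(pred, truth):
    """Assumed form of SA: fraction of pixels labeled correctly."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    return (tp + tn) / (tp + fp + fn + tn)


def false_negative_rate(pred, truth):
    """FNR: fraction of object pixels that the prediction misses."""
    tp, _, fn, _ = confusion_counts(pred, truth)
    return fn / (tp + fn) if (tp + fn) else 0.0


# Toy example: the prediction is an eroded copy of the ground truth.
truth = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
eroded = [[0, 1, 0, 0],
          [0, 1, 0, 0]]
print(dice(eroded, truth),
      segmentation_accuracy(eroded, truth),
      false_negative_rate(eroded, truth))
```

In this toy case the eroded mask misses half of the object pixels, giving a DICE of 2/3, an SA of 0.75, and an FNR of 0.5. This is the same mechanism by which an eroded edge raises the FNR.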

Medical Image.
In this part, the proposed method was evaluated on a medical image. Because the medical image has a black background and the influence of speckle noise on the image is not obvious, the experiment was carried out on the image with Gaussian noise and with Salt & Pepper noise. In Figure 15, the top row shows a cell image contaminated with Gaussian noise, and the bottom row shows the same cell image corrupted with Salt & Pepper noise. Because of the noise, the C-V model became trapped in a local minimum near its initial contour and failed to converge. As a result, the segmentation results of the C-V model remain close to the initial contour. FCM produced wrong results because of its sensitivity to noise. FRFCM obtained the best result on the cell image corrupted with Salt & Pepper noise; however, Gaussian noise caused FRFCM to over-segment. WSM can ... Table 5 shows that the proposed method achieves a high FNR value, which implies undersegmentation. As shown in the bottom row of Figure 15, the proposed method has difficulty segmenting cells that are both small and low in contrast.
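The three noise types used in these experiments can be sketched in a few lines of Python. The paper does not state how the noise was generated, so the conventions below are assumptions: intensities are scaled to [0, 1], a "10% variance" is read as a variance of 0.1, a "10% density" is read as a per-pixel corruption probability of 0.1, and speckle is modeled as multiplicative zero-mean Gaussian noise. All function names are hypothetical.

```python
import random


def clip01(value):
    """Clamp an intensity to the assumed [0, 1] range."""
    return min(1.0, max(0.0, value))


def add_gaussian(img, variance, rng):
    """Additive zero-mean Gaussian noise with the given variance."""
    sigma = variance ** 0.5
    return [[clip01(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def add_speckle(img, variance, rng):
    """Multiplicative noise p + p*n, with n zero-mean Gaussian (assumed model)."""
    sigma = variance ** 0.5
    return [[clip01(p + p * rng.gauss(0.0, sigma)) for p in row] for row in img]


def add_salt_pepper(img, density, rng):
    """Replace each pixel with 0 or 1 (equal odds) with probability `density`."""
    noisy = []
    for row in img:
        new_row = []
        for p in row:
            if rng.random() < density:
                new_row.append(1.0 if rng.random() < 0.5 else 0.0)
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


# Toy usage on a 2x2 image with a fixed seed for repeatability.
rng = random.Random(0)
image = [[0.2, 0.5], [0.8, 1.0]]
noisy_versions = {
    "gaussian": add_gaussian(image, 0.1, rng),
    "speckle": add_speckle(image, 0.1, rng),
    "salt_pepper": add_salt_pepper(image, 0.1, rng),
}
```

Because speckle scales with the pixel intensity, it leaves a black background unchanged under this model, which is consistent with the observation that its effect on the cell image is not obvious.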

Conclusion
We have analyzed the properties of the NLTV spectral transform through theoretical proof and experiments. Our analyses demonstrate that an object in an image corrupted with various types of noise can be separated according to its size, contrast, and detailed structure. The analyses also show that objects with the same structure have similar descriptions in the NLTV spectral domain. Furthermore, we have developed a novel transform-based method that segments images based on the NLTV spectral transform. The approach first decomposes an image into many sub-bands in the NLTV spectral domain and uses the maximum response time to represent the image features. Then, to better divide the object and the background, the sub-bands in the NLTV spectral domain are filtered by fitting a separation surface, which is calculated from the maximum response time. Next, the filtered image is reconstructed by an inverse transform to obtain a rough segmentation result. Finally, the segmentation mask is calculated using post-processing methods. Subjective and objective evaluations show that the proposed method effectively preserves edge details while segmenting the object under a variety of noise types.
However, one limitation of the proposed method is its high computational cost, since computing the nonlocal operators requires a long time and a large amount of memory. The other limitation is the difficulty of fitting multiple separation surfaces accurately. In future work, we will attempt to solve these problems and develop a fast multiobject segmentation method.
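As an illustration of the separation-surface step, the sketch below fits a single surface to a time map by least squares. It assumes a planar model T(x, y) ≈ a·x + b·y + c, which is a simplification chosen for brevity; the surface model actually used by the method is defined by the paper's equations and is not reproduced here. The function names are hypothetical.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("degenerate system: pixels must not be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def fit_plane(t_map):
    """Least-squares fit of t_map[y][x] ~ a*x + b*y + c via the normal equations."""
    sxx = sxy = syy = sx = sy = sxz = syz = sz = 0.0
    count = 0
    for y, row in enumerate(t_map):
        for x, z in enumerate(row):
            sxx += x * x
            sxy += x * y
            syy += y * y
            sx += x
            sy += y
            sxz += x * z
            syz += y * z
            sz += z
            count += 1
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(count)]]
    return solve3(normal, [sxz, syz, sz])


def plane_values(coeffs, height, width):
    """Evaluate the fitted plane on a height x width pixel grid."""
    a, b, c = coeffs
    return [[a * x + b * y + c for x in range(width)] for y in range(height)]


# Toy usage: a time map that is exactly planar is recovered by the fit.
toy_map = [[2 * x + 3 * y + 1 for x in range(4)] for y in range(3)]
coeffs = fit_plane(toy_map)
surface = plane_values(coeffs, 3, 4)
```

A single global fit of this kind reduces to one small linear solve. Fitting several surfaces at once, as multiobject segmentation would require, also involves deciding which pixels belong to which surface, which is the harder problem noted above.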

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.