Optimized Parallelization for Nonlocal Means Based Low Dose CT Image Processing

Low dose CT (LDCT) images are often significantly degraded by severely increased mottled noise/artifacts, which can lead to lowered diagnostic accuracy in clinic. The nonlocal means (NLM) filtering can effectively remove mottled noise/artifacts by utilizing large-scale patch similarity information in LDCT images. But the NLM filtering application in LDCT imaging also requires high computation cost because intensive patch similarity calculation within a large searching window is often required to be used to include enough structure-similarity information for noise/artifact suppression. To improve its clinical feasibility, in this study we further optimize the parallelization of NLM filtering by avoiding the repeated computation with the row-wise intensity calculation and the symmetry weight calculation. The shared memory with fast I/O speed is also used in row-wise intensity calculation for the proposed method. Quantitative experiment demonstrates that significant acceleration can be achieved with respect to the traditional straight pixel-wise parallelization.


Introduction
X-ray Computed Tomography (CT) can reflect human attenuation map in millimeter level, in which rich 3D information of tissues, organs or lesions can be provided for clinical diagnosis. Though its wide application in clinics, the radiation delivered to patients during CT examinations is always a wide-spread concern. It was reported in [1] that CT radiation may increase the risk of developing metabolic abnormalities even cancer. The most practical means to lower radiation dose is to decrease tube current (milliampere (mA)) or tube current time products (milliampere second (mAs)). However, lowering mA/mAs settings often leads to degraded CT images with increased mottled noise and streak artifacts [2,3], which will influence the diagnosis accuracy [4][5][6][7].
Researchers can suppress noise and artifacts in low dose CT (LDCT) images by developing new reconstruction or postprocessing algorithms. Current solutions to improve the quality of LDCT images can be roughly divided into three categories: preprocessing approaches, iterative reconstruction approaches, and postprocessing approaches. The first one refers to those techniques that improve CT imaging by suppressing the noise in projected raw data before the routine FBP reconstructions. The key of these techniques is to find the accurate statistical distribution of projected data and design effective restoration algorithms [5,6]. The second one refers to iterative reconstruction approaches which treat the LDCT imaging as an ill-posed inverse problem and solve the problem as a prior-regularized cost function via some iterative optimization solutions [7,8]. Though effective in obtaining favorable reconstructed image quality, the most well-known limit for iterative reconstructions is the required intensive computation in iterative optimization. Additionally, for patent protection consideration, current mainstream CT device suppliers normally do not provide well-formatted projected data, which severely restricts the research and the possible clinical application of these two study directions.
The third one refers to postprocessing methods, which can be directly applied to improve LDCT images. Distribution and scale features of noise, artifacts, and normal tissues in CT images need to be jointly considered in designing effective postprocessing algorithms [9,10]. It was pointed in [9][10][11][12][13][14] that the nonlocal means (NLM) filtering, which utilizes the information redundancy property, can effectively suppress noise and artifacts without obviously blurring image details. We would also note that the patch similarity metric in NLM has also been used to build regularization term for tomographic reconstruction [15,16].
However, since noise and artifacts often distribute with prominent amplitudes in LDCT images, a large searching window is practically required to include more structure information in noise/artifact suppression, which implies a large computation cost. This will strongly limit its clinical application considering the large workload in current radiology departments. To overcome this, this paper presents an improved GPU-based parallelization approach to accelerate the NLM filtering. The proposed approach optimizes the computation in NLM filtering by avoiding repeated computation with row-wise intensity calculation and weight calculation. The fast / data access speed for shared memory in GPU is also well exploited to reduce data / operation cost. Experiment results on 2D LDCT images demonstrate that the improved parallelization can significantly shorten computation time and, thus, making itself a potentially applicable processing procedure in LDCT imaging.

Nonlocal Means Based Low Dose CT Image Processing
Compared to restoration algorithms based on intensity gradient information, NLM filtering can provide edge-preserving noise/artifact suppression without blurring image structures.
In NLM filtering, one image patch is matched with a group of similar patches in a large neighboring area, and in this way more structure similarity information in large neighboring scale can be used to suppress noise and artifacts in LDCT images. The NLM algorithm replaces pixel intensities by the weighted average of intensities within a searching window . Each weight expresses the similarity between the central pixel and the neighboring pixels in the searching window and is calculated by the Euclidian distance between patches surrounding these two pixels. Let ( = ( , )) denote the pixel to be processed, let denote a pixel in the search neighborhood window, let denote the processed image, and let denote the image to be processed; the 2D NLM filtering algorithm can be formulated as follows [17]: (Δ ,Δ ) , where denotes the searching window centered at ; ( , ) denotes the similarity of the two patches centered at and , respectively, with the radius ; (Δ , Δ ) is a distance dependent Gaussian kernel function; the number of pixels in a patch is (2 + 1)(2 + 1). We routinely use the parameter ℎ in (2) to control the smoothing effect.
In Figures 1-4, we give the NLM filtering processed results of four 2D LDCT images in Figures 1(a), 2(a), 3(a), and 4(a) and the two corresponding standard dose CT (SDCT) images are given in Figures 1(b), 2(b), 3(b), and 4(b) as references. The LDCT and SDCT images were collected using the reduced tube current 80 mA and the routine tube current 240 mA, respectively. We can see that CT images are mainly composed of pixels with limited intensity range, and the intensities representing different tissues spread over the whole image domain. Other scanning parameters were set by default. Compared to the reference SDCT images, we can see that tube current reduction will lead to severely increased noise and artifacts in LDCT images. Figures 1(c), 2(c), 3(c), and 4(c) illustrate the processing results of NLM filtering, in which the size of searching window is 81 × 81, patch size is 9 × 9 ( = 4), and parameter ℎ is set to 10. Figure 5 shows the results of 3D volumes by processing a set of 2D thoracic LDCT images. The illustrations in Figures 1-4 are presented in suitable windows. All the parameters were set under the guide of an experienced doctor in radiology department. We can see that the NLM filtering can effectively suppress both mottled noise and artifacts in LDCT images without leading to significantly blurred structures.
To highlight the importance of a large searching window, we also list in Figure 1(d) the NLM processed result with 21×21 searching window (other parameters are set to be same as the result in Figure 1(c)). We can see that processing with this smaller 21 × 21 searching window fails to give satisfying artifact suppression (see the arrows). Therefore, a large searching window (up to 81 × 81) needs to be used to in NLM filtering to include enough large-scale similarity information to suppress noise and artifacts in LDCT images, as can be seen in Figure 1(c) [10]. However, the patch similarity calculation within a large searching window is always accompanied with large computation load. For a × sized image, with the pixel number of searching window being | | and patch radius being , we get the computational complexity ( | |(2 + 1)(2 + 1)) for the original CPU based serial processing, and the total computational complexity amounts to (512×512× 81 × 81 × 9 × 9) = (1.3931 × 10 11 ) for 512 × 512 sized images. This computation cost is too high to provide real-time CT imaging for radiology department routine; so we need to accelerate the NLM filtering in order to give fast clinical application.

Introduction to CUDA-Based GPU Acceleration.
Utilizing GPU based techniques to parallelize algorithm has already become a notable trend in the field of parallel computing. The GPU based parallelization is achieved by jointly parallelizing coarse-scale patches and fine-scale threads in the original grid computation task which is parallelizable [18][19][20]. The CUDA (Compute Unified Device Architecture) technology provides a software platform for developers to design parallelized tasks with C-style code, with direct access to the virtual instruction set and GPU memories. Each parallelization function running on CUDA is called a kernel, and we use ( ) ∘ ( −1) ∘ ⋅ ⋅ ⋅ ∘ (1) to represent the connected parallelized cascade functions based on [20]. The output of kernel function ( ) is the input of ( +1) . We use ( ( ) 1 , . . . , ( ) ) to represent the input data of kernel function ( ) in processing and ( ( +1) 1 , . . . , ( +1) ) to represent the output data of function ( ) . We can use (5) to represent the kernel function as follows: where represents the pixel position in image. It should be noted that in some cases not all the input data need to be updated (e.g., ( +1) = ( ) ).

Conventional GPU Based Parallelization for NLM Filtering Algorithm.
The conventional GPU based parallelization accelerates the NLM filtering algorithm by direct pixel-wise parallelization. Based on above (1)-(4), we routinely break the algorithm into four parts specified by the following (6)- (9), which are computed in loops. The number of loops is set as the searching window size | | to traverse all the neighboring points in window . The first kernel function (6) computes intensity differences in parallel via GPU and has computational complexity (1). Here we quantify the computational complexity using the operation times in parallel. In (6), ( + , + ) denotes the spatial position of the neighboring pixel in the searching window centered at , which can also be represented by spatial position ( , ). We set (1) For data 2 , the second kernel function computes the patch similarity using (7) based on the patch difference computed by (6)  complexity of the second kernel function is around ((2 + 1)(2 + 1)) because there are (2 + 1)(2 + 1) weighted summation operations for each pixel pair in the two comparing patches. Consider ℎ (2 + 1) (2 + 1) ) ) .

Improved GPU Acceleration for NLM Filtering Algorithm.
In the above conventional parallelization approach, the second kernel function in (7) is serially applied to compute the patch similarity, which leads to large computation cost when large searching window is used. Our first improvement is, thus, devoted to reduce the computational complexity in this part. Figure 6 illustrates that a patch is of size 5 × 5 ( = 2) with the red point indicating the center point. Equations (1)- (4) show that the patch similarity in NLM filtering can be quantified by the weighted sum of intensity differences of the corresponding pixels in the two patches. In Figure 6, we can see that, for the center points located at the green points in the two patches, the summed intensity difference of the blue points in the same rows is in fact the same value as the case when the center points moves down to the red points. This implies that the intensity differences of rows are repeatedly computed when the center points move within ( + 1) pixel distances. Therefore, we can efficiently calculate patch differences via the following row-wise calculation: where the intensity difference between two individual pixels is | ( +Δ , )− ( +Δ , )| and = ( , ) represents the neighboring point positions in the searching window. ⋃ Δ ∈[0,..., ] denotes the dataset that includes the ( + 1) different points in the vertical direction. Thus, with patches of size (2 + 1) × (2 + 1), we know from (10) that ( + 1) values can be obtained for ( + 1) different row pairs. The row difference needs to be calculated only once before being stored in the shared memory, and the other operations in (10)  cost of (10) lies in the data accessing operation of the global memory because the time cost in shared memory accessing is trivial when compared to global memory accessing. The computational complexity of (11) can be roughly estimated to be (2 + 1) [21]. Similar to the conventional GPU parallelization, we also divide the algorithm into the following four parts (11)- (14) and compute in loops. Suppose that the input image is of size × , we set the size of ( ) 1 to be × × ( + 1). The data ( ) 2 ,    The initialization is also set with data (1) 3 = (1) 4 = 0, ( ) 5 = . 8

Computational and Mathematical Methods in Medicine
The first kernel function computes the sum of the intensity differences for each row pair, which is multiplied by the Gaussian weight calculated based on the perpendicular distance from the row to the center point. The computational complexity of this kernel function is (2 + 1). Consider ( + Δ , ) − ( + + Δ , + ) (Δ , Δ ) ) .

(11)
The second kernel function calculates the similarity of patches based on (12). In (12), we compute the patch similarity by accumulating the absolute values of the weighted sum of the intensity differences calculated via the first kernel function (11). The computational complexity of the kernel function (12) is (2 + 1). Consider ℎ (2 + 1) (2 + 1) ) ) .
Then, a final kernel function (14) can be applied to obtain the finally processed image: .
The final loop number with respect to the required operation number as to the searching window now becomes (2 + 1) × ( + 1) + 1 ( denotes the radius of the searching window), which is approximately 0.5| |. The final output image iŝ( ) = ( ) 4 ( )/ ( ) 3 ( ). To conclude, the total computational complexity of the improved algorithm is around (0.5| |(2(2 + 1) + 1) + 1), which approximately equals to (| |(2 +1)+1). We can see that the computational complexity has been reduced to 1/(2 + 1) with respect to the conventional parallelization.  Figure 8: Comparison of the computation time with respect to searching window size for the conventional parallelization algorithm and the improved parallelization algorithm.

Experimental Results and Analyses
In this section we compare the computation cost of different methods. To verify the improvement brought by the proposed acceleration method for the NLM filtering, we process the same 512 × 512 LDCT image in Figure 1(a) using the serial algorithm (CPU based), the conventional parallelization algorithm (GPU based), and the improved parallelization algorithm (GPU based). In this section, we do not illustrate the processed images because the same images as Figure 1(a) were obtained. We set the patch size to be 9 × 9 and record the computation time with respect to the size of searching window. Figure 7 illustrates the computation time of the serial algorithm and the conventional parallelization algorithm. We can observe that the conventional parallelization significantly reduces the computation cost through straight pixelwise parallelization and achieves an acceleration ratio of more than 100 times of the original serial algorithm. The system configuration for our experiments is given as follows. Then, we compare the computation time of the conventional parallelization algorithm and the improved parallelization algorithm with respect to the size of searching window. The patch size is fixed to 9 × 9. As can be seen in Figure 8, when the searching window size becomes larger than 41 × 41, the acceleration ratio approximately equals to 2 + 1 = 2 ×   4 + 1 = 9, and this observation is consistent with the above deduced acceleration ratio 2 + 1. In addition, we compare the computation time of the conventional parallelization algorithm and the improved parallelization algorithm with respect to patch size. The searching window size is fixed to 81 × 81. Since a large patch size often leads to blurred images, we hereby set the maximal patch size to be 15 × 15. We can observe in Figure 9 an obvious increment of acceleration ratio when the patch size increases, and this again verifies the above deduced acceleration ratio 2 + 1.

Discussion and Conclusion
In this paper we further optimize the parallelization for NLM filtering in CT image processing. The proposed approach optimizes the parallelized computation in NLM filtering by avoiding repeated computation with row-wise intensity calculation and weight calculation. The fast / data access speed for shared memory in GPU is also well exploited. We applied our improved algorithm to LDCT image processing and find that the improved algorithm can achieve a significant acceleration ratio with respect to the conventional parallelization algorithm. For now, it takes about 0.8 second to process one 512×512 CT image with 81×81 searching window and 9× 9 patch, and this parameter setting in NLM filtering is found to be able to provide effective processing. This paper only provides the results on 2D NLM filtering, and we would stress that the same parallelization strategy can be easily extended to accelerate the more computationally intensive 3D NLM filtering, and the same acceleration ratio as 2D case can be expected because they have the same calculation structures.
To be specific, this extension can be realized by replacing the row-wise optimization in (11) by a plane-wise optimization. Nevertheless, we would also point it out that 3D NLM filtering should not be suggested for the processing of CT slices with large slice thickness (>2 mm) because of the poor interslice continuity in this case. Currently, the structure similarity idea in NLM has got wide applications in the other field of image processing (e.g., image segmentation and image reconstruction) [15,16,22,23]. The proposed parallelization optimization can be directly applied to accelerate the patch similarity calculation in these applications. In current parallelization approach, the weights reflecting patch similarity are calculated via a serial loop in (7), which can be further parallelized via interkernel operations to realize a further acceleration. Accelerating the computation speed by combining multiple-core CPU strategy with GPU parallelization technique will also be explored. This optimization strategy can be easily used to accelerate other reconstruction or restoration tasks using the patch similarity type metrics [24][25][26]. We can also consider improving the performance of the NLM filtering by incorporating the fractional metric into the calculation of patch similarity [27]. Evaluation of the potential accuracy enhancement in segmentation/registration (related with CT images) that can be brought by the proposed processing also needs to be performed [28][29][30]. All these issues will be addressed in the future work.