A Fast CT Reconstruction Scheme for a General Multi-Core PC

Expensive computation is a severe limitation in CT reconstruction for clinical applications that need real-time feedback. A primary example is bolus-chasing computed tomography (CT) angiography (BCA), which we have been developing for the past several years. To accelerate the reconstruction process using the filtered backprojection (FBP) method, specialized hardware or graphics cards can be used. However, specialized hardware is expensive and inflexible, while the graphics processing unit (GPU) in a current graphics card can only reconstruct images with reduced precision and is not easy to program. In this paper, an acceleration scheme is proposed based on a multi-core PC. The proposed scheme integrates several techniques, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and the Intel C++ compiler. Our scheme maintains the original precision and involves no data exchange between the GPU and CPU. Its merits are demonstrated in numerical experiments against the traditional implementation. Our scheme achieves a speedup of about 40, which can be further improved by several folds using the latest quad-core processors.


INTRODUCTION
CT imaging has seen tremendous development over the past decades. Now, it is widely used in the medical imaging field. However, due to the high computational cost required for reconstruction, its real-time imaging applications [1] remain challenging. Bolus-chasing computed tomography (CT) angiography is a primary example which demands real-time CT feedback.
To address this problem, various techniques have been used for fast image reconstruction, and a number of commercial hardware-based solutions are available. For example, XTrillion (by TeraRecon, Inc.) uses an application-specific PCI card, while Mercury Computer Systems relies on blade-based Linux clusters. However, such specialized hardware is expensive and unsuitable for general-purpose applications. Alternatively, efforts have been made using graphics cards [2,3], since the main operation in commercial CT reconstruction is backprojection, which is similar to texture mapping in computer graphics [4]. Although graphics cards are highly optimized, earlier GPUs did not support floating-point calculations and were therefore not ideal for medical imaging applications. Although the latest graphics cards can implement virtual floating-point calculations [3,5], they still do not support full 32-bit floating-point computation. Another bottleneck is that graphics cards require data exchange between the CPU and GPU.
In this paper, a multi-core PC-based acceleration scheme is proposed for filtered-backprojection-(FBP-) based image reconstruction. This scheme reduces the computational cost while maintaining image quality. Our scheme integrates the following techniques for fast image reconstruction. First, geometric symmetry is exploited to eliminate redundant operations; that is, only one computation is performed for multiple symmetric positions. Second, efficient data structures are used to minimize the data access time. Third, the single-instruction multiple-data (SIMD) technique is employed for data-level parallel processing. Fourth, multithreaded programming is used to take advantage of multi-core processors, realizing true parallel computation. Finally, the Intel C++ compiler is used to optimize the code for Intel processors.
This paper is organized as follows. In Section 2, the CT reconstruction algorithm is overviewed, and then each of our acceleration techniques is described. In Section 3, numerical experiments on various datasets and different PCs are presented to evaluate the speedups of our scheme over the conventional implementation. In Section 4, relevant issues and further research directions are discussed.

CT reconstruction algorithm
The most popular multislice CT reconstruction methods remain rebinning-based fan-beam filtered backprojection (FBP) algorithms. Therefore, our work is focused on the typical fan-beam FBP algorithm. Note that the application of our scheme is not limited to the fan-beam case; it can also be applied to accelerate the latest approximate cone-beam algorithms [6][7][8], which can be treated as generalized fan-beam reconstruction algorithms.
In a typical CT setting, the data acquisition system (an X-ray source and a detector assembly) is rotated rapidly in the gantry while the patient on a table is translated into the gantry opening. This process is illustrated in Figure 1. Because the multi-row detector arrays span a very small cone angle, acquired helical scan data are usually rebinned into a series of virtual circular scan data for reconstruction of a stack of images [9]. Here we assume a method from [10], in which the virtual fan-beam projection data are calculated according to the following formula:

    p_c(γ, β) = [z_b p_h(γ, β_a) + z_a p_h(γ, β_b)] / (z_a + z_b),    (1)

where β_a = β + k_a × 2π and β_b = β + k_b × 2π, k_a, k_b ∈ N, are chosen so that the two helical projections a and b bracket the virtual circular plane at z_0, p_c denotes the virtual circular scan data, p_h denotes the acquired helical projection data, z_a and z_b are the distances from projections a and b to the virtual circular plane, respectively, in Figure 2, and β and γ are the projection angles shown in Figure 3. After transforming the helical projection data p_h to circular fan-beam projection data p_c, the conventional fan-beam reconstruction algorithm [11] can be used. As the rebinning cost is insignificant, our optimization targets the reconstruction process:

    f(x, y) = ∫_0^{2π} (1/L²) p_{c,f}(γ_0, β) dβ,  with  p_{c,f}(γ_0, β) = ∫ p_c(γ, β) h(γ_0 − γ) D cos γ dγ,    (2)

where f is the object function to be reconstructed, p_{c,f} are the filtered projection data, D is the distance from the source to the center of rotation, L is the distance from the source to the point (x, y), γ_0 is the fan angle of the ray through (x, y), and h(γ) is the ramp filter [11]. While the inner convolution is the filtration process, the outer integration is the most time-consuming backprojection process, as shown in Figure 4(a).
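To make the rebinning step concrete, the per-sample interpolation can be sketched as follows; the function name and argument layout are our own illustration, not code from [10]:

```cpp
#include <cassert>

// Sketch of helical-to-circular rebinning for one sample: the virtual
// circular-scan value is linearly interpolated from the two helical
// projections p_a and p_b, acquired at distances z_a and z_b on either
// side of the virtual circular plane. The projection closer to the
// plane receives the larger weight.
float rebin_sample(float p_a, float p_b, float z_a, float z_b) {
    return (z_b * p_a + z_a * p_b) / (z_a + z_b);
}
```

When the plane coincides with projection a (z_a = 0), the sample reduces to p_a, as expected.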

Acceleration techniques
Since the backprojection is the bottleneck, let us analyze the backprojection process as shown in Algorithm 1. Clearly, a large part of the computational cost is due to the inner loop, which calculates γ_0, 1/L², and the interpolation coefficients, and accumulates the incremental contributions to the final reconstruction. In the following, we show how the backprojection can be sped up using various techniques.
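For reference, the unoptimized inner loop of Algorithm 1 might be realized in C++ roughly as follows. This is a sketch only: the array names, the channel sampling convention, and the geometry expressions (assuming the source at view angle β sits at (D cos β, D sin β)) are our own assumptions, not the paper's code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Naive fan-beam backprojection of one view angle beta (cf. Algorithm 1).
// proj holds filtered projection data for this view, sampled at fan angles
// gamma = (c - Nc/2) * dGamma; img is an Nx x Ny image accumulated in place.
// D is the source-to-isocenter distance, pixSize the pixel spacing.
void backproject_view(std::vector<float>& img, int Nx, int Ny, float pixSize,
                      const std::vector<float>& proj, int Nc, float dGamma,
                      float D, float beta, float dBeta) {
    float sb = std::sin(beta), cb = std::cos(beta);
    for (int iy = 0; iy < Ny; ++iy) {
        for (int ix = 0; ix < Nx; ++ix) {
            float x = (ix - Nx / 2) * pixSize;
            float y = (iy - Ny / 2) * pixSize;
            // Decompose the source-to-pixel vector along the central ray
            float u = D - x * cb - y * sb;  // along source -> isocenter
            float v = x * sb - y * cb;      // perpendicular component
            float L2 = u * u + v * v;       // squared distance L^2
            float gamma0 = std::atan2(v, u);
            // Linear interpolation at channel gamma0
            float c = gamma0 / dGamma + Nc / 2;
            int c0 = (int)std::floor(c);
            if (c0 < 0 || c0 + 1 >= Nc) continue;  // ray outside the fan
            float w = c - c0;
            float p = (1 - w) * proj[c0] + w * proj[c0 + 1];
            // Weight by dBeta / L^2 and accumulate
            img[iy * Nx + ix] += p * dBeta / L2;
        }
    }
}
```

Every pixel of every view repeats the trigonometry, the interpolation, and the weighting; the techniques below remove most of this redundancy.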

Utilization of geometric symmetry
    for every projection angle β ∈ [0, 2π) (increasing step Δβ)
        for every y coordinate
            for every x coordinate
                Calculate L of pixel (x, y);
                Calculate γ_0 of pixel (x, y);
                Calculate projection data p_{c,f}(γ_0, β) at channel γ_0 by linear interpolation;
                Weight the interpolated projection data by Δβ/L², that is, p_{c,f,w} = p_{c,f} Δβ/L²;
                Accumulate the weighted and interpolated projection data p_{c,f,w} to pixel (x, y);
            end
        end
    end

Algorithm 1: Pseudocode for the backprojection.

International Journal of Biomedical Imaging

For our circular fan-beam reconstruction, two types of symmetries are available, which are referred to as the right-angle symmetry and the complement symmetry. The right-angle symmetry, or 90-degree symmetry, is shown in Figure 5. That is, a new pair of source and pixel positions is obtained by applying a 90-degree rotation to a current pair of source and pixel positions. The resultant 4 pairs of source and pixel positions share the same 1/L² and γ_0, which can be calculated from (3) and (4), respectively. As the interpolation coefficients required by the backprojection are determined by γ_0, they are the same as well. Therefore, for the four sets under consideration, the calculations of these parameters need to be done only once:

Right-angle symmetry:

    L(x_p, y_p, β) = L(−y_p, x_p, β + π/2) = L(−x_p, −y_p, β + π) = L(y_p, −x_p, β + 3π/2),    (3)
    γ_0(x_p, y_p, β) = γ_0(−y_p, x_p, β + π/2) = γ_0(−x_p, −y_p, β + π) = γ_0(y_p, −x_p, β + 3π/2),    (4)

where (x_p, y_p) denotes the pixel in the first quadrant. The following two requirements, which are usually satisfied in practice, are necessary to use the right-angle symmetry. The first requirement is that the projection data must be available at the involved four angles; namely, the number of projections in a full scan must be divisible by 4, which is reasonable for current medical CT scanners. For instance, a SOMATOM system generates 1160 projections per turn, while a Lightspeed scanner produces 984 projections per turn. The other requirement is that the reconstruction region must be symmetric about the x- and y-axes, such as a square or a circle in the clinical imaging situation.
The second type of symmetry is the complement symmetry, as shown in Figure 6. Here, a pair of source and pixel positions complements another pair if the two are symmetric with respect to a diagonal line (e.g., y = x). For these 2 pairs of source and pixel positions, the values of L are the same, while γ_0 for one has the opposite sign of that for the other, as shown by (5) and (6), respectively:

    L(x_p, y_p, β) = L(y_p, x_p, β_comp),    (5)
    γ_0(x_p, y_p, β) = −γ_0(y_p, x_p, β_comp),    (6)

where β_comp denotes the view angle of the source position mirrored about the line y = x. Therefore, such a symmetry can also be used to reduce the computational cost. The requirements for use of the complement symmetry are the same as those for the right-angle symmetry.
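Both symmetries are easy to verify numerically. The helper functions below are our own illustration, assuming the source at view angle β sits at (D cos β, D sin β); a 90-degree rotation of a source-pixel pair leaves L and γ_0 unchanged, while mirroring about y = x preserves L and flips the sign of γ_0 (under this convention the mirrored view is π/2 − β).

```cpp
#include <cassert>
#include <cmath>

const double D  = 100.0;  // source-to-isocenter distance, arbitrary units
const double PI = 3.14159265358979323846;

// Squared source-to-pixel distance L^2 for pixel (x, y) at view angle beta
double L2(double x, double y, double beta) {
    double dx = x - D * std::cos(beta), dy = y - D * std::sin(beta);
    return dx * dx + dy * dy;
}

// Signed fan angle gamma_0 between the central ray (source -> isocenter)
// and the ray from the source through the pixel
double gamma0(double x, double y, double beta) {
    double sx = D * std::cos(beta), sy = D * std::sin(beta);
    double ax = -sx, ay = -sy;        // central ray direction
    double bx = x - sx, by = y - sy;  // ray through pixel (x, y)
    return std::atan2(ax * by - ay * bx, ax * bx + ay * by);
}
```

Because a rotation is rigid, it preserves both distances and signed angles; a reflection preserves distances but reverses orientation, which is exactly the sign flip of γ_0.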
Using these two types of symmetries, the backprojection can be significantly sped up, since only one set of parameters needs to be calculated for the eight sets. The implementation of the backprojection is accordingly modified, as shown in Algorithm 2. Note that after γ_0 and L are calculated once for the 8 pairs of source and pixel positions, 8 filtered projection values are deposited at the 8 pixel positions together in the inner loop.

Optimization of data structures
To evaluate the computational complexity, the time for the CPU to access data must be considered, especially in CT reconstruction, because the backprojection requires frequent access to a large amount of filtered projection and image data. The CPU data access mechanism with multilevel caches is illustrated in Figure 7. Specifically, a cache can be used to reduce the average time to access data in the main memory (RAM). The cache is a smaller, faster memory chip which stores copies of data from the most frequently used main memory locations. As long as a majority of memory accesses are to the cached memory locations, the average latency of memory accesses will be reduced to the cache latency, instead of the main memory latency. The L1 cache is the fastest and usually about 16 ∼ 32 KB. The L2 cache is faster than RAM and about 1 ∼ 2 MB. The RAM, the slowest level, is typically 1 ∼ 4 GB.
Kai Zeng et al.

    for every projection angle β ∈ [0, π/4) (increasing step Δβ)
        for every y coordinate
            for every x coordinate
                Calculate L of pixel (x, y);
                Calculate γ_0 of pixel (x, y);
                For Sets 1, 2, 3, 4:
                    Calculate projection data p_{c,f}(γ_0, β) at channel γ_0 by linear interpolation;
                    Weight the interpolated projection data by Δβ/L², that is, p_{c,f,w} = p_{c,f} Δβ/L²;
                    Accumulate p_{c,f,w} to the corresponding pixel;
                γ_0 = −γ_0;
                For complementary Sets 1c, 2c, 3c, 4c:
                    Calculate projection data p_{c,f}(γ_0, β) at channel γ_0 by linear interpolation;
                    Weight the interpolated projection data by Δβ/L², that is, p_{c,f,w} = p_{c,f} Δβ/L²;
                    Accumulate p_{c,f,w} to the corresponding pixel;
            end
        end
    end

Algorithm 2: Pseudocode for the backprojection with the right-angle and complement symmetries.

When the processor needs to read from or write to the main memory, it first checks if the data is in the cache. If it is in the cache, we say that a cache hit has occurred; otherwise a cache miss is counted. In the case of a cache hit, the processor immediately reads or writes the data. However, in the case of a cache miss, it takes a much longer time to access the data. Due to the limited cache capacity, one way to execute the code efficiently is to increase the hit rate by optimizing the data structures. Usually, projection data are sequentially stored in the order of β, while reconstructed images are stored row-wise. Thus, for implementation of the right-angle and complement symmetries, the access to 8 pairs of projection and image data will very likely result in cache misses due to the address gaps, as shown in Figure 8. Such misses within the inner loop will cause a significant latency. To address this problem, in our optimized data structures all the data are arranged into blocks indexed to reflect the symmetric relationships. Therefore, the cache miss rate can be greatly reduced in the inner loop.
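As a sketch of such a symmetry-indexed layout (the function and block order are our own illustration, not necessarily the paper's exact arrangement), the filtered projection data of the four right-angle views can be interleaved channel by channel, so that the inner loop reads one contiguous group of values:

```cpp
#include <cassert>
#include <vector>

// Interleave the projection values of the 4 right-angle views
// beta, beta + pi/2, beta + pi, beta + 3*pi/2 channel by channel, so the
// inner loop reads 4 consecutive floats instead of 4 addresses separated
// by large gaps (nViews must be divisible by 4, e.g., 1160).
std::vector<float> interleave_views(const std::vector<float>& proj,
                                    int nViews, int nChannels) {
    int quarter = nViews / 4;
    std::vector<float> out(proj.size());
    for (int b = 0; b < quarter; ++b)
        for (int c = 0; c < nChannels; ++c)
            for (int k = 0; k < 4; ++k)
                out[(b * nChannels + c) * 4 + k] =
                    proj[(b + k * quarter) * nChannels + c];
    return out;
}
```

The reordering is done once per dataset, so its cost is negligible next to the savings inside the inner loop.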

SIMD technique
The SIMD technique enables data-level parallelism as in a vector processor, as shown in Figure 9. With an SIMD processor, one instruction can process a block of data at a time instead of just one datum, which is much more efficient than the conventional single-instruction single-data (SISD) approach. Small-scale (64- or 128-bit) SIMD operations are now widely supported by general PC CPUs, such as those from Intel and AMD [12,13]. We use the Intel SSE (streaming SIMD extensions) instruction set to implement the SIMD technique in our backprojection process. Within the inner loop, we backproject 8 projection data onto 8 pixels using the same instructions, such as interpolation, weighting, and accumulation. Therefore, we have a perfect situation in which to employ the SIMD technique. As SSE only supports simultaneous processing of 4 floating-point data at a time (128-bit registers), the 8 data are processed in two groups.
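The weighting-and-accumulation step maps directly onto SSE. Below is a minimal sketch using the SSE intrinsics (the surrounding interpolation and bookkeeping are omitted, and the function name is our own):

```cpp
#include <cassert>
#include <xmmintrin.h>  // Intel SSE intrinsics

// One SIMD step of the inner loop: pix[i] += p[i] * w[i] for 4 pixels at
// once, where p holds interpolated projection values and w the per-pixel
// weights dBeta / L^2. With the 8-fold symmetry, this is issued twice
// (two groups of 4) per inner-loop pass.
void accumulate4(float* pix, const float* p, const float* w) {
    __m128 vp = _mm_loadu_ps(p);    // load 4 projection values
    __m128 vw = _mm_loadu_ps(w);    // load 4 weights
    __m128 vx = _mm_loadu_ps(pix);  // load 4 pixel accumulators
    vx = _mm_add_ps(vx, _mm_mul_ps(vp, vw));
    _mm_storeu_ps(pix, vx);         // write the 4 updated pixels back
}
```

With the optimized, gap-free data layout, the loads and stores operate on contiguous memory, which is exactly what the packed instructions favor.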

Multithreaded programming
In recent years, further increases in the clock speed of PC processors have become difficult: Intel is bounded by about 4 GHz, while AMD stays under 3 GHz. Manufacturers' efforts have now shifted from improving the clock speed to increasing the number of cores within a processor, and dual-core and quad-core processors have become commercially available for PCs. However, a processor with more than one core cannot achieve better performance unless parallel computation schemes are applied. Therefore, to take advantage of multi-core processors, multithreaded programming must be done.
From the flowchart of our algorithm, the computation within the inner loop is independent. Thus, the backprojection can be implemented in parallel by assigning different loop ranges to various cores of the processor. After all the threads are finished, the final result can be assembled from the results of each thread. In our implementation (Figure 4(b) and Algorithm 2), we divide the loop over y instead of the loop over β: the number of cores on a PC is usually 2, 4, or 8, and N_β/4 often does not divide evenly among them, whereas N_y is always 256, 512, or 1024 and does. Our parallel implementation on a multi-core PC is also more efficient than one on a PC cluster in terms of the time required for data exchange between threads: in our case, the data exchange is via the on-board RAM bus, while a PC cluster's data exchange via a local network is significantly slower.

Figure 8: Conventional projection and image data storage, with address gaps between the symmetric views β, β + π/2, β + π, β + 3π/2 and their complements β_comp, β_comp + π/2, β_comp + π, β_comp + 3π/2, versus the optimized projection and image data storage with no gaps.
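The y-loop split can be sketched with standard C++ threads; `std::thread` is our substitution for illustration, since the paper does not specify its threading API:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Split the y loop of the backprojection into equal bands of rows, one per
// worker thread. Each thread writes a disjoint set of rows, so no locking
// is needed; joining the threads "assembles" the final image.
// work(y0, y1) processes rows [y0, y1).
template <typename RowWork>
void parallel_rows(int Ny, int nThreads, RowWork work) {
    std::vector<std::thread> pool;
    int band = Ny / nThreads;  // Ny = 256/512/1024 divides evenly by 2, 4, 8
    for (int t = 0; t < nThreads; ++t) {
        int y0 = t * band;
        int y1 = (t == nThreads - 1) ? Ny : y0 + band;
        pool.emplace_back(work, y0, y1);
    }
    for (auto& th : pool) th.join();
}
```

Because the bands are disjoint, the threads share only read-only projection data, so no synchronization is needed inside the loop.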

Intel C++ compiling
The Intel C++ Compiler creates applications that run at high speed on Intel processors, taking full advantage of the processor features when compiling code and generating object files. The Intel C++ compiler can be coupled with Visual Studio, providing an integrated development environment. In our implementation, we use it to optimize the code for the Pentium D and Xeon processors.

NUMERICAL EXPERIMENTS
To test the gain of our scheme, we ran our accelerated code on Pentium D and Xeon PCs, whose configurations are listed in Table 1. In addition, different sizes of projection datasets and reconstructed images were tested to evaluate the efficiency under various conditions. To test the effectiveness of each technique, the reconstruction times and speedups were measured by applying the techniques incrementally. These experiments were done on our HP6200 workstation, reconstructing a 512 × 512 image from a projection dataset (1160 × 672). The reconstruction results obtained by applying the techniques step by step are given in Table 2. The overall speedup and the individual speedup of each technique were also calculated to show their efficiency.
The speedup results for different projection datasets and image matrix sizes are shown in Tables 3 and 4 and Figure 10. The results on different computers are consistent, and significant speedups were achieved using our scheme. In the case of 1160 views and a 512 × 512 image, the reconstruction time was decreased from 52 seconds to 1.35 seconds. For a one-core computer, the speedup was more than 20 times; for a two-core computer, the speedup was almost 40 times when 2 threads were used.
The Shepp-Logan head phantom was used in our numerical experiments. The images reconstructed using our scheme and the conventional method are shown in Figure 11. All the images were reconstructed using 32-bit floating-point data and were displayed in the same window [0.97, 1.05]. The images reconstructed using our accelerated and conventional schemes are essentially the same.

DISCUSSION AND CONCLUSIONS
As the CT reconstruction algorithm is highly parallelizable, the speedup can be improved almost linearly with more cores. For example, with two quad-core processors, the achievable speedup is more than 100. Compared to other acceleration techniques, such as those based on specialized hardware and graphics cards, our general-purpose PC-based scheme is much cheaper without compromising image quality. For example, a general-purpose HP or Dell workstation with top-line dual quad-core processors and 8 GB RAM costs less than $7000. All calculations are based on 32-bit floating-point data, providing sufficient accuracy for medical imaging applications. In terms of the absolute reconstruction time for a 512 × 512 image from 1160 projection views, it has been decreased from 52 to about 1.25 seconds. If the latest multi-core processors are used, the total time can easily be decreased by several folds. As the computers we used still carry the previous generation of processors, the potential improvement is at least 5 times if we are equipped with the latest quad-core processors [14]; that is, the reconstruction time may be reduced to 0.3 second. Hence, our scheme is quite promising for real-time CT applications, such as our project on bolus-chasing CT angiography.
In conclusion, our acceleration scheme has integrated several techniques, including utilization of geometric symmetry, optimization of data structures, single-instruction multiple-data (SIMD) processing, multithreaded computation, and the Intel C++ compiler. As a result, it has sped up the reconstruction process by 40 times compared to the conventional implementation on a general-purpose PC with 2 cores. Further work is in progress to improve our results using the latest PCs and to extend our scheme to cone-beam reconstruction.