Reconstruction speed remains a severe limitation for the clinical application of 3D cone-beam CT. The computational power of modern graphics processing units (GPUs) has been harnessed to provide impressive acceleration of 3D volume image reconstruction. When the data volume exceeds the physical graphics memory of the GPU, a straightforward compromise is to divide it into blocks. In contrast to the conventional octree partition method, this paper proposes a new partition scheme. The method divides both the projection data and the reconstructed image volume into subsets according to the geometric symmetries of the circular cone-beam projection layout. Fast reconstruction of large data volumes is then implemented by packing the subsets of projection data into the RGBA channels of the GPU, performing the reconstruction chunk by chunk, and combining the individual results at the end. The method is evaluated by reconstructing 3D images from computer-simulated data and real micro-CT data. Our results indicate that the GPU implementation maintains the original precision and speeds up the reconstruction process by a factor of 110–120 for a circular cone-beam scan, compared with a traditional CPU implementation.

Computed Tomography (CT) has become one of the most popular diagnostic modalities since its invention thirty years ago. Compared with 2D parallel-beam and fan-beam CT, a 3D cone-beam CT system can achieve higher spatial resolution and better utilization of photons [^{3} image array and is 4 GB for a 1024^{3} image array. Data sizes in the gigabyte range are huge even for a graphics workstation. Currently, image reconstruction speed is still a bottleneck for the development of 3D cone-beam CT. The study of fast and efficient reconstruction algorithms for large image volumes, and of their implementation in hardware or software, is therefore significant both theoretically and practically [

The Graphics Processing Unit (GPU) can process volume data in parallel when working in single instruction multiple data (SIMD) mode [

Back in the 1990s, only high-end workstations, such as the SGI Octane or Onyx, had the level of graphics hardware necessary for CT image reconstruction. Cabral et al. were the first to employ this hardware for the acceleration of CT reconstruction [^{3}. Mueller and Xu have studied CT image reconstruction for large volume data [

This paper is organized as follows. In Section

The filtered back-projection algorithm proposed by Feldkamp, Davis and Kress (FDK) for 3D volume reconstruction from circular cone-beam projections remains one of the most widely used approaches [

Current GPUs can be used either as a graphics pipeline or, through the CUDA interface from NVIDIA, as a multiprocessor chip. With either option, the acceleration factor of the GPU is high. Xu and Mueller have observed that an implementation of cone-beam back-projection using the graphics pipeline is 3 times faster than one made with the CUDA interface [

Axis-aligned stack of 2D-textured slices for representing reconstructed volume. (a) along _{v}

Figure

Geometry of back projection. The slice under reconstruction has each filtered x-ray image projected onto it by projective texture mapping.

Here is our detailed algorithm for backward-projecting a projection image to a volume slice based on GPU. As shown in Figure

In formula (

By implementing the calculation of formula (_{p}

The above procedure is repeated until every volume slice has been processed from every projection view, at which point the entire reconstructed volume has been updated.
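The loop described above can be sketched on the CPU as follows. This is only an illustrative sketch: it assumes the textbook circular cone-beam geometry (source at distance `sod` from the rotation axis, flat detector at distance `sdd` from the source) rather than the transform matrix used in this paper, and the names `project_voxel`, `sample_bilinear`, and `backproject_view` are hypothetical. On the GPU, the per-voxel projection is performed by projective texture mapping and the interpolation by the texture unit.

```python
import math


def project_voxel(x, y, z, theta, sod, sdd):
    """Map voxel (x, y, z) to detector coordinates (u, v) for the view at angle theta.

    Assumed geometry: source at (sod*cos(theta), sod*sin(theta), 0), flat detector
    perpendicular to the central ray at distance sdd from the source.
    Returns (u, v, weight), where weight is the FDK distance weight (sod / depth)**2.
    """
    t = x * math.cos(theta) + y * math.sin(theta)    # component toward the source
    s = -x * math.sin(theta) + y * math.cos(theta)   # component along detector rows
    depth = sod - t                                  # source-to-voxel distance along the central ray
    return sdd * s / depth, sdd * z / depth, (sod / depth) ** 2


def sample_bilinear(image, u_idx, v_idx):
    """Bilinear lookup in image[row][col]; returns 0 outside the interpolable interior."""
    rows, cols = len(image), len(image[0])
    c0, r0 = math.floor(u_idx), math.floor(v_idx)
    if c0 < 0 or r0 < 0 or c0 + 1 >= cols or r0 + 1 >= rows:
        return 0.0
    fu, fv = u_idx - c0, v_idx - r0
    top = (1 - fu) * image[r0][c0] + fu * image[r0][c0 + 1]
    bottom = (1 - fu) * image[r0 + 1][c0] + fu * image[r0 + 1][c0 + 1]
    return (1 - fv) * top + fv * bottom


def backproject_view(slice_buf, z, proj, theta, sod, sdd, voxel, pixel):
    """Accumulate one filtered projection into one axial slice, in place."""
    n = len(slice_buf)
    rows, cols = len(proj), len(proj[0])
    for j in range(n):
        for i in range(n):
            x = (i - (n - 1) / 2) * voxel            # voxel centre, slice centred on the axis
            y = (j - (n - 1) / 2) * voxel
            u, v, w = project_voxel(x, y, z, theta, sod, sdd)
            u_idx = u / pixel + (cols - 1) / 2       # physical position -> detector index
            v_idx = v / pixel + (rows - 1) / 2
            slice_buf[j][i] += w * sample_bilinear(proj, u_idx, v_idx)
```

Calling `backproject_view` once per view for each slice reproduces the "every slice from every view" iteration; the angular integration step and any global scaling are left out of this sketch.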

In circular cone-beam volume reconstruction, there are two types of geometric symmetries, which are referred to as the rotational symmetry [_{1} and the voxel position _{2}, _{2}), (_{3}, _{3}), and (_{4}, _{4}). This means that they share the same geometric relation in the projection layout. Backward-projection can be sped up significantly by exploiting rotational symmetry, since the geometry transform matrix described in formula (

Rotational symmetry in projection layout.
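The rotational symmetry can be checked numerically. The sketch below assumes the textbook circular cone-beam geometry (not the paper's own transform matrix), and the helper names are hypothetical. It shows that a voxel rotated by k × 90° about the rotation axis, seen from the view rotated by the same angle, lands on the same detector position with the same distance weight, so one coordinate computation can serve four views.

```python
import math


def project_voxel(x, y, z, theta, sod, sdd):
    """Textbook cone-beam projection (assumed geometry): returns (u, v, weight)."""
    t = x * math.cos(theta) + y * math.sin(theta)
    s = -x * math.sin(theta) + y * math.cos(theta)
    depth = sod - t
    return sdd * s / depth, sdd * z / depth, (sod / depth) ** 2


def rotate90(x, y, k):
    """Rotate (x, y) about the z-axis by k * 90 degrees without trigonometric round-off."""
    for _ in range(k % 4):
        x, y = -y, x
    return x, y


def symmetric_projections(x, y, z, theta, sod, sdd):
    """Project the four rotated copies of a voxel from the four matching views."""
    results = []
    for k in range(4):
        xk, yk = rotate90(x, y, k)
        results.append(project_voxel(xk, yk, z, theta + k * math.pi / 2, sod, sdd))
    return results
```

All four tuples returned by `symmetric_projections` agree up to floating-point round-off, which is why the projections from four views 90° apart can share one set of texture coordinates.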

Arrange the projection images in the four rotational symmetric views of

Employ four textures

In GPU vertex shader, the computation described in formula (

In the GPU fragment shader, the projection images in the four symmetric views are backward-projected and accumulated into the four slice textures, respectively, according to the projection texture coordinates produced by the vertex shader and the subsequent rasterizer stage of the GPU. The four slice textures are then rendered to the GPU frame buffer in the same pass using the Multiple Render Targets (MRT) technique of OpenGL.

The above procedures are repeated with the Ping-Pong technique until the projection images of all views have been backward-projected and accumulated into the four slices. By using the rotational symmetry, the backward-projection effort for the full 360° arc is reduced to that of a single 90° arc.

A new rendering pass is appended in the end. In this pass, as shown in Figure

Strategies for slices accumulating and packing.

The above processes are repeated for every volume slice from every projection view, and then the entire reconstructed volume is updated.
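The Ping-Pong accumulation used in the steps above can be illustrated with a small CPU sketch. It is only an analogy under stated assumptions: two plain Python buffers stand in for the two textures, and the function name `accumulate_ping_pong` is hypothetical. The point is that each pass reads one buffer and writes the other, since a rendering pass cannot read from the texture it is rendering to.

```python
def accumulate_ping_pong(contributions, size):
    """Sum per-view contributions into a size x size slice using two alternating buffers.

    Each pass reads `src` and writes `dst`; the two roles are then swapped, so the
    most recent result is always available as the read buffer of the next pass.
    """
    src = [[0.0] * size for _ in range(size)]
    dst = [[0.0] * size for _ in range(size)]
    for contrib in contributions:
        for j in range(size):
            for i in range(size):
                dst[j][i] = src[j][i] + contrib[j][i]   # read old sum, write new sum
        src, dst = dst, src                             # swap the roles of the buffers
    return src
```

With MRT, the same pattern would be applied to four pairs of buffers at once, one pair per symmetric view.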

Another type of symmetry is known as vertical symmetry in circular cone-beam projection layout. As shown in Figure

The vertical symmetry in circular cone-beam projection layout.
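A minimal numerical sketch of the vertical symmetry follows, again assuming the textbook magnification relation `v = sdd * z / depth` and a hypothetical helper name. Voxels at heights +z and -z that share the same in-plane position project to the same detector column and to mirrored detector rows, so the second position comes at no extra cost.

```python
def mirrored_sample_positions(s, depth, z, sdd):
    """Detector positions for a voxel at +z and for its mirror voxel at -z.

    s     : in-plane offset of the voxel perpendicular to the central ray
    depth : source-to-voxel distance along the central ray
    Both voxels share s and depth, so a single scale factor serves both.
    """
    scale = sdd / depth              # cone-beam magnification at this depth
    u, v = scale * s, scale * z
    return (u, v), (u, -v)           # same column, mirrored row
```

The first pair would be used to sample the upper half of the projection and the second pair the lower half.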

We use vertical symmetry to reduce the number of backward-projection position calculations. When loading a projection image into GPU memory, we read the data in the upper half of the projection image along

For GPU-accelerated algorithms, the projection data must first be loaded into graphics card memory so that the GPU can access them, which requires expensive data transfers between graphics card memory and system memory because of the limited bandwidth. Since the reconstruction of each slice needs the projection images from all projection views, we try to load all the projection images into graphics card memory at one time to save data transfer time. Currently, graphics cards typically have 512 MB or 768 MB of RAM. If the amount of projection data exceeds the graphics card memory capacity, the projection data have to be partitioned into blocks that fit into the graphics card. A new partitioning scheme is employed in our program. As shown in Figure

The partitioning scheme of the reconstructed volume and projections.
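The idea of matching each volume chunk to a band of detector rows can be sketched as follows. This is not the paper's formula: it is a hedged illustration that assumes a cylindrical reconstruction region of radius `radius` centred on the rotation axis, the textbook relation `v = sdd * z / depth`, and hypothetical function names.

```python
def detector_row_range(z_low, z_high, radius, sod, sdd):
    """Detector heights [v_min, v_max] hit by voxels with z_low <= z <= z_high.

    Assumes radius < sod, so the source-to-voxel depth along the central ray
    ranges over [sod - radius, sod + radius].
    """
    near, far = sod - radius, sod + radius
    # The largest |v| occurs at the nearest depth, the smallest |v| at the farthest.
    v_max = sdd * z_high / (near if z_high >= 0 else far)
    v_min = sdd * z_low / (far if z_low >= 0 else near)
    return v_min, v_max


def chunk_row_ranges(z_min, z_max, n_chunks, radius, sod, sdd):
    """Split [z_min, z_max] into equal chunks and return the detector band of each."""
    step = (z_max - z_min) / n_chunks
    return [
        detector_row_range(z_min + k * step, z_min + (k + 1) * step, radius, sod, sdd)
        for k in range(n_chunks)
    ]
```

Under these assumptions the bands of adjacent chunks overlap, because voxels near the source and far from it are magnified differently, so a block of projection rows is generally somewhat taller than a naive equal split of the detector.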

Let _{D}

If the reconstruction volume is divided into _{V}

The corresponding top projection position _{n}_{n}

According to the obtained parameters _{n}_{n}

In conjunction with the geometric symmetries presented in Sections

(1) The projection at each view is divided into blocks according to our partitioning scheme, and each block is further decomposed into two vertically symmetric sub-blocks, an upper sub-block and a lower sub-block, so that vertical symmetry can be exploited.

(2) The data in the upper sub-block of the current block are packed into a texture with RGBA color channels, four rotationally symmetric views per texture, and the data in the lower sub-block are packed in the same way into a second RGBA texture. All data in the current block are transferred from system memory into graphics card memory through these textures for subsequent image reconstruction.

(3) According to the algorithms presented in Sections

(4) Once four slices have been reconstructed, they are packed into an output texture with four channels, one slice per channel, and downloaded into system memory.

Steps (2) to (4) are repeated until every volume slice in every chunk has been processed from all projection views, which completes the image reconstruction of the entire volume.
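The grouping and packing in the steps above can be sketched in a few lines. The sketch assumes that the views are uniformly spaced over 360° and that their number is a multiple of four; nested lists of 4-tuples stand in for RGBA textures, and all function names are hypothetical.

```python
def symmetric_view_groups(n_views):
    """Group view indices into quadruples whose members are 90 degrees apart."""
    if n_views % 4:
        raise ValueError("n_views must be a multiple of 4")
    quarter = n_views // 4
    return [(k, k + quarter, k + 2 * quarter, k + 3 * quarter) for k in range(quarter)]


def pack_rgba(images):
    """Interleave four equally sized 2D images into one image of (r, g, b, a) texels."""
    first, second, third, fourth = images
    return [
        list(zip(r0, r1, r2, r3))
        for r0, r1, r2, r3 in zip(first, second, third, fourth)
    ]


def unpack_rgba(texture):
    """Inverse of pack_rgba: recover the four single-channel images."""
    return [[[texel[ch] for texel in row] for row in texture] for ch in range(4)]
```

For 360 views, `symmetric_view_groups(360)` yields 90 groups starting with `(0, 90, 180, 270)`. The same `pack_rgba` layout would serve both for uploading four symmetric views and for downloading four reconstructed slices, one image per channel.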

To test the gain of our GPU-based acceleration scheme, we have used the FDK algorithm with the GPU-based backward-projection to reconstruct images from computer-simulated data and from real mouse data acquired with a micro cone-beam CT system. The PC used has a 1.83 GHz Intel Xeon 5120 dual-core CPU with 8 GB of system memory. The graphics card is an NVIDIA Quadro FX 4600 with 768 MB of memory. For the simulated data, the source-to-rotation-center distance (SOD) is set to 1660.0 mm, the source-to-detector distance (SDD) is 1900.0 mm, and the size of each detector bin is 0.127 mm

We have performed reconstructions of Shepp-Logan phantom volumes with 512^{3} and 1024^{3} voxels using the FDK algorithm with a GPU-based backward-projection. In these reconstructions, the detector array sizes are 512^{2} and 1024^{2}, and the numbers of projection views are 360 and 720, respectively. The programmable pipeline of the FX 4600 GPU supports 32-bit float precision calculation

Shepp-Logan phantom reconstruction (middle slice): (a) is the true image, (b) is the reconstruction obtained with the GPU-accelerated FDK program, and (c) is a line profile along

Since our graphics card has 768 MB of memory, the projections for reconstructing the volume with 512^{3} voxels can be uploaded into graphics card memory at one time, and the backward-projection takes 7.2–7.7 seconds. In contrast, the projections in 32-bit float precision for reconstructing the volume with 1024^{3} voxels are too large to be transferred to graphics card memory at one time. The projections need to be partitioned to fit into the graphics card memory on the basis of our partitioning scheme presented in Section _{D}
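The memory argument can be checked with a short calculation based on the detector sizes and view counts given above. The sketch assumes 4 bytes per value (32-bit float) and interprets 768 MB as 768 × 1024² bytes; the function name is hypothetical.

```python
MIB = 1024 ** 2
CARD_BYTES = 768 * MIB   # assumed interpretation of the 768 MB graphics card memory


def projection_bytes(det_rows, det_cols, n_views, bytes_per_value=4):
    """Total size of a projection data set stored as 32-bit floats."""
    return det_rows * det_cols * n_views * bytes_per_value


small = projection_bytes(512, 512, 360)     # data set for the 512^3 volume
large = projection_bytes(1024, 1024, 720)   # data set for the 1024^3 volume

print(small // MIB, "MiB, fits:", small <= CARD_BYTES)
print(large // MIB, "MiB, fits:", large <= CARD_BYTES)
```

The first data set occupies 360 MiB and fits on the card, whereas the second occupies 2880 MiB and does not. The slice textures also consume graphics memory, so the usable budget for projections is smaller than the full card capacity.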

We have also applied the GPU-accelerated FDK algorithm to the real mouse data acquired with a micro cone-beam CT scanner. The projection size for each projection view is ^{3}. Again, this data set is too large for one-shot reconstruction, and the projection data need to be partitioned. With the projection data divided into 3 partitions, the backward-projection takes about 14.5–15.2 seconds. Figure

A mouse reconstructed with the GPU-accelerated FDK program from the micro-CT data: (a) the sagittal slice of the mouse, (b) the middle transverse slice of the mouse.

In this work, we have investigated and implemented a GPU-based 3D cone-beam CT image reconstruction algorithm for large data volumes, and evaluated the GPU-based implementation using computer-simulated data and real micro-CT data. The GPU-based implementation using geometric symmetries speeds up the backward-projection process by a factor of about 110–120 for a circular cone-beam scan, compared with the CPU-based implementation on the same PC. The volumes reconstructed by the GPU and the CPU have virtually identical image quality. Further work is in progress to apply our algorithms to iterative image reconstruction methods for cone-beam CT.

This work is partially supported by the National Natural Science Foundation of China (Grant no. 60472071 and no. 60532080), and the New Star Plan of Science and Technology of Beijing, China (Grant no. 2005B49). The authors gratefully acknowledge Professor Pan Xiaochuan and Dr. Bian Junguo from the University of Chicago for providing the real mouse data acquired with a micro cone-beam CT scanner.