3D Alternating Direction TV-Based Cone-Beam CT Reconstruction with Efficient GPU Implementation

Iterative image reconstruction (IIR) with sparsity-exploiting methods, such as total variation (TV) minimization, claims potentially large reductions in sampling requirements. However, the computation complexity becomes a heavy burden, especially in 3D reconstruction situations. In order to improve the performance for iterative reconstruction, an efficient IIR algorithm for cone-beam computed tomography (CBCT) with GPU implementation has been proposed in this paper. In the first place, an algorithm based on alternating direction total variation using local linearization and proximity technique is proposed for CBCT reconstruction. The applied proximal technique avoids the horrible pseudoinverse computation of big matrix which makes the proposed algorithm applicable and efficient for CBCT imaging. The iteration for this algorithm is simple but convergent. The simulation and real CT data reconstruction results indicate that the proposed algorithm is both fast and accurate. The GPU implementation shows an excellent acceleration ratio of more than 100 compared with CPU computation without losing numerical accuracy. The runtime for the new 3D algorithm is about 6.8 seconds per loop with the image size of 256 × 256 × 256 and 36 projections of the size of 512 × 512.


Introduction
Recently, iterative image reconstruction (IIR) algorithms [1][2][3][4][5][6], especially compressive sensing (CS) [7][8][9][10] based ones, have been developed for X-ray computed tomography (CT). As is widely known, CS based IIR algorithms can provide much higher image quality than the popular Feldkamp-Davis-Kress algorithm [11] (FDK) under sparse views. Constrained total variation (TV) based methods obtain impressive results for sparse view reconstruction in CT imaging [3,12]. Although theoretical researches show that IIR possesses great advantages over analytical ones in image quality, it is still far from being put into practical use due to the expensive computation cost, especially for cone-beam computed tomography (CBCT). Fast image reconstruction is often required in clinical use to reduce the waiting time for the patient. Reconstruction speed is even more critical in realtime imaging applications, such as cardiac CBCT or online therapy.
Researchers in both optimization theory and hardware acceleration have made lots of progresses, aiming at developing more robust and efficient methods. The development of TV minimization indicates that the alternating direction method (ADM) [12,13] can provide relatively better results. The representative algorithms using ADM are Lagrangian function based ones [14] and split Bregman method [15]. The two ADM based methods are equivalent under linear constraints. Both of the two kinds of optimization methods have been applied in CT reconstructions [12,13,16]. From another point of view, CBCT reconstruction can be regarded as an instance of high-performance computing [17]. Therefore, parallel processing can serve as an acceleration technique.
Originally designed for accelerating the computer graphics computation, the graphics processing unit (GPU) has emerged as a versatile platform for running massively parallel computation [18][19][20][21]. GPU provides clear advantages for CBCT image reconstruction: high memory bandwidth, high computation throughput, support for floating-point arithmetic, low price cost, and friendly programming interface. The acceleration of the filtered back-projection type algorithms using GPU represents a classic implementation of a nongraphics application on dedicated graphics hardware [22]. With the development of compute-specific APIs, CBCT reconstruction was accelerated using Brook [23] and CUDA [24][25][26]. In CUDA, FDK acceleration mainly focuses on back-projection making use of techniques of thread assigning, memory optimization, in-built arithmetic instructions, and so on [25,26]. However, parallel processing for IIR algorithms meets new issues because these algorithms are fundamentally sequential. GPU acceleration needs sufficiently parallel workload [27]. Therefore, algorithms with minimal computation within each loop are not proper for GPU implementation. For instance, the algebraic reconstruction technique (ART) is not suitable for the GPU because each loop only processes a single beam. A more suitable algorithm is simultaneous ART (SART), which updates the image after the back-projection of an entire projection view. Several other iterative algorithms were also adapted to the GPU [28][29][30], including total variation reconstruction [31].
This paper proposes an efficient 3D IIR algorithm based on alternating TV minimization method for CBCT reconstruction based on GPU acceleration. An inexact ADM iteration using local linearization and proximity technique is adopted to avoid the pseudoinverse calculation. The experiments using both simulation and real CT data prove that the proposed algorithm for CBCT is both fast and accurate. The paper is outlined as follows. Section 1 briefly discusses the incomplete data CBCT reconstruction problems and related works. Section 2 shows the new method in detail and its parallelization analysis. The CUDA implementation and experiments on both simulation and real data results are introduced and shown in Section 3. Finally, Section 4 brings a brief discussion and conclusion.

Methods
A CBCT scanning system mainly consists of an X-ray source, interested object, and a flat panel detector. From a discrete to discrete point of view, the image system can be modeled as the following linear system: where vector ∈ R rays has a length of rays which is the vectorization of the projection data; the vector ∈ R voxels has a length of voxels which stands for the discrete vectorized form of the object function. Matrix ∈ R rays × voxels models the imaging system which has rays rows and voxels columns. In this work, the value in the system matrix is modeled using the ray intersection length with the cubic voxel. For incomplete angle problem, (1) is always undersampled and illconditioned. The CS theory indicates that the linear system can achieve exact solution under certain sparse representation by the following L1-norm minimization: * = arg min Ψ ( ) 1 , (2) For CT images, it is always the case that most of the images have very sparse gradient-magnitude images (GMI) [3]. It is a good tool to use GMI for CS based image reconstruction which is the origination of the famous TVbased algorithms.

Review of Alternating Direction TV Minimization Reconstruction.
First of all, a brief review of alternating direction TV minimization reconstruction (ADVTM) [12] algorithm is carried out for the completeness of this paper. Apply the TV regularization to (1); then we will get the constrained TV minimization reconstruction model. Here, we use the anisotropic TV for CBCT reconstruction; that is, ‖Ψ( )‖ 1 = ‖ ‖ ≜ ∑ ‖ ‖ 1 , = 1, 2, 3. Here, 1 , 2 , and 3 stand for the differential operator in , , and directions. The unconstrained form of (2) can be written as where stands for the penalty factor. Let = ; equation (3) can also be transformed as The corresponding Lagrangian function of the above problem is min ( , , ) = min where ∈ R voxels is multiplier, and ∈ R is the factor for square formation. Under the ADM framework, splitting the variables and , we get the following iteration form: Computational and Mathematical Methods in Medicine 3 The minimization with respect to has the following closed form solution: ) .
For the minimization with respect to , the optimization is a quadratic function and set its derivative to 0: The basic idea to find the solution to the above equation is to calculate the pseudoinverse of ∑ + . Therefore, the exact solution for the ( + 1)th iteration of the above subproblem is as in the following expression: where + stands for the Moore-Penrose pseudoinverse of matrix . The update form of multipliers is Therefore, the ADTVM algorithm has the following iteration form.

End Do
The ADTVM algorithm use exact solutions for each subproblem at each iterative loop and it has the assurance of the convergence. The application of ADTVM algorithm for 2D reconstruction has already shown some impressive results [12].

The 3D Inexact Alternating Direction Reconstruction.
It can easily be seen that the ADTVM reconstruction has a very simple iteration form, and its convergence property makes it a robust algorithm. However, let us take a more careful analysis of the above algorithm. In fact, it should be pointed out that the ADTVM iteration involves a very expensive calculation of the pseudoinverse for a huge matrix ∑ + . More seriously, the ADTVM may fail in cone-beam reconstruction for even a small scale of 3D data set, saying a cube having size of 256 × 256 × 256. Actually, it is impossible to have such huge memory to store the cone-beam system matrix for a personal computer. Consequently, for a cone-beam reconstruction problem, the ADTVM is actually not applicable for it cannot be implemented. In fact, methods that only use and its transpose make sense in finding the solution to cone-beam reconstruction problems. Therefore, it is essential to develop a more practical and efficient algorithm for 3D reconstruction based on alternating direction method.
In this subsection, a practical alternating direction reconstruction using local linearization and proximity technique is proposed with GPU aided computation. In matrix computation theory [31], matrix with some special structures, such as diagonal matrixes or those which can be diagonalized by FFTs, can help in improving the calculation performance greatly. However, for the general matrix in CBCT, is neither diagonal nor FFT diagonalizable. We adopt an inexact strategy to tackle this subproblem of minimization for . For minimization with respect to , the fidelity term of ‖ − ‖ 2 in (6), that is, the term containing , is linearized at the current point ( ) and its proximal form is where = ( ( ) − ) is the gradient of ‖ − ‖ 2 at the current point of ( ) , and > 0. Consequently, the subproblem of can be converted into the following form: Set the derivative of the above quadratic function to 0, we get Under the periodic boundary condition, ∑ is a block circulant matrix. Therefore, the coefficient matrix on the left hand side of (13) can be diagonalized by three-dimensional fast Fourier transform F 3 via composed by the elements on the diagonal of . Apply 3D Fourier; transform both sides of (13); the solution of (13) can be computed efficiently by where the division of / is a component-wise operation. The new algorithm is implemented as the following list. (3) Update using ( +1) = + ( ( +1) − ( +1) ); (4) ← + 1

End Do
It can easily be seen that the calculation of ( +1) is closely related to ( ) which is different from that in Algorithm 1. Notably, the ADTVM involves the calculation of and ( ∑ + ) + which can only be implemented based on storing the system matrix beforehand. However, even for the occasion of 2D reconstruction, the system matrix is actually so tremendous that its pseudoinverse calculation is very time consuming. Furthermore, for 3D situation for ADTVM, there is no such a huge storage device which can accommodate such a big system matrix. Consequently, the pseudoinverse computation in ADTVM is very impractical or even impossible to be implemented for 3D reconstruction because of time and memory consumption. The new algorithm utilizes the linearization technique which ably avoids the bother of storing the system matrix. Moreover, the new method also averts the horrible computation of and ( ∑ + ) + . In addition, the involved FFT techniques can further improve the computation efficiency. These characteristics make the new algorithm an indispensable method for cone-beam image reconstruction based on the alternating direction method. The convergence property is guaranteed and discussed in detail in [32].

GPU Implementation.
The related forward-and backward-projection operations in ( ( ) − ) has very high complexity for CPU computation. Generally, the forwardprojection in Algorithm 2 can be defined as where is the attenuation coefficient, , is the value in the system matrix at position of ( , ), and is the set containing all the indices of voxels that have nontrivial intersections with the beam . Analogously, the backwardprojection can be defined as where is the set containing all the indices of beam that have nontrivial intersections with the voxel . The iteration of Algorithm 2 is simple but convergent. Although there are only one forward-and one backward-projection operation in ( ( ) − ) at each iterative loop, these two operations can occupy most of the computation time. For more efficient implementation, more advanced hardware optimization besides local linearization and proximity technique in algorithm design should be taken into consideration. Traditional method for calculating the forwardprojection is the ray tracing method proposed by Siddon. Siddon's algorithm uses a parametric line representation of the beam which makes the complexity of computing the intersection lengths of each beam with 3D domain still with respect to 1D line. For CBCT reconstruction, the system matrix is so tremendous that Siddon's algorithm is not suitable for computing both forward and backward projections simultaneously.
For efficient computation, a fast and parallel algorithm [33] for forward and backward projections is utilized in this paper. A brief review of this algorithm is given here and the detailed interpretation can be found in [33]. When computing the forward-projection, the 3D region of the object is divided into a group of planes in one direction according to the slope of the beam. This limits the number of the voxel intersected with the beam within quite few situations. Computing the length can be executed in parallel by each plane, which makes the calculation pretty efficient. When dealing with the backward-projection, the parallelization can be realized in parallel for each voxel. In finding the corresponding beams, a shadow region method is utilized [33].
Although iterative algorithm is fundamentally sequential, the reconstruction algorithm we designed here for CBCT can be implemented efficiently with the aid of GPU considerably. The three update formulas can all be computed on GPU for speedup. The operations involved in the proposed method mainly include matrix-vector multiplications and vector additions. These operations include , , , and . The operation of and can be straightforwardly put into GPU calculation, with each thread computing the difference of a voxel. The most expensive calculation parts are and which stand for forward-and backwardprojections. With the aid of the fast and parallel algorithm, the forward-and backward-projections can potentially be accelerated significantly. With the GPU aided computation, the flow chart of the proposed algorithm is shown in Figure 1.  (4) Solve f subproblem using formula (14) (5) Solve z subproblem using formula (7) Figure 1: Flow chart of the proposed inexact alternating direction CBCT reconstruction algorithm. Blocks 4-6 correspond to 1-3 of Algorithm 2.

Computation Efficiency.
To evaluate the performance of the CUDA aided implementation, we implement and test the operation of , , , and both on CPU and on GPU. In addition, there are two data sets running on NVIDIA Tesla K20c. This GPU device has 2496 CUDA cores and 5120MB global memory. In the performance test, a 3D digital Moby mouse phantom in which the attenuation coefficient is in 0.0∼1.0 is utilized. A single circle trajectory is utilized for cone-beam scanning. The source to axis distance is 30 cm, and the source to the center of the flat panel distance is 60 cm. The detector panel has a size of 12.8 cm × 12.8 cm. The phantom has a size of 6.4 cm × 6.4 cm × 6.4 cm. The projection data are collected by 36 equally angular views in 360 degrees.
In order to test the four operations under different data sets, two kinds of discretization are applied which is listed in Table 1. All the experiments are carried out on the workstation configured with dual cores of Intel Xeon CPU of E5-2620 @ 2.10 GHz (only one core was used) equipped with Tesla K20c. The time consumption for both CPU and different GPU is listed in Table 2 together with its speedup. All the time consumption is calculated by the statistical average of fifty times. The computation between CPU and GPU is expressed in root mean square error (RMSE) by RMSE = √ ∑(( CPU − GPU ) 2 / ), where stands for the total number of values; CPU and GPU stand for CPU results and GPU results, respectively. The RMSEs are listed in Table 3. The speedup in Table 2 shows that the acceleration strategy applied here can improve the performance greatly with GPU while Table 3 indicates that the numerical differences can be ignored.

Reconstruction Verifications.
The reconstruction algorithm proposed in this paper is composed of , , , , and a few vector additions and comparisons. In this subsection, reconstruction using both simulation data and real CT projections is carried out. The goal is to test the performance of the entire routine of the new algorithm and the image reconstruction quality. For the reconstruction of simulation data, the above data set of situation 2 in Section 3.1 is utilized. Its scanning configuration is the same as that in Section 3.1. For the real data reconstruction, projections of a medical head phantom are acquired with the cone-beam CT system which mainly consists of a flat panel detector (Var-ian4030E, USA) and an X-ray source (Hawkeye 130, Thales, France). The distance between source and the rotation axis of scanner is 678 mm and the distance between source and the detector is 1610 mm. The detector bin has a size of 0.508 mm × 0.508 mm. The projection size is 768 pixels × 432 pixels × 72 views. The size of reconstruction image is 384 voxels × 384 voxels × 216 voxels with 0.214 mm × 0.214 mm × 0.214 mm per voxel.
In the reconstructions, the proposed algorithm is compared with FDK algorithm and the adaptive-steepestdescent-POCS (ASD-POCS) [3] algorithm. The parameters of the new method are empirically chosen as = 1, = 1, and = 1. The parameters of ASD-POCS are the same as those in [3]. The iteration number of both simulation and real data reconstruction is 100. The simulation reconstruction results are shown in Figure 2, where a 3D slice of = 31, = 128, and = 128 is presented. The RMSEs for ASD-POCS and the proposed method for simulation reconstruction are listed in Table 4. The convergence behavior of the new method for simulation is drawn in Figure 3. The real CT data experiments use 72 equally angular views. Reconstructions of the FDK, ASD-POCS, and the new method are shown in Figure 4.
The reconstruction results of FDK algorithm in Figures 2 and 4 suffer from streak artifacts so severely that the useful and detail structures are degraded or even lost. Therefore, the FDK reconstructions from 36 or 72 views can hardly be put into practical use. The ASD-POCS and the proposed algorithms provide satisfying image quality. The reconstruction results of these two methods do not show visible differences. Meanwhile, the RMSEs behavior of the new method in Figure 3 shows a robust convergence. The time consumptions for each reconstruction are listed in Table 5. From this table, it can be seen that the GPU device plays the key role for improving the reconstruction performance, and the acceleration ratio of the new method for GPU compared with CPU is about 106 for simulation and 120 for real data reconstruction, respectively. The reconstruction qualities of the proposed algorithm for simulation data and real data are both satisfying and are potential to be put into practical use.

Discussion and Conclusion
Reconstruction performance is an important issue and its acceleration is of crucial significance for iterative algorithms and this paper try to do some related work. This paper has proposed a GPU based alternating direction reconstruction method for cone-beam CT imaging. The new method utilizes a local point linearization and proximity strategy  which avoids the calculation of pseudoinverse of matrix. The proximal process applied in the new algorithm makes it efficient and applicable for CBCT reconstruction using the ADM routine. Although the new method utilizes an approximate or inexact strategy to tackle the subproblem, the reconstructions in both simulation and real data experiments show a robust convergence property. In fact, the augmented Lagrangian function (5) is expected to be minimized by solving subproblem and subproblem alternately. Therefore, solving these two subproblems accurately at each sweep may be unnecessary. Furthermore, the advantages for the inexact strategy are not only avoiding the pseudoinverse computation, but also making the reconstructions able to efficiently be launched on GPU cards which is a key to improve the overall performance. Each calculation of the subproblems has some computation parts that can be executed in parallel on GPU cards, and the acceleration ratio for these parts can be rather high. The most important matter focuses on accelerating the most time consumption parts which will make an outstanding improvement. For the entire algorithm, the acceleration ratio is a little lower than that of each part which is mainly due to the serial computation parts running on CPU. The results in the reconstruction experiments show a considerable acceleration for the new algorithm while the reconstruction qualities are well kept.
The new algorithm applies a highly efficient technique to settle the difficulties faced by ADTVM in cone-beam imaging. Actually, the technique utilized in this paper is ingenious but necessary. The proximal method has no influence on the convergence of the algorithm. It is robust and its 3D reconstructions are both accurate and fast. Although the application presented here is circular cone-beam CT, it is clear that this algorithm and its GPU acceleration can be applied to other tomographic imaging modalities with linear system models. Future work will focus on further improving and optimizing the acceleration efficiency, so that the algorithm can be more practical for actual scanning systems.