Parallel Implementation of Katsevich's FBP Algorithm

For spiral cone-beam CT, parallel computing is an effective approach to resolving the problem of heavy computation burden. It is well known that the major computation time is spent in the backprojection step for either filtered-backprojection (FBP) or backprojected-filtration (BPF) algorithms. By the cone-beam cover method [1], the backprojection procedure is driven by cone-beam projections, and every cone-beam projection can be backprojected independently. Basing on this fact, we develop a parallel implementation of Katsevich's FBP algorithm. We do all the numerical experiments on a Linux cluster. In one typical experiment, the sequential reconstruction time is 781.3 seconds, while the parallel reconstruction time is 25.7 seconds with 32 processors.


INTRODUCTION
Spiral cone-beam CT can be used for rapid volumetric imaging with high longitudinal resolution and for efficient utilization of X-ray source. Katsevich's filtered-backprojection (FBP) inversion formula represents a significant breakthrough in this area [2][3][4]. However, sequential implementation of this formula demands more intensive computation and hardware resources [2,3]. Parallel computation technique provides an effective solution to this issue.
Parallel computation technique was previously applied to 2D CT. Nowinski [5] investigated four forms of parallelism in 2D CT: pixel, projection, ray, and operation parallelisms. Using the data-parallel programming style, Roerdink and Westenberg [6] studied parallel implementation for two standard 2D reconstruction algorithms: FBP and direct Fourier reconstruction. There were also some results on parallel implementations of 3D CT image reconstructions for parallelbeam and fan-beam geometries [7,8]. For cone-beam CT, parallel implementations of the Feldkamp algorithm on Beowulf clusters were reported, typically based on smart communication schemes [9,10]. In [9], a master node does all the weighting and filtering of the projections, while the other nodes perform the backprojection for their assigned image subvolumes, respectively. In [10], each node weights and filters the projection data assigned to it, and then accomplishes the backprojection for its assigned image subvolume in a send-receive mode.
For image reconstruction from cone-beam projections, the backprojection step is extremely time-consuming. In [1], the authors proposed the cone-beam cover method for the backprojection. This method is different from the conven-tional methods based on PI line, and provides an alternative efficient implementation scheme for Katsevich's FBP formula and its kind. In the cone-beam cover method, any filtered cone-beam projection can be backprojected independently. On the ground of this independence, we present a parallel implementation for Katsevich's exact reconstruction formula in this manuscript. Numerical simulations demonstrate the high performance of the proposed parallel scheme with remarkable runtime reduction. For the 3D Shepp-Logan phantom [11] with 256 × 256 × 256 voxels, 3 × 600 source points (600 points per turn) and 100 × 500 cone-beam projection at every source point, the sequential cone-beam cover algorithm takes 781.3 seconds to reconstruct the volume image, while the proposed parallel implementation only needs 25.7 seconds with 32 processors on the same Linux cluster.
The structure of this manuscript is as follows. Section 2 introduces briefly Katsevich's inversion formula and the cone-beam cover method. Section 3 describes the proposed parallel implementation of Katsevich's inversion formula. Section 4 provides experiment results and details of the computing environment. Section 5 discusses relevant issues. Finally, Section 6 concludes the paper.

THE CONE-BEAM COVER METHOD
As shown in Figure 1, let C be a spiral defined by C := y ∈ R 3 : y 1 = R cos(s), y 2 = R sin(s), where s is the angular parameter, h > 0 is the spiral pitch and where 0 < r < R, c < d, a = c − s Δ , b = d + s Δ , s Δ is the necessary offset of the angular parameter [1][2][3][4]. It is assumed that an object f (x) is centered at the coordinate origin and supported by U, that is, f (x) = 0 if x ∈ U. The cone-beam (at vertex y) projection of f is defined by Katsevich's FBP inversion formula [3] can be represented as where is the PI parametric interval which is determined by the PI line L PI (x) passing through x [12], see Figure 2. This formula can be rewritten as where The section of the spiral between y(s b (x)) and y(s t (x)) is called PI arc, denoted as C PI (x).
for the unit vector β along which X-ray emitted from y(s) will reach the Tam-Danielsson window [13,14], see Figure 3.
Equations (5) and (7) imply that the inversion formula is of the filtered-backprojection type. One first computes the shift-invariant filtering of derivative of cone-beam projection using (7) [3,15,16]. Then one performs the backprojection according to (5). Here, for every x ∈ U, the backprojection is performed along PI arc C PI (x) (the PI line method).
In [1], we introduce the cone-beam cover method to perform backprojection in a way different from the PI line method.
cone-beam cover at source point y(s 0 ), where x is the projection of x onto the detector plane, see Figure 4.
The following theorem on the cone-beam cover was proved in [1]. It is the footstone of the cone-beam cover method.
Applying Theorem 1 to the discrete form of (5): we obtain the following equation Equation (11)  method provides an approach to performing backprojection.
The key point is as follows. At any source point y(s), the cone-beam projection is truncated by those unit vector β along which X-rays emitted from y(s) will reach a region slightly larger than Tam-Danielsson window W(s) on the detector plane [3]. As shown in Figure 5, the region is bounded by the parallelogram. Dealing with this cone-beam projection, one first computes the shift-invariant filtering of derivative of the cone-beam projection using (7) along the lines L(S 2 ) (see Figure 5), as discussed elsewhere [3,16]. Then one performs the backprojection according to (11). Here with the filtered cone-beam projection at source point y(s) the backprojection is performed only for x ∈ V (s). Note that using the above strategy, both filtering and backprojecting on a single cone-beam projection can be performed independently. This property of the strategy leads to the sequential and parallel algorithms we will describe later.
A sequential implementation of Katsevich's FBP formula is stated in Algorithm 1. More details and experiment results about this implementation can be found in [1]. Step 1. for each voxel x of image space, set f (x) = 0.
Step 2. for each source point y(s), s ∈ I do (1) Filtering of derivative of the cone-beam projection, get Ψ(s, β(s, x)).

PARALLEL IMPLEMENTATION
We parallelize the above sequential implementation by partitioning the source points set. Let I be a discretization of I (angular parameter interval), p be the number of processors. If I is partitioned into p subsets then a parallel implementation of Katsevich's FBP formula can be pseudocoded Algorithm 2.
In Algorithm 2, each processor j just operates on its assigned cone-beam projections and gets f j (x), and no communication between different processors is needed during the processing.

EXPERIMENTS
We do all computations on the Beowulf cluster CCSE-HP at Peking University. The cluster consists of 1 master node, 2 login nodes, 4 I/O nodes, and 128 computing nodes. All the nodes are on a Gigabit Ethernet. The computing nodes are also connected by 4 X InfiniBand (10 Gbps). The system configuration is listed in Table 1.   β(s, x)).
(2) Backprojecting for x ∈ V (s) The parameters of the data collection protocol are given in Table 2.
We use the 3D Shepp-Logan phantom [11] to test our parallel algorithm. The phantom consists of 10 ellipsoids as specified in Table 3. Figure 6 shows the reconstructed images from sequential and parallel implementations at slices of x 3 = −0.255, x 2 = −0.067, and x 1 = −0.067, respectively. We use the gray scale window [1.01, 1.03] to make low contrast features visible.
In terms of runtime, speedup S is defined as the ratio of the time taken to solve a problem on a single processor to the time required to solve the same problem on a parallel computer with P identical processors. Efficiency E is defined as the ratio of speedup to the number of processors: where T 1 and T P represent the computing time for one processor and P processors respectively [17]. Figures 7 and 8 display the speedup and efficiency of Algorithm 2, respectively.

DISCUSSIONS
Since the computation of image reconstruction in Algorithm 2 is mathematically equivalent to that in Algorithm 1, the image reconstructed by Algorithm 2 is the same as that done by Algorithm 1 .  Figures 7(a) and 8(a) depict the algorithm's speedup and efficiency with the uniform partition of I. We can see that the speedup increases from 1.0 to 22.9 while the efficiency decreases from 100.0% to 71.5% when the number of processors varies from 1 to 32. Obviously, the efficiency is not satisfactory with four or more processors. Figure 9 displays      the wall time 1 for each processor. The graph looks like a trapezoid, exhibiting a problem of load imbalance. The processors in the middle bear higher load, while those on the two sides bear lower load.
The reason is as follows. When the X-ray source point goes near the bottom or top of the image space, the conebeam cover becomes smaller and smaller, making the load of the corresponding processors lower and lower. Nonuniform partition of I can overcome the load imbalance. The following describes a scheme of nonuniform partition of I.
Recall that I = [a, b] is the angular parameter interval of the spiral, the interval [c, d] expresses the axial position of image space U. Let s Δ be the maximal axial-direction distance of points in the cone-beam cover from the source point, then Figure 10 illustrates computation time for the filtering and backprojecting at source point y(s).
We divide [a, a + 2s and let determines a nonuniform partition of I. Figures 7(b) and 8(b) display the algorithm's speedup and efficiency with the above nonuniform partition of I. We can see that the speedup increases from 1.0 to 30.4 and the efficiency decreases from 100% to 95% when the number of processors varies from 1 to 32. The speedup and the efficiency with the nonuniform partition of I are apparently better than those with the uniform partition of I. The algorithm's speedup and efficiency maintain a higher level with larger image sizes. The algorithm's speedup and efficiency with the image volume 512 × 512 × 512 are displayed in Figure 11. The idea of the cone-beam cover is a key point for us to design the parallel implementation of Katsevich's inversion formula. From the perspective of the cone-beam cover, backprojecting any filtered cone-beam projection can be performed independently. Therefore a partition of conebeam projections gives a parallel implementation of image reconstruction. The presented results demonstrate the high performance of the proposed parallel algorithm. The cone-beam cover method can also be employed to establish parallel schemes for other reconstruction algorithms, such as Feldkamp-type algorithms [11,18], Zou and Pan's algorithm [19,20]. In addition, parallel implementation of the Katsevich's FBP algorithm based on PI line method would be an important topic of future research.

CONCLUSIONS
For spiral cone-beam CT, parallel computing is an effective approach to improving the reconstruction efficiency. Basing on the cone-beam cover method, we have proposed an efficient parallel implementation of Katsevich's FBP algorithm and demonstrated its high performance with numerical simulations. Further work is under active investigation.