
Diffusion Tensor Imaging (DTI) enables the noninvasive measurement of water diffusion in fibrous tissue. By reconstructing the fibers from DTI data using a fiber-tracking algorithm, we can deduce the structure of the tissue. In this paper, we outline an approach to accelerating such a fiber-tracking algorithm using a Graphics Processing Unit (GPU). This algorithm, which is based on the calculation of geodesics, has shown promising results for both synthetic and real data, but its applicability is limited by its high computational requirements. We present a solution that uses the parallelism offered by modern GPUs, in combination with NVIDIA's CUDA platform, to significantly reduce the execution time of the fiber-tracking algorithm. Compared to a multithreaded CPU implementation of the same algorithm, our GPU mapping achieves a speedup factor of up to 40.

Diffusion-Weighted Imaging (DWI) is a recent, noninvasive Magnetic Resonance Imaging (MRI) technique that allows the user to measure the diffusion of water molecules in a given direction. Diffusion Tensor Imaging (DTI) [

The process of using the measured diffusion to reconstruct the underlying fiber structure is called fiber tracking. Many different fiber tracking algorithms have been developed since the introduction of DTI. This paper focuses on an approach in which fibers are constructed by finding geodesics on a Riemannian manifold defined by the DTI data. This technique, called geodesic ray-tracing [

One of the largest downsides of this algorithm is that it is computationally expensive. Our goal is to overcome this problem by mapping the geodesic ray-tracing algorithm onto the highly parallel architecture of a Graphics Processing Unit (GPU), using NVIDIA's CUDA platform. Since fibers can be computed independently of each other, the geodesic ray-tracing algorithm can be meaningfully parallelized. As a result, the running time can be reduced by a factor of up to 40 compared to a multithreaded CPU implementation. This paper describes the structure of the CUDA implementation, as well as the relevant design considerations.

In the next section, we discuss the background theory related to our method, including DTI and the geodesic ray-tracing algorithm. Next, we give an overview of past research related to the GPU-based acceleration of fiber tracking algorithms. In Section

DTI allows us to reconstruct the connectivity of the white matter, thus giving us greater insight into the structure of the brain. After performing DWI for multiple different directions, we can model the diffusion process using a second-order tensor

3D glyphs visualizing diffusion tensors. The orientation and sharpness of the glyphs depend on the eigenvectors and eigenvalues of the diffusion tensor, respectively. In this image, the glyphs have been colored according to the orientation of the main eigenvector (e.g., a main eigenvector of

(a) Small group of fibers generated using a simple streamline method. The main eigenvectors of the diffusion tensors determine the local orientation of the fibers. (b) Fibers showing part of the Cingulum and the corpus callosum. Both the glyphs in the left image and the plane in the right image use coloring based on the direction of the main eigenvector, similar to Figure

One possible solution to the limitations of classic streamline methods is to use a global minimization solution, for example, a

The focus of this paper is a relatively new fiber tracking algorithm based on geodesics, as proposed by Sepasian et al. [

Fibers of part of the corpus callosum, computed using the streamlines method (a) and the geodesic ray-tracing method (b). Fibers were computed from seed points located in the center of the corpus callosum and are colored according to their local orientation (similar to the glyphs in Figure

A geodesic is defined as the shortest path on a Riemannian manifold. This is a real, differentiable manifold, on which each tangent space is equipped with a so-called

In the algorithm discussed in this paper, the trajectory of a fiber is computed iteratively by numerically solving a set of ODEs. The ODEs used to compute the trajectory of the fiber are derived from the theory of geodesics in a Riemannian manifold, as shown below.

Let

The geodesic is the line that

Let

Here,

Given an initial position and an initial direction, we can construct a path through the 3D DTI image using the second-order Runge-Kutta ODE solver. The initial position is usually specified by the user, who is interested in a particular area of the tissue (in our case, the white matter). For the initial direction, we can use a large number of directions per seed point (distributed either uniformly on a sphere, or around the main eigenvector), and see which of the resulting fibers intersect some user-specified target region. Doing so increases our chances of finding all valid connections between the seed point(s) and the target region(s).
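The second-order Runge-Kutta stepping described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the CUDA kernel itself; the `christoffel` callable, which returns the symbols at a given position, is an assumption of the sketch and stands in for the tensor machinery described later.

```python
import numpy as np

def geodesic_rk2_step(x, v, christoffel, h):
    """One second-order Runge-Kutta (midpoint) step of the geodesic ODE.

    The geodesic equation  d2x^i/dt2 + Gamma^i_jk v^j v^k = 0  is written
    as the first-order system  x' = v,  v'^i = -Gamma^i_jk v^j v^k.
    `christoffel(x)` must return a 3x3x3 array Gamma[i, j, k].
    """
    def accel(x, v):
        G = christoffel(x)
        return -np.einsum('ijk,j,k->i', G, v, v)

    # Midpoint rule: evaluate the derivatives half a step ahead.
    xm = x + 0.5 * h * v
    vm = v + 0.5 * h * accel(x, v)
    return x + h * vm, v + h * accel(xm, vm)

# Sanity check: in a flat (Euclidean) metric all Christoffel symbols
# vanish, so the geodesic must be a straight line.
flat = lambda x: np.zeros((3, 3, 3))
x, v = np.zeros(3), np.array([1.0, 0.0, 0.0])
for _ in range(100):
    x, v = geodesic_rk2_step(x, v, flat, 0.01)
```

Replacing the midpoint evaluation by a single derivative evaluation at `x` turns this into the Euler solver used for comparison later in the paper.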

This approach, which is referred to as the Region-to-Region Connectivity approach, requires a suitable

The possibility of using the GPU to accelerate fiber tracking algorithms (using CUDA or other languages) has recently been explored in other literature [

More recently, Mittmann et al. introduced a GPU implementation of a simple streamline algorithm using CUDA [

Jeong et al. [

We therefore conclude that our solution is the first CUDA-aided acceleration of a fiber tracking algorithm of this complexity. As will be shown in the next sections, the complex nature of the algorithm introduces a number of challenges to the parallelization process. Furthermore, the advantages of the geodesic ray-tracing algorithm, as discussed in Section

In Section

Here,

To summarize, a single integration step of the ODE solver consists of the following actions.

Compute the inverse diffusion tensor and its derivative in all three dimensions for the eight voxels surrounding the current fiber point (

Interpolate the four tensors at the current fiber point.

Compute all Christoffel symbols. According to (

Using the Christoffel symbols, compute the position and direction of the next fiber point.

Repeat steps 1 through 4 until some stopping condition is met. By default, the only stopping condition is the fiber leaving the volume of the DTI image.

We note that the first step may be performed as a preprocessing step. While this increases the amount of memory required by the algorithm, it also significantly decreases the number of required computations per integration step. This preprocessing step has been implemented in CUDA, but due to its trivial implementation and relatively low running time (compared to that of steps 2 through 5), we do not discuss it in detail, instead focusing on the actual tracking process. Figure

Flowchart for the geodesic fiber tracking algorithm. Using the four input tensor fields, we compute the trajectory of a fiber using a numerical ODE solver.
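As an illustration of step 3 above, the Christoffel symbols can be computed from the metric (the inverse diffusion tensor) and its derivatives as follows. This is a hedged NumPy sketch of the standard formula, not the CUDA kernel; it assumes the per-axis derivatives of the metric are supplied as input, as in the preprocessed tensor fields.

```python
import numpy as np

def christoffel_symbols(D, dg):
    """Christoffel symbols of the metric g = D^-1 (inverse diffusion tensor).

    D  : (3, 3) diffusion tensor, i.e. the inverse of the metric g.
    dg : (3, 3, 3) array with dg[m] the derivative of g along axis m.
    Returns Gamma[i, j, k] = 0.5 * D[i, l] * (d_j g_lk + d_k g_jl - d_l g_jk).
    """
    A = dg.transpose(1, 0, 2)   # A[l, j, k] = d_j g_lk
    B = dg.transpose(2, 1, 0)   # B[l, j, k] = d_k g_jl (g is symmetric)
    return 0.5 * np.einsum('il,ljk->ijk', D, A + B - dg)

# With a constant metric all derivatives vanish, so all symbols are zero.
G = christoffel_symbols(np.eye(3), np.zeros((3, 3, 3)))
```

Since each tensor is symmetric, only six of its nine elements are unique, which is what the kernel exploits when reading the four input tensors per voxel.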

The Region-to-Region Connectivity approach outlined in Section

NVIDIA’s

A CUDA-enabled GPU generally consists of the

One other important consideration when designing CUDA algorithms is the memory hierarchy. The two memory blocks local to each multiprocessor—the register file and the shared memory—both have low memory access latencies. However, the available space in these two memory blocks is limited, with typical values of 16 kB for the shared memory, and 16 or 32 kB for the register file. The device memory has far more storage space, but accessing this memory adds a latency of between 400 and 900 clock cycles [

Local memory is used when the register file of a multiprocessor cannot contain all variables of a kernel. Since access latencies to the device memory are very high, the use of large kernels is generally best avoided. As a general rule, communication between the device memory and the multiprocessors should be kept as low as possible. A second important guideline is that the size of the kernels, in terms of the number of registers per thread and the amount of shared memory used per thread block, should be kept small. Doing so will allow the GPU to run more threads in parallel, thus increasing the

General structure of a CUDA-enabled GPU, showing the device memory, the multiprocessors, the scalar processors (SP), and the relevant memory spaces.
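To illustrate why kernel size matters for occupancy, consider a back-of-the-envelope estimate for a register-bound kernel. The numbers below are taken from the text (a 16 kB register file of 4-byte registers, and the 43 registers per thread our kernel was found to use); actual occupancy also depends on shared-memory usage and hardware scheduling limits.

```python
# Registers available per multiprocessor, assuming a 16 kB register file
# of 32-bit (4-byte) registers, as quoted in the text.
register_file_bytes = 16 * 1024
registers_per_mp = register_file_bytes // 4     # 4096 registers
registers_per_thread = 43                       # measured for our kernel
warp_size = 32

# Threads that fit in the register file, rounded down to whole warps,
# since the hardware schedules threads in warps of 32.
max_threads = registers_per_mp // registers_per_thread   # 95
max_warps = max_threads // warp_size                     # 2 full warps
print(max_warps * warp_size, "resident threads per multiprocessor")
```

A kernel using 8 registers per thread could, by the same arithmetic, keep 512 threads resident, which is why small kernels hide memory latency so much better.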

Keeping in mind the advantages and limitations of CUDA as discussed in Section

This gives us four input tensors per voxel: the diffusion tensor and the three derivatives of its inverse. With six unique elements per tensor and four bytes per value, this gives us a memory requirement of

The output of the algorithm consists of a number of fibers, each consisting of a list of 3D coordinates. In postprocessing steps, these fibers can be filtered through the target region(s) and sorted according to their Connectivity Measure value, but these steps are not part of the CUDA implementation discussed herein. Since CUDA does not support dynamic memory allocation, we hard-coded a limit on the number of iteration steps for each fiber, and statically allocated an output array per fiber of corresponding size. We note here that the choice of this limit may impact performance: for high limits, more fibers will terminate prematurely (due to leaving the volume), leading to lower occupancy in later stages of the algorithm, while for lower limits, the start-up overhead of the CUDA algorithm may become relevant.
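For concreteness, the input footprint described above works out as follows. The arithmetic is simply the four-tensors-times-six-unique-values-times-four-bytes count from the text, applied here to the 1024 × 64 × 64 synthetic volume used in the later benchmarks.

```python
# Per-voxel input: four symmetric tensors (the diffusion tensor and the
# three derivatives of its inverse), six unique elements each, stored as
# 4-byte floats.
tensors_per_voxel = 4
unique_elements = 6        # symmetric 3x3 tensor
bytes_per_value = 4        # 32-bit float

bytes_per_voxel = tensors_per_voxel * unique_elements * bytes_per_value  # 96

# Total for the 1024 x 64 x 64 synthetic benchmark volume.
voxels = 1024 * 64 * 64
total_mib = bytes_per_voxel * voxels / 2**20
print(f"{bytes_per_voxel} B/voxel, {total_mib:.0f} MiB total")
```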

We intuitively recognize two different approaches for parallelization of the geodesic ray-tracing algorithm:

We therefore use the per-fiber parallelization approach, in which each thread computes a single fiber. The main advantage of this approach is that it does not require any spatial coherence between the fibers to efficiently utilize the parallelism offered by the GPU. As long as we have a high number of seed points and initial directions, all scalar processors of the GPU will be able to run their threads independently of the other threads, thus minimizing the need for elaborate synchronization between threads, and guaranteeing a stable computational throughput for all active processors. The main disadvantage of this approach is that it requires a higher memory throughput than the per-region approach, as it does not allow us to avoid redundant memory reads.

The kernel that executes the fiber tracking process executes the following steps, as summarized in Figure

Fetch the initial position and direction of the fiber.

For the current fiber position, fetch the diffusion tensor and the derivatives of its inverse, using trilinear interpolation.

Using these four tensors, compute the Christoffel symbols, as defined in (

Using the symbols and the current direction, compute the next position and direction of the fiber, using the ODEs of (

Write the new position to the global memory.

Stop if the fiber has left the volume.

As noted in the previous section, the main disadvantage of the per-fiber parallelization approach is that it does not allow us to avoid redundant reads. We can partially solve this problem by storing the input images in

A second, more important advantage of the texture memory is the built-in texture filtering functionality. When reading texture data from a position located between grid points, CUDA will apply either nearest-neighbor or linear interpolation using dedicated texture filtering hardware. When storing the input data in the global memory, all interpolation must be done

While we expect the use of texture-filtering interpolation to be beneficial for the running time of our algorithm (which we will demonstrate in the next section), we do identify one possible trade-off. One property of in-kernel interpolation is that the values of the eight surrounding voxels can be stored in either the local memory of the thread, or in the shared memory of the multiprocessor. Doing so increases the size of the threads, but also allows them to reuse some of this data, thus reducing the memory throughput requirements. Using texture-filtering interpolation, we cannot store the surrounding voxel values, so we need to read them again in every integration step. Thus, in-kernel interpolation may require a significantly lower memory throughput than texture-filtering interpolation, especially for small step sizes (in which case it takes multiple steps for a fiber to cross a cell). We analyze this trade-off through experimentation in the next section.
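For reference, the in-kernel variant of the interpolation amounts to the following trilinear weighting over the eight surrounding voxels, which the texture-filtering hardware otherwise performs for free. This is a NumPy sketch of the standard scheme, not the CUDA code itself.

```python
import numpy as np

def trilinear(field, p):
    """Trilinear interpolation of a voxel field at continuous position p.

    `field` is indexed by integer voxel coordinates; `p` is a position
    in voxel coordinates. The eight surrounding voxels are weighted by
    the fractional offsets along each axis.
    """
    i0 = np.floor(p).astype(int)
    f = p - i0                                    # fractional offsets
    acc = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                acc = acc + w * field[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return acc

# A linear field is reproduced exactly by trilinear interpolation.
X, Y, Z = np.indices((4, 4, 4)).astype(float)
field = X + 2 * Y + 3 * Z
val = trilinear(field, np.array([0.5, 0.25, 0.75]))
```

In the CUDA kernel this weighting must be applied to all six unique elements of each of the four input tensors, which is exactly the per-step work the texture units take over.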

For the experiments presented in this section, we used a synthetic data set of 1024 × 64 × 64 voxels with 2048 predefined seed points. The seed points were distributed randomly to mimic the low spatial coherence of the fibers, and their initial directions were chosen in such a way that no thread would terminate prematurely due to its fiber leaving the volume (i.e., all fibers stay within the volume for the predefined maximum number of integration steps, running parallel to the long edge of the volume). While neither the shape and content of the volume nor the fact that no fibers leave the volume prematurely can be considered realistic, this does allow us to benchmark the algorithm under maximum load.

All experiments were conducted on an NVIDIA GTX 260, a mid-range model with 24 multiprocessors (for a total of 192 scalar processors) and 1 GB of device memory [

It should be noted that we only consider the actual fiber tracking process for our benchmarks. The data preparation stage (which includes preprocessing and inverting the tensors, and computing the derivatives of the inverse tensors) has also been implemented in CUDA, but is considered outside of the scope of this paper due to its low complexity, trivial parallelization, and low running times compared to the tracking stage. On the GTX 260, the data preparation stage requires roughly 50 milliseconds per million voxels [

In Section

Running time for varying step size for in-kernel interpolation (blue) and texture-filtering interpolation (red).

The performance of CUDA programs is usually limited either by the maximum

When computing 2048 fibers, each executing 4096 integration steps, the algorithm takes approximately 165 milliseconds to complete. For each integration step, a single thread needs to read 8 voxels

To compute the computation throughput in FLOPS (floating-point operations per second), we first decompile the CUDA program using the decuda software. By inspecting the resulting code, we learn that each integration step uses 256 floating-point instructions. Note that this does not include any instructions needed for interpolation, since we are using the dedicated texture filtering hardware for this purpose. The algorithm performs a total of 256

Since neither the memory throughput nor the computational throughput is close to the specified maximum, we need to determine the limiting factor in a different way. We first rule out the computational throughput as the limiting factor by replacing the current second-order Runge-Kutta (RK2) ODE solver by a simple Euler solver. This does not reduce the amount of data transferred between device memory and multiprocessors, but it does reduce the number of floating-point operations per integration step from 256 to 195. This, however, does not significantly impact the running time (165.133 ms for RK2 versus 165.347 ms for Euler), indicating that the computational throughput (i.e., processor load) is not a limiting factor in our case.
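A back-of-the-envelope check of the effective memory bandwidth, using the figures quoted above (2048 fibers, 4096 integration steps each, 8 voxels of 96 bytes read per step, roughly 165 ms of runtime), lands in the same range as the measured values. The sketch below only reproduces the order of magnitude; small differences from the table come from rounding.

```python
# Effective read bandwidth implied by the benchmark figures in the text.
fibers, steps = 2048, 4096
bytes_per_step = 8 * 96        # 8 voxels x 96 B per voxel = 768 B
runtime_s = 0.165133           # measured running time

total_bytes = fibers * steps * bytes_per_step
bandwidth_gb_s = total_bytes / runtime_s / 1e9
print(f"~{bandwidth_gb_s:.1f} GB/s effective read bandwidth")
```

This is well below the GTX 260's theoretical peak of 111.9 GB/s, which motivates the scattered-access discussion later in the paper.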

To prove that the memory throughput is indeed the limiting factor, we need to reduce the amount of data read in each integration step, without changing the number of instructions. We do so by using the knowledge that the four input tensors of our synthetic data set share a number of duplicate values. By removing some of the redundant reads, we can reduce the memory requirements of each thread, without compromising the mathematical correctness of the algorithm. The results of this experiment are listed in Table

Effects on the running time and total memory throughput of changing the amount of data read per voxel.

Data/Step (Byte) | Time (ms) | Bandwidth (GB/s)
---|---|---
768 | 165.133 | 38.8
640 | 134.124 | 40.0
512 | 103.143 | 41.6
384 | 90.554 | 35.6

To evaluate the performance of our CUDA implementation, we compare its running times to those of a C++ implementation of the same algorithm running on a modern CPU. In order to fully explore the performance gain, we use four different CUDA-supported GPUs: the Quadro FX 770M, the GeForce 8800 GT, the GeForce GTX 260, and the GeForce GTX 470. The important specifications of these GPUs are shown in Table

Specifications of the GPUs in the benchmark test of Section

GPU | Device memory (MB) | Memory bandwidth (GB/s) | Number of scalar processors
---|---|---|---
Quadro FX 770M | 512 | 25.6 | 32
GeForce 8800 GT | 512 | 57.6 | 112
GeForce GTX 260 | 896 | 111.9 | 192
GeForce GTX 470 | 1280 | 133.9 | 448

The CPU running the C++ implementation is an Intel Core i5 750, which has four cores, with a clock speed of 2.67 GHz. In terms of both date of release and price segment, the i5 750 (released in fall 2009) is closest to the GTX 470 (spring 2010); at the time of writing, both are considered mid- to high-range consumer products of the previous generation.

For this benchmark, we use a brain DTI image with dimensions of

Our C++ implementation of the algorithm features support for multiple threads of execution (multithreading), which allows it to utilize the parallelism offered by the CPU's four cores. Let

Running times (in seconds) for the multithreaded CPU implementation of our algorithm.

Fibers computed using the CUDA ray-tracing algorithm in real data. (a) All fibers computed from a seeding region in part of the corpus callosum. (b) In order to properly analyze the computed fibers, a postprocessing step is needed. In this case, fibers were filtered through two target regions of interest and ranked and colored according to their Connectivity Measure value (see Section

We performed the same benchmarking experiment on the four CUDA-enabled GPUs, and we compared the running times to those of the best CPU configuration. The results for this test are shown in Table

Benchmark results for GPU and CPU implementation of the geodesic ray-tracing algorithm. For each configuration, we show the running time (T) in seconds, and the Speed-Up factor (SU) relative to the best CPU timing, see Figure

Number of seeds | CPU T (s) | FX 770M T (s) | SU | 8800 GT T (s) | SU | GTX 260 T (s) | SU | GTX 470 T (s) | SU
---|---|---|---|---|---|---|---|---|---
1024 | 1.403 | 0.761 | 1.8 | 0.273 | 5.1 | 0.225 | 6.2 | 0.087 | 16.1
2048 | 2.788 | 1.388 | 2.0 | 0.448 | 6.2 | 0.244 | 11.4 | 0.093 | 30.0
3072 | 4.185 | 1.995 | 2.1 | 0.760 | 5.5 | 0.256 | 16.3 | 0.107 | 39.1
4096 | 5.571 | 2.772 | 2.0 | 0.900 | 6.2 | 0.301 | 18.5 | 0.139 | 40.0
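The speed-up factors in the benchmark are simply the CPU running time divided by the GPU running time. As a quick check, the 4096-seed row can be reproduced from the raw timings:

```python
# Timings (in seconds) for 4096 seed points, taken from the benchmark.
cpu_time = 5.571
gpu_times = {"FX 770M": 2.772, "8800 GT": 0.900,
             "GTX 260": 0.301, "GTX 470": 0.139}

# Speed-up factor = CPU time / GPU time for each device.
for gpu, t in gpu_times.items():
    print(f"{gpu}: {cpu_time / t:.1f}x")
```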

In previous sections, we described a CUDA implementation of a geodesic fiber tracking method. This CUDA program uses the knowledge that fibers can be computed independently of one another to parallelize these computations, thus reducing running times by a factor of up to 40 compared to a multithreaded CPU implementation of the same algorithm.

While the algorithm does allow for meaningful parallelization, we do note two problems that make full exploitation of the parallelism offered by the GPU challenging.

The algorithm requires a large amount of internal storage; between the four input tensors, the Christoffel symbols, and the position and direction of the fiber, even the most efficient version of our implementation still required 43 registers per thread, while typical values for CUDA algorithms are between 8 and 32.

More importantly, the algorithm requires a large amount of data transfer, and as a result, the memory throughput becomes its limiting factor. This problem is far from trivial to solve, and while some options do exist, they all have their own downsides.

Reducing the number of bits per input value would lower the required memory throughput, but comes at the cost of reduced accuracy.

Allowing fibers to reuse the input data used for interpolation greatly increases the size of the kernel and does not work with texture-filtering interpolation, the use of which has been shown to be very beneficial in our case.

Sharing the input data between threads in one multiprocessor requires some sort of spatial coherence between the fibers, which is extremely difficult to guarantee.

As noted, the memory throughput between the device memory and the multiprocessors is the limiting factor for our performance. However, as noted in Section

Most Random-Access Memory (RAM) configurations incur a certain overhead when subsequent reads access different parts of a data range. Since the low spatial locality of the seed points leads to such a scattered access pattern, this overhead may explain our relatively low memory throughput.

According to CUDA documentation [

The throughput of the texture fetching and filtering units may become a limiting factor when large numbers of voxels are involved. The documentation of the GTX 260 states that it should be able to process 36.9 billion texels per second [

We expect the first point to be the main contributing factor, though we have not conducted experiments to either prove or disprove this hypothesis.

The GPU-based acceleration of the geodesic ray-tracing algorithm for DTI is a useful technique, but its implementation poses several challenges. Part of the problem in accelerating numerical integration algorithms such as the one under discussion in the paper is that it almost inevitably introduces unpredictable memory access patterns, while existing CUDA algorithms generally use access patterns that are more regular and predictable, and thus easier to optimize. This is a fundamental problem without an easy solution, and one that is not restricted to CUDA-enabled GPUs, but applies to other parallel platforms as well. Still, despite our implementation achieving suboptimal performance, we do believe that its significant speed-up factor of up to 40 times, coupled with the low cost and high availability of CUDA-enabled hardware, makes it a practical solution to the computational complexity of the geodesic ray-tracing algorithm for fiber tracking, and a good starting point for future acceleration of similar algorithms.

In this paper, we discussed the GPU-based acceleration of a geodesic ray-tracing algorithm for fiber tracking for DTI. One of the advantages of this algorithm is its ability to find multivalued solutions, that is, multiple possible connections between regions of the white matter. However, the high computational complexity of the algorithm, combined with the fact that we need to compute a large number of trajectories if we want to find the right connection, makes it slow compared to similar algorithms. To overcome this problem, we accelerated the algorithm by implementing a highly parallel version on a GPU, using NVIDIA's CUDA platform. We showed that despite the large kernel size and high memory requirements of this GPU implementation, we were able to speed up the algorithm by a factor of up to 40. This significant reduction in running time using cheap and widely available hardware greatly increases the applicability of the algorithm.

Aside from further optimization of our current GPU implementation, with the aim of further reducing its running time, we identify two possible extensions of the work presented in this paper.

At the moment, the computed fibers are downloaded back to the CPU, where they are postprocessed and subsequently visualized. A more direct approach would be to directly visualize the fibers computed by our algorithm, using the available rendering features of the GPU. If we can also implement the necessary postprocessing steps in CUDA, we can significantly reduce the amount of memory that needs to be transferred between the CPU and the GPU, thus accelerating the complete fiber tracking pipeline.

The Region-to-Region Connectivity method is a valuable application of the geodesic ray-tracing algorithm. This method introduces three extra steps: (1) computing a suitable Connectivity Measure, either while tracking the fibers or during a post-processing step, (2) filtering the computed fibers through a target region, and (3) sorting the remaining fibers according to their Connectivity Measure, showing only those fibers with high CM values. These three steps have currently been implemented on a CPU, but implementing them on a GPU can reduce the overall processing time of the fiber tracking pipeline, as noted above.
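A minimal sketch of steps (2) and (3), filtering the fibers through a target region and ranking them by Connectivity Measure, might look as follows. The fiber representation (a list of 3D points), the `in_target` predicate, and the `keep` parameter are all assumptions of this sketch, not part of the implementation described above.

```python
def region_to_region(fibers, cm_values, in_target, keep=10):
    """Filter fibers through a target region, then rank by CM value.

    fibers    : list of fibers, each a list of (x, y, z) points.
    cm_values : Connectivity Measure per fiber.
    in_target : hypothetical predicate testing membership of a point
                in the user-defined target region.
    """
    # Step 2: keep only fibers that intersect the target region.
    hits = [(cm, f) for f, cm in zip(fibers, cm_values)
            if any(in_target(p) for p in f)]
    # Step 3: sort by Connectivity Measure, strongest connections first.
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [f for _, f in hits[:keep]]

# Tiny synthetic example: only the second fiber reaches the region x > 4.
fibers = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
          [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]]
kept = region_to_region(fibers, [0.2, 0.9], lambda p: p[0] > 4.0)
```

Moving exactly this filter-and-sort logic into CUDA kernels would avoid transferring rejected fibers back to the CPU at all.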

The DTI model is limited in its accuracy by its inability to model crossing fibers. High Angular Resolution Diffusion Imaging (HARDI) aims to overcome this limitation by measuring and modeling the diffusion in more directions [

We note that, while these extensions would be valuable additions, the current CUDA implementation already provides a practical and scalable solution for the acceleration of geodesic ray-tracing.

This project was supported by NWO project number 639.021.407.