Parametric statistical methods, such as
Functional magnetic resonance imaging (fMRI) is used in neuroscience and in clinical practice for investigating brain activity patterns and for planning brain surgery. Activity is detected by fitting an activity model to each observed fMRI voxel time series and then testing whether the null hypothesis of no activity can be rejected based on the model parameters. Specifically, this test is performed by comparing a test statistic calculated from the model parameters to a threshold. To control the randomness due to noise in this test procedure, it is desirable to find the statistical significance associated with the detection threshold, that is, how likely it is that a voxel is declared active by chance. When the statistical distribution of the data is known
In contrast to the parametric approach, a nonparametric approach does not assume the statistical properties of the input data to be known [
Graphics processing units (GPUs) have seen tremendous development during the last decade and have been applied in a variety of fields to achieve significant speedups compared to optimized CPU implementations. The main difference between the GPU and the CPU is that the GPU performs its calculations in parallel, while the CPU normally performs them serially. In the fields of neuroimaging and neuroscience the use of the GPU is quite new. As single-subject fMRI analysis is normally done separately for each time series in the fMRI data, it is perfectly suited for parallel implementation. In our recent work [
In this work, it is shown how nonparametric statistics can be made practical for single-subject fMRI analysis by using the parallel processing power of the GPU. The idea of using the GPU for random permutation tests is not new; it has recently been done in biostatistics [
One subset of nonparametric tests is permutation tests, where the statistical analysis is done for all possible permutations of the data. Complete permutation tests are, however, not feasible if the number of possible permutations is very large. For a time series with 80 samples, there exist
Flowchart containing the main building blocks for nonparametric analysis of single-subject fMRI data.
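To make the size of this permutation space concrete, the number of distinct orderings of an 80-sample series is 80!, a 119-digit number:

```python
import math

# Number of distinct orderings of an 80-sample time series.
n_perms = math.factorial(80)

# The count has 119 digits -- far too many to enumerate exhaustively,
# which is why a random subset of permutations is drawn instead.
print(len(str(n_perms)))
```

Enumerating even a vanishingly small fraction of these orderings is impossible, which motivates the random (Monte Carlo) permutation tests used in the rest of the paper.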
By applying a threshold to the activity map, each voxel can be classified as active or inactive. The threshold is normally selected according to a level of significance; one may, for example, want only voxels that are significant at the 95% level to be considered active. If a statistical test is repeated and a familywise error rate
There are three problems with Bonferroni correction in fMRI. First, the test statistic is assumed, under the null hypothesis, to follow a certain distribution, such as Student’s
The nonparametric approach can be used to solve the problem of multiple testing as well. This is done by estimating the null distribution of the
As fMRI time series are temporally correlated [
Several approaches have been proposed for the random resampling, the most common being whitening transforms [
To accurately estimate AR parameters from a small number of time points (80 in our case) is quite difficult. To improve the estimates, a spatial Gaussian low-pass filter (8 mm FWHM) is therefore applied to the estimated AR parameters [
Normalized convolution [
To investigate whether the time series really are white noise after the whitening, several tests can be applied. One example is the Durbin-Watson test, which has previously been used to test whether the residuals from the GLM contain autocorrelation [
Since the spatial correlation should be maintained, but not the temporal correlation, the same permutation is applied to all the time series [
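A small sketch of this resampling scheme on toy data: one random permutation of the time indices is drawn and applied identically to every voxel, which destroys temporal structure while leaving inter-voxel (spatial) relationships intact.

```python
import random

# Toy data: 4 "voxels", each an 8-sample time series; rows differ only by a
# constant offset, a stand-in for spatially correlated neighbouring voxels.
n_time = 8
voxels = [[float(t + v) for t in range(n_time)] for v in range(4)]

# Draw ONE random permutation of the time indices ...
perm = list(range(n_time))
random.shuffle(perm)

# ... and apply the SAME index permutation to every voxel time series,
# scrambling temporal order while preserving spatial correlation.
permuted = [[series[i] for i in perm] for series in voxels]

# Each voxel is reordered identically, so the inter-voxel offsets survive.
assert all(sorted(p) == sorted(s) for p, s in zip(permuted, voxels))
```

Drawing an independent permutation per voxel would instead break the spatial correlation structure that the null data is supposed to retain.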
The general linear model (GLM) is the most widely used approach for statistical analysis of fMRI data [
Prior to the GLM the time series were whitened by using the same AR(1) model for all the voxels [
One statistical approach for fMRI analysis that provides more adaptivity to the data is canonical correlation analysis (CCA) [
The temporal basis functions for CCA are the same as for the GLM. The spatial basis functions can, for example, be neighbouring pixels [
The four smoothing filters that are used for 2D CCA, one small isotropic separable filter and three anisotropic nonseparable filters. For visualization purposes, these filters are interpolated to a subpixel grid.
Compared to other approaches that adaptively include spatial information [
One disadvantage with CCA is that it is difficult to calculate the threshold for a certain significance level, as the distribution of the canonical correlation coefficients is rather complicated. If
Another problem is that restricted CCA (RCCA) [
As a 2D version of CCA already had been implemented [
The smoothing of the fMRI volumes has to be applied in each permutation. If the data is smoothed prior to the whitening transform, the estimated AR parameters will change with the amount of smoothing applied, since the temporal correlations are altered by the smoothing. For our implementation of 2D CCA, four different smoothing filters are applied. If the smoothing is done prior to the permutations, four time series have to be permuted for each voxel, and these time series will have different AR parameters. The smoothing will also change the null distribution of each voxel. This is incorrect, as the surrogate null data that is created should always have the same properties, regardless of the amount of smoothing that is used for the analysis. If the data is smoothed after the whitening transform, but before the permutation and the inverse whitening transform, the time series that are given by simulating the AR model are incorrect, since the properties of the noise are altered. The only solution is to apply the smoothing
Similarly, if the activity map is calculated as how important each voxel is for a classifier [
The complete algorithm can be summarized as follows. The reason why the detrending is done separately, rather than including the detrending basis functions in the design matrix, is that the detrending has to be done separately for the CCA approach.
The whitening in each permutation is only performed to be able to compare the corrected
Preprocess the fMRI data, that is, apply slice timing correction, motion correction, smoothing, and cubic detrending. To save time, the statistical analysis is only performed for the brain voxels. A simple thresholding technique is used for the segmentation.
Whiten the detrended time series (GLM only).
Apply the statistical analysis to the preprocessed fMRI data and save the test values. These are the original test values
Apply cubic detrending to the motion compensated time series.
Remove the best linear fit between the detrended time series and the temporal basis functions in the design matrix, by ordinary least squares, to create residual data (as the
For each permutation,
apply a random permutation to the whitened time series,
generate new fMRI time series by an inverse whitening transform, that is, by simulating an AR model in each voxel with the permuted whitened time series as innovations,
smooth all the volumes that were generated by the inverse whitening transform,
apply cubic detrending to the smoothed time series,
whiten the detrended time series (GLM only),
apply the statistical analysis,
find the maximum test value and save it.
Sort the maximum test values.
The threshold for a desired corrected
The corrected
A comparison of the flowcharts for a parametric analysis and a nonparametric analysis is given in Figure
(a) Flowchart for conventional parametric analysis of fMRI data. (b) Flowchart for nonparametric analysis of fMRI data. In each permutation a new null dataset is generated and analysed.
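The permutation loop described above can be condensed into a runnable sketch. The following toy example uses Gaussian white noise as data and the absolute correlation with a hypothetical boxcar paradigm as test statistic; whitening, smoothing, and detrending are deliberately omitted, so this illustrates only the permutation loop, the maximum-statistic bookkeeping, and the threshold selection, not the paper's full pipeline.

```python
import math
import random

random.seed(1)

n_time, n_vox, n_perm = 40, 50, 200
# Hypothetical boxcar "paradigm" regressor (on/off blocks of 10 samples).
paradigm = [1.0 if (t // 10) % 2 else -1.0 for t in range(n_time)]
# Null toy data: independent Gaussian noise in every voxel.
data = [[random.gauss(0.0, 1.0) for _ in range(n_time)] for _ in range(n_vox)]

def corr(x, y):
    # Pearson correlation between two equal-length sequences.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def max_stat(vols):
    # Maximum test value over all "brain voxels".
    return max(abs(corr(series, paradigm)) for series in vols)

t_orig = max_stat(data)  # original test value

# Permutation loop: the same time-index permutation for every voxel,
# then the maximum test value over the brain is saved.
maxima = []
idx = list(range(n_time))
for _ in range(n_perm):
    random.shuffle(idx)
    permuted = [[series[i] for i in idx] for series in data]
    maxima.append(max_stat(permuted))

# Corrected threshold: the (1 - alpha) quantile of the sorted maxima.
maxima.sort()
threshold = maxima[int(0.95 * n_perm)]
# Corrected p-value of the original maximum test value.
p_corr = sum(m >= t_orig for m in maxima) / n_perm
```

The sorted maxima play the role of the estimated null distribution of the maximum test statistic; inserting the real preprocessing, whitening, and smoothing steps into the loop body recovers the full algorithm.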
The number of permutations that are required depends on the desired
Relative standard deviation of the desired
Number of permutations  0.1  0.05  0.01 

1000  9.48%  13.78%  31.46% 
5 000  4.24%  6.16%  14.07% 
10 000  3%  4.35%  9.95% 
50 000  1.34%  1.94%  4.45% 
100 000  0.95%  1.37%  3.14% 
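The tabulated values follow from a binomial model of the permutation count: if the number of permutations exceeding the threshold is Binomial(N, p), the relative standard deviation of the estimated p-value is sqrt((1 − p)/(Np)). A quick check against the first row of the table:

```python
import math

# Relative standard deviation of an estimated p-value p from N permutations,
# assuming the number of exceedances is Binomial(N, p).
def rsd(n_perm, p):
    return math.sqrt((1.0 - p) / (n_perm * p))

# Agrees with the first row of the table (tabulated values are rounded).
assert abs(100.0 * rsd(1000, 0.10) - 9.48) < 0.01
assert abs(100.0 * rsd(1000, 0.05) - 13.78) < 0.01
assert abs(100.0 * rsd(1000, 0.01) - 31.46) < 0.01
```

The formula makes the trade-off explicit: small p-values and few permutations give noisy estimates, which is why tens of thousands of permutations are needed for p = 0.01.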
The random permutation test was implemented with the CUDA (Compute Unified Device Architecture) programming language by Nvidia [
Our CUDA implementation was compared with a standard
Before the data is permuted, an AR model is first estimated for each time series, as previously described. Solving the equation system given by the Yule-Walker equations requires a matrix inverse of size
For the estimation of AR parameters, the whitening transform, and the inverse whitening transform the shared memory is used to store the
The permutation step is done by using randomized indices. A permutation matrix of size
is the index vector that contains the random time indices. The inverse whitening transform and the permutation step are thus performed at the same time. The help functions
calculate the linear index for the 3D and the 4D case.
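A minimal sketch of the combined permutation and inverse whitening step, assuming a single voxel, an AR(1) model with a hypothetical coefficient, and plain Python in place of the CUDA kernel and its indexing helpers:

```python
import random

random.seed(0)

# One whitened voxel time series (toy data) and a hypothetical AR(1)
# coefficient for this voxel.
n_time = 10
whitened = [random.gauss(0.0, 1.0) for _ in range(n_time)]
a1 = 0.3

# Random time indices, shared by all voxels (one row of the permutation matrix).
perm = list(range(n_time))
random.shuffle(perm)

# Inverse whitening and permutation in one pass: simulate the AR(1) model with
# the permuted whitened samples as innovations, as described in the text
# (shared memory and linear-index helpers of the kernel are omitted here).
new_series = [0.0] * n_time
prev = 0.0
for t in range(n_time):
    prev = a1 * prev + whitened[perm[t]]
    new_series[t] = prev
```

Because the AR recursion starts from zero, the first new sample equals the first permuted innovation; the GPU version performs the same recursion per voxel, with the AR order and coefficients taken from the voxel-wise estimates.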
For this kernel, and for most of the other kernels, each thread block contains a total of 512 threads (32 along
To find the maximum test value in each permutation, one fMRI slice (64 × 64 pixels) is first loaded into shared memory. The maximum value of this slice is then found by comparing two values at a time. The number of values is thus first reduced from 4096 to 2048, then to 1024, and after 12 reductions to the single maximum test value. The maximum values of the 22 slices are then compared. After each permutation the maximum test value is copied to host memory.
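The pairwise reduction can be sketched in plain Python (the shared-memory layout and thread synchronization of the CUDA kernel are omitted):

```python
import random

random.seed(2)

# One 64 x 64 slice of test values (4096 toy values).
values = [random.random() for _ in range(64 * 64)]

# Pairwise max reduction: halve the working set each step by comparing
# element i with element i + half, exactly as in the shared-memory kernel.
work = list(values)
steps = 0
while len(work) > 1:
    half = len(work) // 2
    work = [max(work[i], work[i + half]) for i in range(half)]
    steps += 1

# 4096 -> 2048 -> ... -> 1 takes 12 reduction steps.
assert steps == 12
assert work[0] == max(values)
```

On the GPU each reduction step is performed by half of the remaining threads in parallel, so the slice maximum is found in 12 synchronized steps rather than 4095 sequential comparisons.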
In order to calculate the
As our computer contains three graphics cards, a multi-GPU implementation of the analysis was also made, such that each GPU does one-third of the permutations. Each GPU first preprocesses the fMRI data; GPU 1 then uses the first part of the permutation matrix, GPU 2 the middle part, and GPU 3 the last part. The processing time thereby decreases linearly with the number of GPUs. A short demo of the multi-GPU implementation can be found at
In this section we will present the processing times for the different implementations, compare activity maps from GLM and CCA at the same significance level, and compare estimated thresholds from Bonferroni correction, Gaussian random field theory, and random permutation tests.
Four single-subject datasets have been used to test our algorithms; the test subject was a 50-year-old healthy male. The data was collected with a 1.5 T Philips Achieva MR scanner. The following settings were used: repetition time 2 s, echo time 40 ms, flip angle 90 degrees, and isotropic voxel size 3.75 mm. A field of view of 240 mm thereby resulted in slices with 64 × 64 pixels, and a total of 22 slices were collected every other second. The experiments were 160 s long, resulting in 80 volumes to be processed. The datasets contain about 20 000 within-brain voxels.
For the
For the
For the
The processing times for the random permutation tests, for the different implementations, are given in Tables
Processing times for random permutation tests with the GLM for the different implementations.
Number of permutations with GLM  C  OpenMP  CUDA, 1 × GTX 480  CUDA, 3 × GTX 480 

1000  25 min  3.5 min  25.2 s  8.4 s 
5 000  2 h 5 min  17.5 min  1 min 42 s  33.9 s 
10 000  4 h 10 min  35 min  3 min 18 s  65.8 s 
50 000  20 h 50 min  2 h 55 min  16 min 30 s  5 min 30 s 
100 000  1 day 17 h 40 min  5 h 50 min  33 min  11 min 
Processing times for random permutation tests with 2D CCA for the different implementations.
Number of permutations with 2D CCA  C  OpenMP  CUDA, 1 × GTX 480  CUDA, 3 × GTX 480 

1000  1 h 40 min  14 min 50 s  22.2 s  7.4 s 
5 000  8 h 20 min  1 h 14 min  1 min 24 s  28 s 
10 000  16 h 37 min  2 h 28 min  2 min 42 s  54 s 
50 000  3 days 11 h  12 h 22 min  13 min 30 s  4 min 30 s 
100 000  6 days 22 h  24 h 43 min  27 min  9 min 
The reason why the processing time for CCA is much longer than for the GLM in the CPU implementations is that the 2D version of CCA uses one separable filter and three nonseparable filters for the smoothing, while the GLM uses only one separable filter. For the GPU implementation the 2D smoothing can be done extremely fast by using the shared memory.
To verify that the whitening procedure prior to the permutations works correctly, the Ljung-Box test was applied to each residual time series. The Ljung-Box test was applied to the four datasets after detrending, BOLD removal, and whitening with AR models of different orders. The test was applied for 1–10 time lags (i.e., 10 tests), and the mean number of non-white voxels was saved. A voxelwise threshold of
Mean number of voxels classified as non-white by the Ljung-Box test (1–10 time lags were tested and the mean number of non-white voxels over the 10 tests was saved). Prior to the Ljung-Box test, the estimated autocorrelations were spatially smoothed. The number of non-white voxels for Gaussian white noise is always zero.
Mean number of voxels classified as non-white by the Ljung-Box test (1–10 time lags were tested and the mean number of non-white voxels over the 10 tests was saved). No spatial smoothing was applied to the estimated autocorrelations prior to the Ljung-Box test. The number of non-white voxels for Gaussian white noise is included as a reference (no whitening was applied to the noise).
If no smoothing is applied to the estimated autocorrelations prior to the Ljung-Box test, the test statistic cannot be trusted, as the standard deviation of the estimated autocorrelations is too high. The reason why the number of non-white voxels increases when no smoothing is applied to the autocorrelations and the order of the AR model increases is that the critical threshold of the Ljung-Box test decreases as the order of the AR model increases.
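For reference, the Ljung-Box statistic used in these tests can be sketched for a single series; the toy series and lag count below are illustrative, and the degrees-of-freedom adjustment for the fitted AR order is only noted in a comment.

```python
import random

random.seed(3)

def ljung_box_q(x, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k), k = 1..max_lag.
    Under the white-noise null, Q is approximately chi-square distributed with
    max_lag minus the number of fitted AR parameters degrees of freedom, which
    is why the critical threshold decreases as the AR order increases."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    q = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        r_k = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q

# White noise should give a small Q ...
white = [random.gauss(0.0, 1.0) for _ in range(80)]
# ... while a strongly autocorrelated series (a moving sum) gives a large one.
smooth = [sum(white[max(0, t - 3):t + 1]) for t in range(80)]
assert ljung_box_q(smooth, 10) > ljung_box_q(white, 10)
```

Comparing Q against the appropriate chi-square quantile then classifies each residual time series as white or non-white, as in the tables above.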
From the results in Figures
For all the datasets an individual AR(4) whitening was therefore used prior to the permutations and in each permutation to generate new null data. For the
To verify that our random permutation test works correctly, all the preprocessing steps were removed and Gaussian white noise was used as data. The stimulus paradigm convolved with the HRF and its temporal derivative were used as regressors, and a
A comparison of familywise error rates for Gaussian white noise for three different approaches to calculate an activity threshold, corrected for multiple testing.
With the random permutation test it is possible to calculate corrected
For RCCA there is no theoretical distribution from which to calculate a threshold, and therefore the corrected thresholds for the restricted canonical correlation coefficients are also presented; 10 000 permutations were used to calculate each threshold. Figure
Canonical correlation thresholds, for corrected
As the null distribution of the maximum
Figure
A comparison of
Figure
The estimated null distribution of the maximum
As a final result, distributions of the corrected
The distribution of the corrected
The distribution of the corrected
If 1000 permutations are used (and it is assumed that the estimated distribution is correct), the estimated
These plots would have taken a total of about 17.4 and 174 days to generate with a standard
With the help of fast random permutation tests it is possible to easily and objectively evaluate activity maps from any test statistic, by using the same significance level. As an example of this we compare activity maps from the GLM and CCA. It is also possible to investigate how a change in the preprocessing (e.g., the amount of smoothing or the whitening applied) affects the distribution of the maximum test statistic. The search for the best test statistic, the one that gives the best separation between active and inactive voxels, can now begin. It is no longer necessary to use simple test statistics and hope that the data is normally distributed and independent.
As can be seen in the tables, a lot of time is saved by using the GPU. Most of the time is saved in the smoothing step. The tables clearly show that the GPU, or an advanced PC cluster, is a must for random permutation tests that include smoothing. To do 100 000 permutations with CCA takes about 7 days with the
It should be noted that these processing times are for 80 volumes and 20 000 brain voxels, but it is not uncommon that an fMRI dataset contains 150 volumes and 30 000 brain voxels, which triples the processing times.
The main problem with a long processing time is the software development. In order to test and verify that a program works correctly, the program has to be launched a large number of times. During the development of the routines and the writing of the paper we ran the complete analysis, with 1000–100 000 permutations, at least 3000 times. For the GLM this means that at least 6000 hours of processing time were saved, compared to the
With the power of the GPU it is even possible to look at the distributions of the corrected thresholds that otherwise could take as much as 6 months of processing time to estimate.
The processing time for 10 000 permutations with GLM and smoothing is about 3.5 minutes with one GPU. This is perhaps too long for clinical applications, but we believe that it is fast enough for researchers to use it in their daily work.
With the help of the GPU it is finally possible to compare activity maps from GLM and CCA at the same significance level. Even if CCA has a superior detection performance compared to the GLM, its use has been limited. One major reason for this is that it is hard to set a (corrected) threshold for CCA.
The presented activity maps show that the CCA approach in general, due to its spatial adaptivity, finds a higher number of significantly active voxels than the GLM approach. With 2D smoothing CCA finds some significantly active voxels in the left motor cortex and in the left somatosensory cortex that are not detected by the GLM. With 3D smoothing CCA finds some significantly active voxels in the left somatosensory cortex that are not detected by the GLM. We have thereby confirmed previous results that fMRI analysis by CCA can result in a higher number of significantly active voxels [
It might seem strange that the corrected canonical correlation thresholds do not decrease as rapidly as the corrected
The corrected thresholds are lower for 3D CCA than for 2D CCA. This is explained by the fact that the 2D version is adaptive in both scale and orientation, and it can thereby find higher correlations than the 3D version that only is adaptive in scale. With more advanced GPUs, the original 3D CCA approach, with 7 filters, can be used to obtain more spatial adaptivity in 3D.
The comparison between the thresholds from Bonferroni correction, Gaussian random field theory, and the random permutation test shows some interesting results. The thresholds from the random permutation test are the highest. For the GLM approach to be valid, the data is assumed to be normally distributed as well as independent. For the multiple testing problem, the parametric approaches also assume a common null distribution for each voxel while the permutation approach does not [
As a
The distribution of the
It is commonly assumed that the noise in MRI is normally distributed, but due to the fact that only the magnitude of the MRI data is used, the noise is actually Rician distributed [
Another problem for the random field theory approach is that the activity map has to be sufficiently smooth to approximate the behaviour of a continuous random field. The smoothness also has to be estimated from the data, and it is assumed to be constant across the brain. These assumptions and several others [
In this paper we have only described what is known as a
The GPU can of course also be used to speed up permutation tests for multi-subject fMRI and multi-subject PET, and not only for single-subject fMRI. The only drawback with the GPU that has been encountered so far is that some test statistics, like 3D CCA, are harder to implement on the GPU than on the CPU, due to the current limitations of the GPU. It must also be possible to calculate the test statistic in parallel; otherwise the GPU will not provide any speedup.
We have presented how to apply random permutation tests for single-subject analysis of fMRI data by using the graphics processing unit (GPU). Our work enables objective evaluation of arbitrary methods for single-subject fMRI analysis. As a pleasant side effect, the problem of multiple testing is solved in a way that significantly reduces the number of necessary assumptions. To our knowledge, our implementation is the first where the smoothing is done
This work was supported by the Linnaeus center CADICS, funded by the Swedish Research Council. The fMRI data was collected at the Center for Medical Image Science and Visualization (CMIV). The authors would like to thank the NovaMedTech project at Linköping University for financial support of our GPU hardware and Johan Wiklund for support with the CUDA installations.