High Performance Implementation of 3D Convolutional Neural Networks on a GPU

Convolutional neural networks have proven to be highly successful in applications such as image classification, object tracking, and many other tasks based on 2D inputs. Recently, researchers have started to apply convolutional neural networks to video classification, which constitutes a 3D input and requires far larger amounts of memory and much more computation. FFT based methods can reduce the amount of computation, but this generally comes at the cost of an increased memory requirement. On the other hand, the Winograd Minimal Filtering Algorithm (WMFA) can reduce the number of operations required and thus can speed up the computation, without increasing the required memory. This strategy was shown to be successful for 2D neural networks. We implement the algorithm for 3D convolutional neural networks and apply it to a popular 3D convolutional neural network which is used to classify videos and compare it to cuDNN. For our highly optimized implementation of the algorithm, we observe a twofold speedup for most of the 3D convolution layers of our test network compared to the cuDNN version.

3D convolutional neural networks have recently come to the attention of the scientific community. In [12], a database for 3D object recognition named ObjectNet3D is presented. The database focuses on the problem of recognizing the 3D pose and the shape of objects from 2D images. Another repository of 3D CAD models of objects is ShapeNet [13]. In [14], the authors propose VoxNet, a 3D convolutional neural network, to solve the robust object recognition task with the help of 3D information, while the authors of [15] propose a 3D convolutional neural networks for humanaction recognition.
In the light of these successful applications, it is worthwhile to explore new ways of speeding up the 3D convolution operation. In this paper we do so by deriving the 3D convolution forms of the minimal filtering algorithms invented by Toom and Cook [16] and generalized by Winograd [17]. Our experiments show this algorithm to be very efficient in accelerating 3D convolutional neural network in video classification applications.

Related Work
Many approaches aim to directly reduce the computational cost within CNN. In [18], the authors analyse the algebraic properties of CNNs and propose an algorithmic improvement to reduce the computational workload. They achieve a 47% reduction in computation without affecting the accuracy. In [19], convolution operations are replaced with pointwise products in the Fourier domain, which can reduce the amount of computation significantly. Reference [20] evaluates two fast Fourier transform (FFT) convolution implementations, one based on Nvidia cuFFT [21] and the other based on Facebook's FFT implementation. The FFT method can achieve an obvious speeding up of performance when the filter size is large, and the disadvantage of the FFT 2 Computational Intelligence and Neuroscience method is that it consumes much more memory than the standard method.
In [22], the authors use WMFA (Winograd Minimal Filter Algorithm) [17] to implement the convolution operation. In theory, fewer multiplications are needed in the WMFA, while not much extra memory is needed. WMFA is easy to parallelize; Lavin and Gray [22] implemented the algorithm on GPU, and they achieved better performance than the fastest cuDNN library. In [23], the authors show a novel architecture implemented in OpenCL on an FPGA platform; the algorithm they use to do the convolution is WMFA, which significantly boosts the performance of the FPGA. However, both works implemented 2D convolutional neural networks.
In this paper, we make four main contributions. Firstly, we derive the 3D forms of WMFA and design detailed algorithm to implement 3D convolution operation based on 3D WMFA. Secondly, we analyse the arithmetic complexity of 3D WMFA and prove 3D WMFA method can reduce computation in theory. Thirdly, we implement 3D WMFA for GPU platform and propose several optimization techniques to improve the performance of 3D WMFA. Finally, we evaluate the performance of 3D convolutional neural networks based on several implementations and prove the advantage of our proposed 3D WMFA method.

Preliminary: 3D Convolutional Neural Networks.
For the 2D convolution, kernels have fixed width and height, and they are slid along the width and height of the input feature maps. For the 3D convolution, both feature maps and kernels have depth dimension, and the convolution also needs to slide along the depth direction. We can compute the output of a 3D convolutional layer using the following formula: where , , , , represents the result of a convolution operation at the th channel feature and , , + , + , + is one of the input features, while , , , , is one of the filters. Equation (1) represents a direct convolution method, which requires intensive computability. The detailed arithmetic complexity of this method is shown in Section 3.3.

3D WMFA.
We introduce a new, fast algorithm to compute a 3D convolutional layer. The algorithm is based on WMFA. In order to introduce the 3D WMFA, firstly, we will give a simple introduction to the 1D WMFA. WMFA computes output with a tile size of each time; we use ( , ) to represent the output tile and is the filter size. According to the definition of convolution, 2 × 3 = 6 multiplications are required to compute (2, 3), but we can reduce the number of multiplications to do the convolution if we use the following WMFA: (3) The number of multiplications needed is ( (2, 3)) = 2 + 3 − 1 = 4; however, four additions are needed to transform the input image, three additions to transform the filter, and four additions to transform the result of the dot product. We can use a matrix form to represent the computation: We call the , , and transform matrices, and the values of the transforming matrices are In (4), = [ 0 1 2 3 ] and = [ 0 1 2 ] represent the input tile and filter tile, respectively. As described in [22], the format of the 2D WMFA is as follows: where is the filter with size × and is the image with size ( + − 1) × ( + − 1). To compute (2 × 2, 3 × 3), we need 4 × 4 = 16 multiplications; however, 4 × 9 = 36 multiplications are needed according to the convolution definition. Therefore, 2D WMFA can reduce the number of multiplications by a factor of 36/16 = 2.25 at the cost of increasing 32 additions in the data transformation stage, 28 floating point instructions at the filter transformation stage, and 24 additions at the inverse transformation stage. For a convolutional layer, the number of input channels and number of output channels are large, which means the input channels need to convolve different filters, so the transformed input tile can be reused as many times as the number of output channels. Each filter needs to be slid in and direction of input channel during convolution, so each transformed filter is reused as many times as the number of subtiles of input channel. And since the output tile is reduced along the input channels, the inverse transformation is done after reduction; then the number of inverse transformation is determined by the number of output channels. Therefore, the cost of data transformation stage, filter transformation stage, and the inverse transformation stage keep low in real convolutional layer implementation.
We can also apply the 3D WMFA to 3D convolution. To compute (2 × 2 × 2, 3 × 3 × 3), we apply the 3D Winograd transformation to the input tile and filter tile and apply 3D Winograd inverse transformation to the dot product of the transformed input image tile and the transformed filter tile. Algorithm 1 is a general form of the 3D Winograd transformation. In the algorithm, is the transformation matrix; the transformation matrix can be applied to transform the filter tile or applied to transform the input image tile. The dot product of the transformed input image tile and transformed filter tile will be accumulated along the channels, which can be converted to a matrix multiplication similar to the description in [22] , , , , = Consider the sum The previous equation can be divided into several submatrix multiplications; assume the output tile size is ( , , ]), using new coordinates ( ,̃,̃,̃) to replace ( , , , ), yielding This equation represents the matrix multiplication, and it can be simplified as follows: Algorithm 2 gives the overview of the 3D WMFA. The algorithm mainly consists of four stages, which are Winograd transformation of the input feature tile; Winograd transformation of the filter tile; the matrix multiplication, which is converted from the dot product of the transformed input tile and the transformed filter tile; and the inverse Winograd transformation of the result of the matrix multiplication.

Arithmetic Complexity Analysis.
For input feature maps with size × × × × , filters with size × × × × , and the output features with size × × × × , the total number of float operations in the multiplication stage can be represented as follows: where is the filter size and is the size of the output subtile. However, if we use the direct convolution method, which is computed according to the definition of convolution, the total number of float operations is computed as follows: Dividing 1 by 2 yields = ⌈ / ⌉⌈ / ⌉⌈ / ⌉ is the number of image tiles. = + − 1 is the input tile size. Neighbouring tiles overlap by − 1.

Implementation and Optimizations
4.1. Implementation. We have three implementation versions for the 3D WMFA on a GPU. In our implementations, cuBLAS is called to do the multiplication. Furthermore, we manually implement six kernels according to Algorithm 2. Figure 1 shows the flow of our baseline implementation. The imageTransform kernel transforms all the image subtiles, the filterTransform kernel transforms all the filter tiles, and ouputTransform kernel inversely transforms the result of the multiplication. For the baseline implementation, we also have two additional kernels to reorganize the transformed image data and transformed filter data and one kernel to reorganize the result of the multiplication before the results go to the outputTransform kernel.
Winograd transformation algorithm is suitable for parallelization on GPU. Taking the imageTransform kernel as an example, the input to the imageTransform kernel is the input feature map. We assume the input feature maps have a size of × × × × , where is the batch size, is the number of input channels, and × × is the size of a single channel. As is described in Algorithm 2, the number of image tiles for each channel is , so the total number of image tiles is × ; all those image tiles can be transformed independently. For the baseline of the GPU implementation, the image data is stored in NCDHW order, and we set the number of threads in one block to be 32, each thread is responsible for processing Winograd transformation of one input subtile, and the number of blocks in the grid is set to (⌈ /32⌉, ⌈ / ⌉⌈ / ⌉⌈ / ⌉, ). We can make full use of the large-scale parallel processing units of a GPU when the number of blocks is large.
For the filterTransform kernel, there are × filter tiles; we still set the number of threads of one block to be 32 and the number of blocks to (⌈ /32⌉, , 1). Before we call cuBLAS to implement the matrix multiplication, we need to reorganize the transformed filter and transformed image data. For transformed filter, there are × tiles in total, and each tile has size of × × , each time we gather one value from one tile to generate a submatrix of size × . our baseline implementation. They gather correlated and transformed filters and transformed image data to form two submatrices, and SGEMM from the cuBLAS library is called to do the multiplication. The result of the multiplication also needs to be reshaped before the inverse transformation.

Optimizations.
We make two optimizations to achieve higher performance. The first optimization is to align memory access, which can make memory access more efficient and increase the cache hit rate. The second optimization is to combine the transformation kernel with the reshape kernel to reduce global memory access.
The storing order of data and how data is accessed by a thread can significantly affect the performance. For the baseline implementation, the image data is stored in NCDHW order, and threads in the same block access data along the -dimension, which means data accessed by threads keeps long distances. However, the size of the cache on a GPU is limited; therefore, if the distance between the data items accessed by threads is larger than the size of the cache line, then each thread needs to access global memory separately, causing lots of memory accesses. In our first optimization version, we change the storage order of image data to CDHWN. Since the image data is stored starting from the N-dimension, data accessed by all threads in the same block is continually stored, and all data items loaded from the global memory are useful, which means the bandwidth is fully used. The same optimization method is applied to the filter transformation and inverse transformation kernel.
Based on the first optimization, we apply our second optimization to improve performance further. The second optimization is to reduce the number of global memory accesses. For the baseline implementation, after the filter transformation or image transformation kernel is executed, the transformed filter or transformed image data need to be stored in a new layout using reshapeTransformedFilter and reshapeTransformedImage kernel and using the ReshapeOutput kernel before the inverse transformation, while on our second optimized version, we move the work of the reshape kernel to the transformation kernel, so in the optimized transform kernel, after the inputs are transformed, the result will be stored directly back in the expected layout. In the optimized inverse transformation, before inverse transformation, the required data is gathered directly from the global memory.

Experimental Setup.
All experiments are evaluated on a GeForce GTX 1080 GPU, which has a total amount of 8 GBytes global memory and has 20 multiprocessors. Detailed parameters of the GTX 1080 are shown in Table 1.

Performance Evaluation.
We apply our 3D WMFA to a widely used 3D neural network called v3d [9], which is used to classify videos. The 3D neural network has five convolutional layers; Table 2 shows the information about these 3D convolutional layers. · · · · · · · · · · · · · · · · · · · · · · · · . . . . . . , N) (64 * M Figure 2: For an input matrix, its size is ( , * * * ); is the tile size, here equal to 4. After the reshape kernel is applied, lots of small submatrices with new layouts are generated. Firstly, we evaluate the performance of our three implementations on the 3D convolutional layers, except for the first convolutional layer, which has only three input channels and is not yet supported in the algorithm. Figure 3 shows the increase in speed we achieved after we used two optimizations. For the first optimization, we observe a 3 to 4 times speeding up for all these test convolution layers compared to the baseline implementation. However, for the second optimization, the maximum speeding up can be close to 42 for the third convolution; even the minimum speeding up is about 13 for the last convolution layer. The first optimization makes memory access more efficient, and the second optimization reduces lots of unnecessary global memory accesses. Since the latency of global memory access on a GPU is large, we achieve a good performance improvement in the second optimization.
We explore the detailed performance for one specific convolution layer to see how these two optimizations improve   deep-learning library. There are two algorithms available to implement a 3D convolution layer: one converts the convolution to a matrix multiplication and the other exploits the FFT tiling method to implement the convolution. We use cuDNN SGEMM and cuDNN FFT Tiling to represent the two methods called in the cuDNN library, and we use 3D WMFA to represent our algorithm. Figure 5 shows the execution time of these three methods on four convolution layers. The 3D WMFA method is about 30% slower than the cuDNN FFT Tiling method on the conv2 layer; however, it is a bit faster than cuDNN SGEMM method. The cuDNN FFT Tiling method achieves a fast speed at the cost of consuming a large amount of memory. Since parameters and are not large in the conv2 layer, this makes the matrix multiplication in 3D WMFA on small scale, which affects the performance. However, with parameters and increased on layers conv3, conv4, and conv5, the 3D WMFA method achieves better performance than the other two methods. We added the execution time of each layer for each method, the total time of all layers is 132.6 ms for cuDNN SGEMM method, 135.4 ms for cuDNN FFT Tiling method, and 108.2 ms for 3D WMFA which is better than the other two methods.
We can also calculate the performance of these two methods in TFLOPS. Table 3 shows the effective TFLOPS of cuDNN SGEMM and 3D WMFA method. We achieve a maximum speedup of 1.96 compared to cuDNN SGEMM.

Conclusions
A 3D convolution layer requires a high computational cost and consumes lots of memory. We designed a 3D WMFA 8 Computational Intelligence and Neuroscience to implement 3D convolution operation. Compared to traditional convolution methods, such as SGEMM or FFT, the 3D WMFA can reduce computation, in theory. When we implemented the algorithm on a GPU, we observed the expected performance of the algorithm in the experiments. For some 3D convolution layers, we even achieve close to 2 times speedup compared to cuDNN library.
However, the computation and memory requirements of 3D convolution obviously increase with more complex 3D neural networks. In our future work, we will implement (4× 4 × 4, 3 × 3 × 3) to reduce the computation further to ease the intensive computation problem and adopt a FP16 data type to compute, which can save half of the memory usage directly. It is also necessary to parallel the convolution computation among multi-GPUs or multinodes.