Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is exploiting the highly parallel GPU architecture and porting the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially in the form of the more powerful floating-point instructions, has in fact not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation in GPU. For practical usage of the proposed algorithm, a new method is proposed to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further promoting the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on the NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption exceeds the existing fastest integer-based result by 23%.


Introduction
With the rapid development of e-commerce and cloud computing, high-density computation of digital signatures and asymmetric cryptographic algorithms such as Elliptic Curve Cryptography (ECC) [1,2] and RSA [3] is urgently required. However, without significant performance growth in recent years, CPUs find it more and more difficult to keep pace with such rapidly increasing demands. Specialized for the compute-intensive and highly parallel computations required by graphics rendering, GPUs offer much more powerful computation capability than CPUs by devoting more transistors to arithmetic processing units rather than data caching and flow control. With the advent of NVIDIA Compute Unified Device Architecture (CUDA) technology, it is now possible to perform general-purpose computation on GPUs. Due to their powerful arithmetic computing capability and moderate cost, many researchers resort to GPUs to perform cryptographic acceleration.
Born for high-definition 3D graphics, GPUs require high-speed floating-point processing capability; thus the floating-point units residing in GPUs have increased dramatically. From 2010 to the present, the floating-point computing power of CUDA GPUs has grown almost 10 times, from 1,345/665.6 Giga Floating-point Operations Per Second (GFLOPS) (Fermi architecture) to 10,609/5,304 GFLOPS (Pascal architecture) for single/double-precision floating-point arithmetic. By contrast, the integer multiplication arithmetic of NVIDIA GPUs gained only a 25% performance improvement from the Fermi to the Kepler architecture; the latest Maxwell and Pascal architectures do not even provide dedicated device instructions for 32-bit integer multiply and multiply-add arithmetic [4].
However, the floating-point instruction set is inconvenient for realizing large integer modular multiplication, which is the core operation of asymmetric cryptography. More importantly, the floating-point instruction set in previous GPUs showed no significant performance advantage over the integer one. To the best of our knowledge, Bernstein et al. [5] are the first and, so far, the only ones to utilize the floating-point processing power of CUDA GPUs for asymmetric cryptography. However, compared with their later work [6] based on the integer instruction set, the floating-point-based one achieves only about 1/6 of the performance. Nevertheless, with the rapid development of GPU floating-point processing power, fully utilizing the floating-point processing resources will bring great benefits to asymmetric cryptography implementations on GPUs.
Based on the above observations, in this paper, we propose a new approach to implement high-performance RSA by fully exploiting the double-precision floating-point (DPF) processing power in CUDA GPUs. In particular, we propose a DPF-based large integer representation and adapt the Montgomery multiplication algorithm to it. Also, we flexibly employ the integer instruction set to compensate for the deficiencies of the DPF computing power. Besides, we fully utilize the latest shuffle instruction to share data between threads, which makes the core Montgomery multiplication algorithm a non-memory-access design and greatly decreases the performance loss in thread communication. Additionally, a method is implemented to apply the proposed DPF-based RSA decryption algorithm in practical scenarios where the input and output shall be in bit format, further improving the overall performance by about 5% by decreasing data transfer consumption with a smaller data format and leveraging GPUs to accelerate data conversion.
With these improvements, on the GTX TITAN, the performance of the proposed RSA implementation increases dramatically compared with previous works. In particular, the experimental results of RSA-2048/3072/4096 decryption with CRT reach throughputs of 42,211/12,151/5,790 operations per second, achieving 13 times the performance of the previous floating-point-based implementation [5]; RSA-4096 decryption attains 1.23 times the performance of the existing fastest integer-based one [7].
The rest of our paper is organized as follows. Section 2 introduces the related work. Section 3 presents an overview of GPU, CUDA, floating-point elementary knowledge, RSA, and Montgomery multiplication. Section 4 describes our proposed floating-point-based Montgomery multiplication algorithm in detail. Section 5 shows how to implement RSA decryption in GPUs using our proposed Montgomery multiplication. Section 6 analyzes the performance of the proposed algorithm and compares it with previous work. Section 7 concludes the paper.

Related Work
Many previous papers demonstrate that the GPU architecture can be used as an asymmetric cryptography workhorse. For ECC, Antão et al. [8,9] and Pu and Liu [10] employed the Residue Number System (RNS) to parallelize the modular multiplication into several threads. Bernstein et al. [6] and Leboeuf et al. [11] used one thread to handle a modular multiplication with Montgomery multiplication. Pan et al. [12], Zheng et al. [13], Bos [14], and Szerwinski and Güneysu [15] used fast reduction to implement modular multiplication over the Mersenne prime fields [16].
Unlike ECC schemes, RSA requires a longer, non-fixed modulus and depends on modular exponentiation. Before NVIDIA CUDA appeared on the market, Moss et al. [17] mapped RNS arithmetic to the GPU to implement a 1024-bit modular exponentiation. Later, on CUDA GPUs, Szerwinski and Güneysu [15] and Harrison and Waldron [18] developed efficient modular exponentiation with both the Montgomery multiplication Coarsely Integrated Operand Scanning (CIOS) method and RNS arithmetic. Jang et al. [19] presented a high-performance SSL acceleration using CUDA GPUs. They parallelized the Separated Operand Scanning (SOS) method [20] of the Montgomery multiplication by single limbs. Jeffrey and Robinson [21] used a similar technique to implement 256-bit, 1024-bit, and 2048-bit Montgomery multiplication on the GTX TITAN. Neves and Araujo [22] and Henry and Goldberg [23] used one thread to perform a single Montgomery multiplication to economize on the overhead of thread synchronization and communication; however, their implementations resulted in a very high latency. Emmart and Weems [7] applied one thread to a row-oriented multiplication for 1024-bit modular exponentiation and a distributed model based on the CIOS method for 2048-bit modular exponentiation. Yang [24] used an Integrated Operand Scanning (IOS) method with one or two limbs per thread for the Montgomery multiplication algorithm to implement RSA-1024/2048 decryption.
Note that all the above works are based on CUDA integer computing power. Bernstein et al. were the pioneers in utilizing CUDA floating-point processing power in asymmetric cryptography implementations [5]. They used 28 single-precision floating-point numbers (SPFs) to represent a 280-bit integer and implemented the field arithmetic. However, the result was barely satisfactory: their later work [6] on the same platform, a GTX 295, using the integer instruction set achieved almost 6.5 times the throughput of [5].

Background
In this section, brief introductions to the basic architecture of modern GPUs, floating-point arithmetic, the basics of RSA, and Montgomery multiplication are provided.
3.1. GPU and CUDA. CUDA is a parallel computing platform and programming model that enables dramatic increases in computing performance by harnessing the power of GPUs. It was created by NVIDIA and is implemented in the GPUs they produce [4].
The target platform, the GTX TITAN, is a CUDA-compatible GPU (codename GK110) with CUDA Compute Capability 3.5 [25], which contains 14 streaming multiprocessors (SMs). Within one SM, 32 threads (grouped as a warp) run concurrently in one clock cycle. Following the Single Instruction Multiple Threads (SIMT) architecture, each GPU thread runs one instance of the kernel function. A warp may be preempted when it is stalled by memory access delay, and the scheduler may switch the runtime context to another available warp. Multiple warps of threads, called a block, are usually assigned to one SM for better utilization of its pipeline. Each SM has 64 KB of on-chip memory, which can be configured as fast shared memory and L1 cache; the maximum allocation of shared memory is 48 KB [25]. Each SM also possesses 64 K 32-bit registers. All SMs share 6 GB of 256-bit-wide slow global memory, cached read-only texture memory, and cached read-only constant memory [4,25]. Each SM of GK110 owns 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store units [25], yielding a throughput of 192 SPF arithmetic, 64 DPF arithmetic, 160 32-bit integer add, and 32 32-bit integer multiplication instructions per clock cycle [4,25].
NVIDIA GPUs of Compute Capability 3.x or later bring a new method of data sharing between threads. Previously, sharing data between threads required separate store and load operations to pass data through shared memory. First introduced in the NVIDIA Kepler architecture, the shuffle instruction [4,25] allows threads within a warp to share data directly: threads within a warp can read values of other threads in any imaginable permutation [25]. NVIDIA conducted various experiments [26] comparing the shuffle instruction with shared memory, which show that the shuffle instruction always gives better performance than shared memory. This instruction is available on the GTX TITAN.

3.2. Integer and Floating-Point Arithmetic in GPU.
NVIDIA GPUs with CUDA Compute Capability 3.x support both integer and floating-point arithmetic instruction sets. This section introduces the instructions most relevant to asymmetric cryptographic algorithm implementations, including add and multiply(-add) operations.
Integer. Integer arithmetic instructions add.cc, addc, mad.cc, and madc are provided to perform multiprecision add and multiply-add operations, which reference an implicitly specified condition code register (CC) having a single carry flag bit (CC.CF) holding carry-in/carry-out [27]. With this native support for multiprecision arithmetic, most of the previous works [6, 8-11, 13-15, 18, 19, 21-23] chose to use integer instructions to implement large integer arithmetic in asymmetric cryptographic algorithms.
One noteworthy point is that the multiply and multiply-add instructions of NVIDIA GPUs with Compute Capability 3.x have a unique feature: when calculating a 32-bit multiplication, the whole 64-bit product cannot be obtained with a single instruction but requires two independent instructions (one for the lower 32-bit half and the other for the upper 32-bit half). Although a full multiplication "instruction" (mul.wide) is provided, which is used in [22,23], it is a virtual instruction rather than a native one and is compiled into native instructions.
Floating-Point. The floating-point arithmetic of NVIDIA GPUs complies with the IEEE 754 standard [28]. Among the five basic formats which the standard defines, the 32-bit binary (i.e., SPF) and 64-bit binary (i.e., DPF) formats are supported in NVIDIA GPUs. As demonstrated in Table 1, the real value assumed by given SPF or DPF data with a sign bit sign, a biased exponent exp, and a significand mantissa is (−1)^sign × 2^(exp−bias) × 1.mantissa. Therefore, each SPF or DPF can represent a 24-bit or 53-bit integer, respectively. Add and multiply(-add) instructions for floating-point are natively supported. In particular, the floating-point multiply-add operation is always implemented by the fused multiply-add (fma) instruction, which is executed as one instruction with a single rounding.
Unlike the integer instructions, the floating-point add and multiply-add instructions do not support a carry flag (CF) bit. When the result of an add instruction is beyond the limit of significand bits (24 for SPF, 53 for DPF), a round-off operation happens, in which the least significant bits are dropped to keep the significand within the limit. This round-off causes precision loss, which is intolerable in cryptographic calculation. Thus, in the algorithm design, all operations should be carefully scheduled to avoid it. Table 2 compares the computing power of integer, SPF, and DPF arithmetic on the target platform GTX TITAN. Integer is more advantageous than floating-point in the add operation, but the multiply(-add) operation of DPF delivers 2.6 times the performance of integer.

3.3. RSA and Montgomery Multiplication. RSA [3] is an algorithm widely used for digital signature and asymmetric encryption, whose core operation is modular exponentiation. In practical scenarios, CRT [29] is widely used to speed up RSA decryption. Instead of calculating a 2n-bit modular exponentiation directly, two n-bit modular exponentiations ((1a) and (1b)) and the Mixed-Radix Conversion (MRC) algorithm [30] (2) are performed in sequence to conduct the RSA decryption:

  m_p = c^(d_p) mod p,  (1a)
  m_q = c^(d_q) mod q,  (1b)
  m = m_q + ((m_p − m_q) × (q^(−1) mod p) mod p) × q,  (2)
where p and q are the prime factors of the RSA modulus, and d_p = d mod (p − 1) and d_q = d mod (q − 1) are derived from the private key [31]. Compared with calculating the 2n-bit modular exponentiation directly, the CRT technique gives about 3 times the performance [29]. Even with the CRT technique, the bottleneck restricting the overall performance of RSA lies in modular multiplication. In 1985, Montgomery proposed an algorithm [32] to remove the costly division operation from the modular reduction. Let ā = aR (mod N) and b̄ = bR (mod N) be the Montgomeritized forms of a and b modulo N, where R and N are coprime and N ≤ R. Montgomery multiplication defines the multiplication between two Montgomeritized-form numbers, MonMul(ā, b̄) = ā·b̄·R^(−1) (mod N). Even though the algorithm works for any R that is relatively prime to N, it is most useful when R is taken to be a power of 2, which leads to a fast division by R.
A series of improvements to the Montgomery multiplication have been published since its introduction. In 1995, Orup [33] eliminated the determination of quotients by loosening the restriction on input and output from [0, N) to [0, 2N). Algorithm 1 shows the detailed steps.
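At the integer level, this high-radix Montgomery multiplication can be sketched in Python as follows. This is a minimal model, not the GPU data layout: it assumes limb size e, k limbs, R = 2^(ke) > 4N, and the precomputed constant w = −N^(−1) mod 2^e; variable names are ours.

```python
def monmul(A, B, N, e, k, w):
    # S = A * B * 2^(-k*e) mod N, with inputs and output in [0, 2N)
    base = 1 << e
    S = 0
    for i in range(k):
        Ai = (A >> (i * e)) & (base - 1)          # i-th e-bit limb of A
        S += Ai * B                               # multiply-accumulate step
        q = ((S & (base - 1)) * w) & (base - 1)   # quotient digit, no division
        S = (S + q * N) >> e                      # exact division by 2^e
    return S

# hypothetical small parameters for illustration
N = 101 * 103                     # odd modulus
e, k = 8, 2                       # R = 2^(k*e) = 65536 > 4N
w = (-pow(N, -1, 1 << e)) % (1 << e)
```

Because inputs stay in [0, 2N) and R > 4N, the output is again below 2N, so multiplications can be chained without conditional subtractions, which is exactly why Orup's relaxation suits a branch-averse GPU design.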

DPF-Based Montgomery Multiplication
We aim to implement 2n-bit RSA decryption with the CRT introduced in Section 3.3; thus n-bit Montgomery multiplication shall be implemented first. This section proposes a DPF-based Montgomery multiplication parallel calculation scheme targeted at CUDA GPUs, including the large integer representation, the fundamental operations, and the parallelization of Montgomery multiplication.

4.1. Advantages and Challenges of DPF-Based RSA.
The DPF instruction set has a huge advantage for asymmetric cryptographic algorithm acceleration, as demonstrated in Table 2. Since RSA builds on large integer multiplication, such a huge advantage becomes a decisive point in competing with the integer instruction set.
However, several problems arise when exploiting its theoretically superior multiplication computing power.
(i) Round-Off Problem. Due to the round-off problem, every detail should be taken into careful consideration, which makes it very difficult and complicated to design and implement the algorithm.
(ii) Nonsupport for Carry Flag. Lack of support for carry flag makes it very inconvenient and inefficient to perform multiprecision add or multiply-add operation.
Instead of using only one carry-flag-supported integer instruction, the carry has to be manually handled via multiple add or multiply-add instructions.
(iii) Inefficient Add Instruction. As Table 2 shows, unlike the multiplication instruction, the DPF add instruction is slower than the integer add instruction. Moreover, the lack of carry-flag support makes it perform even worse in multiprecision add operations.
(iv) Inefficient Bitwise Operations. Floating-point arithmetic does not support the bitwise operations that are frequently needed. The CUDA Math API does provide the fmod function [34], which can be employed to extract the least and most significant bits, but it costs tens of instructions, which is extremely inefficient, whereas with the native integer instruction set a bitwise operation needs only one instruction.
(v) Extra Memory and Register File Cost. A DPF occupies 64 bits of memory space; however, only 26 or fewer bits are used. In this way, an extra cost of (64 − 26)/26 ≈ 146% in memory access and register file utilization has to be paid. In an integer-based implementation, this issue does not arise since every bit of an integer is utilized.
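The round-off hazard in item (i) can be reproduced in any IEEE 754 environment; Python floats are 64-bit DPFs, so a short sketch shows both the 53-bit exactness boundary and the accumulation budget for 26-bit limbs:

```python
# Python floats are IEEE 754 DPFs with a 53-bit significand.
boundary = 2.0 ** 53
lost = (boundary + 1.0) == boundary           # the +1 is silently rounded off

limb = float((1 << 26) - 1)                   # a 26-bit limb
product = limb * limb                         # at most 52 bits: still exact
product_exact = (product == ((1 << 26) - 1) ** 2)

# with 26-bit limbs, only two exact multiply-accumulations fit below 2^53
fma_budget = (2 ** 53 - 1) // ((2 ** 26 - 1) ** 2)
```

This silent rounding, rather than any exception or flag, is why every intermediate value in the algorithm must be bounded by construction.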
Before introducing the proposed algorithm, several symbols are defined in Table 3 for better readability of the following sections.

4.2. DPF-Based Representation of Large Integer.
In Montgomery multiplication, the multiply-accumulate operation d = a × b + d is frequently used. In CUDA, the fused multiply-add (fma) instruction is provided to perform the floating-point multiply-add operation.

Input: an odd modulus N > 2 with gcd(N, 2) = 1; positive integers e and k such that 2^(ke) > 4N; the precomputed constant w = −N^(−1) mod 2^e; A and B with 0 ≤ A < 2N and 0 ≤ B < 2N;
Output: an integer S such that S = A·B·2^(−ke) mod N and 0 ≤ S < 2N;
Algorithm 1: High-radix Montgomery multiplication without determination of quotients (CIOS).

When each DPF operand (a and b) contains e (e ≤ 26) significant bits, ⌊(2^53 − 1)/(2^e − 1)^2⌋ executions of d = a × b + d (with the initial value of d being zero) can be performed free from the round-off problem. Note that, in Algorithm 1, there are k loops and each loop contains 2 fma operations for each limb, where k = ⌈(n + 2)/e⌉ and n stands for the bit length of the modulus. Thus 2 × ⌈(n + 2)/e⌉ fma operations are needed in total per limb. The following requirement should be met for e:

  ⌊(2^53 − 1)/(2^e − 1)^2⌋ ≥ 2 × ⌈(n + 2)/e⌉,  (3)

so that the number of supported fma operations surpasses the number required. A lower e also means more instructions are required to process the whole algorithm. From Formula (3), to achieve the best performance, e is chosen as shown in Table 4.
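Formula (3) can be checked numerically. The sketch below (helper names are ours) computes the round-off-free fma budget for a limb size e and picks the largest e that satisfies the requirement; the actual Table 4 choices in the paper may differ, since other instruction costs are also weighed.

```python
import math

def fma_budget(e):
    # exact d = a*b + d accumulations that stay below 2^53
    return (2 ** 53 - 1) // ((2 ** e - 1) ** 2)

def fma_required(n, e):
    # Algorithm 1 needs 2 * ceil((n + 2) / e) fma operations per limb
    return 2 * math.ceil((n + 2) / e)

def largest_valid_e(n):
    # largest limb size (fewest limbs) still satisfying Formula (3)
    return max(e for e in range(1, 27) if fma_budget(e) >= fma_required(n, e))
```

For n = 1024 this reconstruction admits e = 23 but rejects e = 24, illustrating the tradeoff: a larger e shrinks the limb count but exhausts the 53-bit accumulation headroom.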
In this contribution, two kinds of DPF-based large integer representations are proposed, Simplified format and Redundant format.
Simplified format: A = Σ_{i=0}^{k−1} A[i] × 2^(ie), where each limb A[i] contains at most e significant bits. It is applied to represent the input of the fma instruction.
Redundant format: A = Σ_{i=0}^{k−1} A[i] × 2^(ie), where each limb A[i] contains at most 53 significant bits. It is applied to accommodate the output of the fma instruction.
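A minimal integer-level model of the two representations (function names are ours): Simplified limbs carry at most e bits, Redundant limbs may grow up to 53 bits, and both evaluate with the same radix-2^e positional rule.

```python
def to_simplified(x, e, k):
    # split a nonnegative integer into k limbs of at most e bits each
    mask = (1 << e) - 1
    return [float((x >> (i * e)) & mask) for i in range(k)]

def evaluate(limbs, e):
    # sum(limbs[i] * 2^(i*e)); valid for Simplified and Redundant limbs
    return sum(int(v) << (i * e) for i, v in enumerate(limbs))
```

Because both formats share the evaluation rule, fma outputs can stay in Redundant form through a whole multiplication and be normalized only once at the end.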

4.3. Fundamental Operations and Corresponding Optimization.
In DPF-based Montgomery multiplication, the fundamental operations include multiplication, multiply-add, addition, and bit extraction.
(i) Multiplication. In the implementation, the native multiplication instruction (mul) is used to perform multiplication. Both multiplicand and multiplier are required to be in Simplified format to avoid the round-off problem.
(ii) Multiply-Add. In CUDA GPUs, the fma instruction is provided to perform the floating-point operation d = a × b + c, which is executed in one step with a single rounding. In the implementation, when using the fma instruction, the multiplicand and multiplier are required to be in Simplified format and the addend in Redundant format but less than (2^53 − 2^46).
(iii) Bit Extraction. The algorithm needs to extract the most or least significant bits from a DPF. However, as introduced in Section 4.1, bitwise operations on DPFs are inefficient. Two improvements are attempted to promote the performance. The first one is introduced in [35]: using round-to-zero mode, we can extract the most significant e bits and the least significant (53 − e) bits v from a DPF x. Note that, in t = x + 2^(52+53−e), the least significant (53 − e) bits are left out due to the round-off operation, so the high part is recovered as t − 2^(52+53−e) and the low part as v = x − (t − 2^(52+53−e)). The second one is converting the DPF to an integer and then using the CUDA native 32-bit integer bitwise instructions to handle bit extraction. Through the experiments, it is found that the second method always gives better performance; therefore, in most cases, the second method is used to handle bit extraction. There is only one exception: when the DPF x is divisible by 2^j, the DPF division x/2^j is used to extract the most significant (53 − j) bits, which can be executed very fast.
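The preferred integer-conversion method is easy to model in Python (the magic-constant split is omitted here because Python floats use round-to-nearest, not the round-to-zero mode the first trick relies on); helper names are ours.

```python
def extract_bits(x, e):
    # convert the DPF to an integer, then use integer bitwise operations
    xi = int(x)
    high = float(xi >> e)             # most significant bits
    low = float(xi & ((1 << e) - 1))  # least significant e bits
    return high, low

def high_by_division(x, j):
    # exception: x divisible by 2^j -> one exact DPF division suffices
    return x / float(1 << j)
```

Division by a power of two only rescales the exponent of a DPF, which is why the divisible case needs no bit manipulation at all.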
(iv) Addition. CUDA GPUs provide a native add instruction to perform addition between two DPFs. But, as aforementioned, it is inefficient and provides no support for a carry flag. Thus, the DPFs are first converted into integers, and then the CUDA native integer add instruction is used to handle addition.

4.4. DPF-Based Montgomery Multiplication Algorithm.
With reference to Algorithm 1, in the Montgomery multiplication S = A·B·R^(−1) (mod N), the values A, B, N, and S are all (n + 1)-bit integers (in fact, N is n bits long, but it is also represented as an (n + 1)-bit integer for a common format). Choosing e as the limb size, k = ⌈(n + 1)/e⌉ DPF limbs are required to represent each of A, B, N, and S.
In previous works [19,21], Montgomery multiplication is parallelized by single limbs; that is, each thread deals with one limb (a 32-bit or 64-bit integer). One-limb parallelism incurs a large cost in thread synchronization and communication, which greatly decreases the overall throughput. To maximize throughput, Neves and Araujo [22] performed one entire Montgomery multiplication in one thread to economize on the overhead of thread synchronization and communication, however resulting in a high latency: about 150 ms for 1024-bit RSA, which is about 40 times that of [19].
To make a tradeoff between throughput and latency, the implementation adopts multiple-limb parallelism, namely, using t threads to compute one Montgomery multiplication with each thread dealing with u limbs, where u = ⌈k/t⌉. The degree of parallelism t can be flexibly configured to offer the maximal throughput with acceptable latency. Additionally, we restrict the t threads of one Montgomery multiplication within a warp, in which threads are naturally synchronized free from the overhead of thread synchronization, and the shuffle instruction can be used to share data between threads. To fully occupy the thread resources, t shall be a divisor of the warp size (i.e., 32). In the proposed Montgomery multiplication S = A·B·R^(−1) (mod N), the inputs A, B, and N are in Simplified format and the initial value of S is 0. Two phases are employed to handle one Montgomery multiplication: Computing Phase and Converting Phase. In Computing Phase, Algorithm 2 is responsible for calculating S, whose result is represented in Redundant format. Then, in Converting Phase, S is converted from Redundant format to Simplified format by applying Algorithm 3.

(2) q = ((S mod 2^e) × w) (mod 2^e): S mod 2^e is only related to S[0], which is stored in Thread 0 as s[0]. Therefore, in this step, the calculation is only conducted in Thread 0, while the other threads are idle. Note that S is in Redundant format, and the least significant e bits temp of S[0] shall first be extracted before executing q = temp × w. In the next step, q will act as a multiplier; hence the same bit extraction shall also be applied to q.

After Computing Phase, S is in Redundant format. Next, Converting Phase is used to convert S into Simplified format.
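The quotient-digit step described above can be modeled at integer level. The sketch (names are ours, parameters hypothetical) extracts the low e bits of the redundant limb s[0], forms q = ((S mod 2^e) × w) mod 2^e, and relies on the defining property that S + q·N becomes divisible by 2^e:

```python
def quotient_digit(s0, w, e):
    # s0: Thread 0's Redundant-format limb (an integer-valued DPF < 2^53)
    mask = (1 << e) - 1
    temp = int(s0) & mask      # least significant e bits of S[0]
    return (temp * w) & mask   # q, reduced to e bits since it acts as a multiplier

# hypothetical small parameters
N, e = 10403, 8
w = (-pow(N, -1, 1 << e)) % (1 << e)   # w = -N^(-1) mod 2^e
```

Only the lowest limb participates, which is why the step runs in "Thread 0" alone while the other threads idle for one instruction.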

Converting Phase.
In Converting Phase, S is converted from Redundant format to Simplified format: every s[i] adds the incoming carry (s[0] does not execute this addition), holds the least significant e bits of the sum, and propagates the most significant (53 − e) bits as a new carry to s[i + 1]. However, this method is serial: the calculation of every s[i] depends on the carry that s[i − 1] produces, which does not comply with the parallel architecture of GPUs. In practice, a parallelized method is applied to accomplish Converting Phase, as shown in Algorithm 3.
Algorithm 3 uses the symbol split(c) = (h, l) = (⌊c/2^e⌋, c mod 2^e) to denote that a 53-bit integer c is divided into its (53 − e) most significant bits h and its e least significant bits l. Firstly, every thread executes a chain addition over its own limbs s[0] ∼ s[u − 1] and stores the final carry. Then every Thread i (except Thread (t − 1)) propagates the stored carry to Thread (i + 1) using the shuffle instruction, and the chain addition is repeated with the propagated carry. This step continues until the carry of every thread is zero, which can be checked by the CUDA __any() voting instruction [4]. The number of iterations is (t − 1) in the worst case, but in most cases it takes one or two. Compared with the serial method, over 75% of the execution time of Converting Phase is saved by the parallel method.
After Converting Phase, S is in Simplified format. An entire Montgomery multiplication is completed.
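The parallel carry propagation of Algorithm 3 can be simulated sequentially in Python. The sketch (names are ours) treats t "threads" of u limbs each and iterates until all outgoing carries vanish, mirroring the __any() vote:

```python
def converting_phase(limbs, e, t, u):
    # limbs: t*u Redundant limbs (integer-valued, < 2^53) -> e-bit Simplified limbs
    mask = (1 << e) - 1
    s = [int(v) for v in limbs]
    carry_in = [0] * t
    while True:
        carry_out = [0] * t
        for tid in range(t):               # every "thread" works independently
            c = carry_in[tid]
            for j in range(u):             # local chain addition
                total = s[tid * u + j] + c
                s[tid * u + j] = total & mask
                c = total >> e             # high (53 - e) bits become the carry
            carry_out[tid] = c
        if not any(carry_out):             # __any()-style zero-carry vote
            return s
        # for an in-range S (< 2^(t*u*e)) the last thread's carry is always zero
        carry_in = [0] + carry_out[:-1]    # shuffle each carry to the next thread
```

Each pass is one round of "local addition, then shuffle the carry one thread to the right", so the loop count matches the iteration bound (t − 1) stated above.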

RSA Implementation
This section introduces the techniques on the implementation of Montgomery exponentiation, CRT computation, and pre-/postcomputation format conversion.

5.1. Montgomery Exponentiation.
As introduced in Section 3.3, RSA with the CRT technique requires two n-bit Montgomery exponentiations S = A^E (mod N) to accomplish a 2n-bit RSA decryption. With the binary square-and-multiply method, the expected number of modular multiplications is 3n/2 for an n-bit modular exponentiation. The number can be reduced with the m-ary method [36], which scans multiple bits of the exponent instead of one. Jang et al. [19] used the sliding-window technique Constant Length Nonzero Windows (CLNW) [37] to reduce the number of Montgomery multiplications further. But it is not suitable for our design, in which an entire warp may contain more than one Montgomery exponentiation, each with a different exponent, leading to different scanning step sizes and execution logic. These differences would cause warp divergence, largely decreasing the overall performance. Moreover, the m-ary method is resistant to timing attacks, as its calculation time is not affected by the bit pattern of the exponent. For all the above reasons, we choose to employ the m-ary method rather than the CLNW method.
In the m-ary method, where m = 2^v, the window size v shall be chosen properly to reduce both the number of modular multiplications (n + ⌈n/v⌉ + 2^v) and the number of memory accesses (⌈n/v⌉ + 2^v); the window size v = 6 is employed as it offers the least computational cost and memory access. In this contribution, the 2^6-ary method is implemented and reduces the number of modular multiplications from 1536 to 1259 for a 1024-bit modular exponentiation, achieving a 17.9% improvement. In the m = 2^6-ary method, a (2^6 − 2)-entry precomputed table (A^2 ∼ A^63) needs to be stored in memory for each Montgomery exponentiation. The memory space required (about 512 KB) is far more than the size of shared memory (at most 48 KB); thus the tables have to be stored in global memory. Global memory load and store operations consume hundreds of clock cycles. To improve memory access efficiency, the Simplified-format precomputed tables are converted into e-bit integer format using the GPU format-convert instruction and then stored in global memory. This optimization saves about 50% of the memory access cost.
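A plain-Python model of the 2^v-ary scan illustrates the operation count n + ⌈n/v⌉ + 2^v (this is not the GPU kernel; MonMul is replaced here by ordinary modular multiplication):

```python
def mary_modexp(a, exp, n, v=6):
    # left-to-right 2^v-ary exponentiation
    table = [1, a % n]
    for _ in range(2, 1 << v):
        table.append(table[-1] * a % n)   # precompute a^2 .. a^(2^v - 1)
    digits = []
    while exp:
        digits.append(exp & ((1 << v) - 1))
        exp >>= v
    r = 1
    for d in reversed(digits):
        for _ in range(v):
            r = r * r % n                 # v squarings per window
        if d:
            r = r * table[d] % n          # one multiplication per nonzero digit
    return r
```

For n = 1024 and v = 6 the estimate gives 1024 + 171 + 64 = 1259 modular multiplications, versus 3 × 1024/2 = 1536 for binary square-and-multiply, matching the 17.9% figure above.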

Input: A: Simplified-format base; E: n-bit integer exponent; N: Simplified-format modulus; R: radix, R = 2^(e×k);
Output: the Simplified-format result S = A^E (mod N);
Algorithm 4: DPF-based Montgomery exponentiation.

The exponent E is scanned in radix-2^6 digits, which are assigned to the threads equally. In Montgomery exponentiation, the exponent is represented as an integer. Similar to the processing of shared data in the Montgomery multiplication S = A·B·R^(−1) mod N, each thread stores a piece of E and uses shuffle to broadcast it from the owning thread to all threads. Algorithm 4 shows how to conduct Montgomery exponentiation, where S = MonMul(A, B, N) means calculating S = A·B·R^(−1) mod N using Algorithms 2 and 3.

5.2. CRT Computation.
In the first implementation, the GPU only took charge of the Montgomery exponentiation, and the CRT computation (2) was offloaded to the CPU using the GNU Multiple Precision (GMP) arithmetic library [38]. However, we found that the limited CPU computing power greatly restricted the performance of the entire algorithm: the CRT computation occupied about 15% of the execution time. Thus, we integrated the CRT computation into the GPU. For the CRT computation, a modular subtraction and a multiply-add function are additionally implemented. Both functions are parallelized over the threads that take charge of the Montgomery exponentiation. As a result, the CRT computation occupies only about 1% of the execution time and is independent of the CPU computing capability. The scheme is shown in Figure 1.

In practical scenarios, the input and output of RSA decryption are octet strings, handled as 32-bit integers, while the proposed DPF-based large integer representation is not consistent with them. Thus, before and after applying the proposed DPF-based RSA algorithm, a conversion between DPF and 32-bit integer formats shall be conducted.
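The CRT recombination (2) corresponds to the following integer-level sketch of standard CRT-RSA (parameter names are ours, not the paper's); the modular subtraction and multiply-add mentioned above are the (m_p − m_q)·q^(−1) mod p and m_q + h·q steps.

```python
def crt_decrypt(c, p, q, d):
    # two half-size exponentiations, (1a) and (1b)
    mp = pow(c % p, d % (p - 1), p)
    mq = pow(c % q, d % (q - 1), q)
    # MRC recombination (2): one modular subtraction/multiplication, one multiply-add
    h = (mp - mq) * pow(q, -1, p) % p
    return mq + h * q
```

Because each exponentiation works modulo an n-bit prime instead of the 2n-bit modulus, the two calls are far cheaper than one full-size pow, which is the source of the roughly 3x speedup cited above.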

5.3. Pre- and Postcomputation Format Conversion.
The first attempt was to conduct the conversion on the CPU, which is easy to implement, as shown in Figure 2(a): before and after the GPU kernel, its DPF-format input and output are converted from or to bit strings using the bitwise and format-conversion instructions of the CPU. However, this strategy has two significant drawbacks. Firstly, transferring data in DPF format enlarges the data volume copied between CPU and GPU, since each 64-bit DPF limb carries far fewer significant bits.
Secondly, converting the data format on the CPU is inefficient. The simplest way is to use serial operations before and after the GPU kernel on the CPU. For example, in the experiment, 896 RSA-2048 transformations take over 1.1 ms using a single core of the CPU (an Intel E5-2697 v2). The GPU, by contrast, is adept at accomplishing the conversion with thousands of threads: the conversions for the same amount of RSA operations can be completed in about 0.1 ms on the GTX TITAN.
With full consideration of the above two drawbacks, we instead transfer data between GPU and CPU as integers rather than DPFs, and in the GPU kernel t × 2 threads for each RSA decryption accomplish the data conversion before and after the RSA decryption calculation, as shown in Figure 2(b).
Precomputation Conversion. First, the CPU transfers the 32-bit-integer-format inputs to the global memory of the GPU. According to the DPFs it needs to process in Algorithms 2 and 3, each CUDA thread extracts e bits from the corresponding 32-bit integers, reconstructs them in e-bit-integer format using bitwise instructions, and finally converts the e-bit integers into Simplified-format DPFs using the integer-to-DPF instruction as the input of the proposed algorithm.
Postcomputation Conversion. The postcomputation conversion is the reverse procedure of the precomputation conversion, converting the Simplified-format outputs into 32-bit-integer format and then returning them to the CPU. This method largely reduces the cost of data transfer and leverages the computing power of GPUs. For 2048-bit RSA decryption, the overall latency decreases by up to 1.2 ms (about 5%).
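The round trip between 32-bit words and e-bit Simplified limbs can be sketched as follows (little-endian word order is assumed here; the paper does not specify the ordering, and the names are ours):

```python
def words_to_limbs(words, e, k):
    # 32-bit words -> k Simplified-format DPF limbs of at most e bits each
    x = 0
    for i, word in enumerate(words):
        x |= word << (32 * i)
    mask = (1 << e) - 1
    return [float((x >> (i * e)) & mask) for i in range(k)]

def limbs_to_words(limbs, e, n_words):
    # Simplified-format DPF limbs -> 32-bit words
    x = 0
    for i, v in enumerate(limbs):
        x |= int(v) << (i * e)
    return [(x >> (32 * i)) & 0xFFFFFFFF for i in range(n_words)]
```

Since e < 32, each thread only touches at most two adjacent 32-bit words per limb, which is what makes the per-thread extraction on the GPU cheap.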

Performance Evaluation
This section presents the implementation performance and summarizes the results of the proposed algorithm. A relative assessment is also presented by considering related implementations. The hardware and software configurations used in the experiment are listed in Table 5.

6.1. Experiment Results.
Applying the DPF-based Montgomery multiplication algorithm and the RSA implementation described in Sections 4 and 5, respectively, RSA-2048/3072/4096 decryptions are implemented on the NVIDIA GTX TITAN.
Several configuration parameters may affect the performance of the kernel, including Batch Size and Threads/RSA. Note that the configuration Batch Size is varied from a small number up to the hardware limitation. Figure 3 summarizes the impact of Batch Size on the performance, which indicates that a larger Batch Size always leads to better performance. This is easy to understand: GPUs pipeline the instructions to leverage instruction-level parallelism [4], and configuring as large a Batch Size as possible keeps the pipeline busy most of the time and fully utilizes the computing resources of the GPU, thereby yielding better performance. As throughput is the factor of greatest concern in this contribution, Table 6 uses the maximum Batch Size supported by the GPU hardware for assessment (half of the maximum Batch Size is also provided for comparison), although some lower Batch Size configurations bring a lower latency with an almost equivalent throughput.
Another important performance factor is Threads/RSA. In Table 6, the column Threads/RSA = n × 2 indicates that n threads process one Montgomery multiplication and n × 2 threads process one RSA decryption. As discussed in Section 4.4, parallelizing a single computing task across too many threads theoretically results in poor throughput, while too few threads may lead to high latency. Aiming at practical usage while emphasizing high throughput, the experiments apply as few Threads/RSA as possible on the premise of an acceptable latency.
However, even disregarding the latency requirement, at some points parallelism with too few threads may not improve but instead largely decrease the overall throughput, owing to the hardware limit on the register file per thread. An example from the experiments is RSA-4096 with Threads/RSA = 4 × 2, as shown in Table 6. For RSA-2048, the throughput of Threads/RSA = 4 × 2 always exceeds that of Threads/RSA = 8 × 2. For RSA-4096 with Threads/RSA = 4 × 2, however, the growing number of required variables far exceeds the hardware limit (127 registers per thread on the GTX TITAN when Batch Size is 14 × 64); many variables therefore spill into off-chip local memory, which is hundreds of times slower than registers. Consequently, the throughput is much lower than with Threads/RSA = 8 × 2. In fact, the proposed DPF-based algorithm is more sensitive to the number of available registers, since it consumes more register file than the normal integer-based one, as specified in Section 4.1.
To summarize, for best practice with the proposed DPF-based algorithm, many factors should be considered together to suit both the GPU hardware and the RSA modulus size, especially Threads/RSA and Regs/Thread, whose requirements grow with the modulus size (7,680 bits and more).
Table 7 shows the performance of the work [5] and ours. Note that [5] only implemented 280-bit modular multiplication; its performance is therefore scaled by (280/1024)² to estimate 1024-bit performance, and further scaled by the ratio of floating-point processing power of the corresponding GPUs. The scaled result appears in the row "1024-bit MulMod (scaled)." Table 7 demonstrates that our implementation achieves a 13-times speedup over [5]. Part of this improvement results from the advantage of DPF over SPF, as discussed in Section 3.2; they did not use DPF because their GTX 295 does not support DPF instructions. A second reason lies in the Montgomery multiplication procedure: Bernstein et al. used the Separated Operand Scanning (SOS) Montgomery multiplication method, which is known to be inefficient [20], and they used only 28 threads of a warp (consisting of 32 threads), wasting 1/8 of the processing power. Third, we use the latest CUDA shuffle instruction to share data between threads, whereas [5] used shared memory; as introduced in Section 3.1, the shuffle instruction performs better than shared memory. Finally, Bernstein et al. [5] used floating-point arithmetic for all operations, some of which, such as bit extraction, are more efficient with integer instructions; by contrast, we flexibly use integer instructions to accomplish these operations.
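To make the contrast with SOS concrete, the following is a minimal integer CIOS (Coarsely Integrated Operand Scanning) Montgomery multiplication, which interleaves the multiplication and reduction passes. It is a generic plain-C illustration with an illustrative word count, not the paper's DPF-based kernel.

```c
#include <stdint.h>
#include <string.h>

#define NW 2  /* number of 32-bit words; illustrative */

/* -n^{-1} mod 2^32 for odd n0, via Newton iteration: each step doubles
 * the number of correct low bits of the inverse. */
static uint32_t mont_n0inv(uint32_t n0)
{
    uint32_t x = n0;
    for (int i = 0; i < 5; i++)
        x *= 2u - n0 * x;
    return (uint32_t)(0u - x);
}

/* CIOS Montgomery multiplication: r = a*b*R^{-1} mod n, R = 2^(32*NW),
 * n odd, a,b < n.  The reduction digit m is chosen so that adding m*n
 * zeroes the low word of t, allowing the one-word shift. */
static void mont_mul(uint32_t r[NW], const uint32_t a[NW],
                     const uint32_t b[NW], const uint32_t n[NW],
                     uint32_t n0inv)
{
    uint32_t t[NW + 2] = {0};
    for (int i = 0; i < NW; i++) {
        uint64_t c = 0;                       /* t += a * b[i] */
        for (int j = 0; j < NW; j++) {
            uint64_t s = (uint64_t)t[j] + (uint64_t)a[j] * b[i] + c;
            t[j] = (uint32_t)s;
            c = s >> 32;
        }
        uint64_t s = (uint64_t)t[NW] + c;
        t[NW] = (uint32_t)s;
        t[NW + 1] = (uint32_t)(s >> 32);

        uint32_t m = t[0] * n0inv;            /* t = (t + m*n) >> 32 */
        c = ((uint64_t)t[0] + (uint64_t)m * n[0]) >> 32;
        for (int j = 1; j < NW; j++) {
            uint64_t s2 = (uint64_t)t[j] + (uint64_t)m * n[j] + c;
            t[j - 1] = (uint32_t)s2;
            c = s2 >> 32;
        }
        s = (uint64_t)t[NW] + c;
        t[NW - 1] = (uint32_t)s;
        t[NW] = t[NW + 1] + (uint32_t)(s >> 32);
    }
    /* conditional final subtraction: r = t - n if t >= n, else t */
    uint32_t u[NW];
    uint64_t borrow = 0;
    for (int j = 0; j < NW; j++) {
        uint64_t d = (uint64_t)t[j] - n[j] - borrow;
        u[j] = (uint32_t)d;
        borrow = (d >> 32) & 1;
    }
    memcpy(r, (t[NW] == 0 && borrow) ? t : u, sizeof(u));
}
```

In SOS the full double-length product is formed first and reduced afterwards, requiring a second pass over a 2×NW-word intermediate; CIOS folds the reduction into the outer loop so only NW + 2 words are live, which is why it maps better onto a register-constrained GPU thread.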

Proposed versus Integer-Based Implementation.
Previous works [7,19,21-23] are all integer-based, so we scale their CUDA platform performance according to integer processing power. The parameters in Table 8 originate from [4,40], but the integer processing power is not given directly. Taking the SM count, the processing power of each SM, and the Shader Clock into consideration, we calculate the integer multiplication instruction throughput Int Mul. Among the platforms, the 8800 GTS and GTX 260 support only a 24-bit multiply instruction, while the others support a 32-bit multiply; hence we adjust their integer multiply capability by a correction factor of (24/32)² (the unadjusted data is in parentheses). The Throughput Scaling Factor is defined as the Int Mul ratio between the corresponding CUDA platform and the GTX TITAN, and the Latency Scaling Factor is defined as the Shader Clock ratio. Table 8 summarizes the resulting performance of each work; we divide each result by the corresponding Scaling Factor in Table 8 to obtain the scaled performance. Note that the RSA key length of Neves and Araujo [22] is 1024 bits while ours is 2048 bits, so we multiply their result by an additional factor of 1/4 × 1/2 = 1/8 (1/4 for the performance of modular multiplication, 1/2 for the half-length exponent).
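The scaling arithmetic above can be sketched as follows. The device numbers in the test are hypothetical placeholders, not the values from Table 8.

```c
/* Sketch of the cross-platform scaling described in the text:
 * Int Mul = SMs * mul ops per SM per clock * shader clock, with a
 * (24/32)^2 correction for devices whose multiplier is 24-bit only. */
static double int_mul_throughput(int sms, double mul_per_sm_per_clk,
                                 double clock_ghz, int only_24bit)
{
    double thr = sms * mul_per_sm_per_clk * clock_ghz;   /* Gops/s */
    if (only_24bit)
        thr *= (24.0 / 32.0) * (24.0 / 32.0);            /* width penalty */
    return thr;
}

/* Throughput Scaling Factor of a device relative to the reference GPU */
static double scaling_factor(double dev_thr, double ref_thr)
{
    return dev_thr / ref_thr;
}
```

A reported throughput on an older device is then divided by its scaling factor to estimate what the same code would deliver on the reference GPU.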
From Table 8, at the modular multiplication level the proposed implementation outperforms the others by a large margin. We achieve nearly a 6-times speedup over the work [23], and even on the same CUDA platform we obtain 221% of the performance of the work [21].
At the RSA implementation level, the proposed implementation also shows a great performance advantage: 291% of the performance of the work [22] for RSA-2048. Note that the RSA-1024 decryption of [22] has a latency of about 150 ms, whereas our 2048-bit RSA decryption takes 21.52 ms at its throughput peak. Yang [24] reported a latency of 195.27 ms (scaled: 225.60 ms) at an RSA-2048 throughput of 5,244 (scaled: 31,782). The throughput of the proposed RSA-2048 implementation is 132% of that of Yang's work, while the latency is 9.4% of theirs [24]. Our peak RSA-2048 throughput is slightly lower than that of the fastest integer-based implementation [7], but our latency is 3 times lower. For RSA-4096 decryption, Emmart and Weems [7] report a throughput of 5,257 decryptions on a GTX 780 Ti using a distributed model based on the CIOS method with distributed integer values; scaled to a GTX TITAN this is 4,693, and our work is 1.23 times faster.
The performance advantage lies mainly in the utilization of floating-point processing power and the careful handling of Montgomery multiplication, which overcomes the problems addressed in Section 4.1. The DPF-based representation of large integers and the Montgomery multiplication are carefully scheduled to avoid round-off problems and to largely decrease the number of expensive carry-flag handling steps. For the inefficient DPF add instruction and bitwise operations, the integer instruction set is flexibly employed to compensate for the deficiency. Moreover, the precomputed-table storage optimization (Section 5.1) and the design of the pre- and postcomputation format conversion (Section 5.3) mitigate the performance impact of the extra memory and register file cost.
Besides, compared with the works using multiple threads per Montgomery multiplication [19,21], another vital factor is that we use the more efficient shuffle instruction for data sharing instead of shared memory and employ a more reasonable degree of thread parallelism, economizing the overhead of thread communication.
Discussion.
At the 2016 GPU Technology Conference, NVIDIA released its latest Tesla P100 accelerator based on the Pascal GP100, which brings a huge improvement in floating-point computing power, especially DPF. The Tesla P100 delivers 5.3 TFLOPS of DPF performance, 3.4 times that of our target platform [41]. Meanwhile, the number of registers per CUDA core is doubled [41], which is essential for DPF-based RSA implementation. Based on these improvements, the performance of the proposed DPF-based algorithm can be improved by at least three times, which will further widen the gap between the DPF-based and the traditional integer-based algorithms.

Conclusion
In this contribution, we propose a brand-new approach to implementing high-performance RSA cryptosystems on the latest CUDA GPUs by utilizing their powerful floating-point computing resources. Our results demonstrate that the floating-point computing resource is a more competitive candidate for asymmetric cryptography implementations on CUDA GPUs. On an NVIDIA GeForce GTX TITAN, our RSA-2048 decryption reaches a throughput of 42,211 operations per second, 13 times the performance of the previous floating-point-based implementation. For RSA-3072/4096 decryption, our peak throughputs are 12,151 and 5,790, and our RSA-4096 implementation is 1.23 times faster than the best integer-based result. We hope our endeavor can shed light on future research and inspire more case studies on GPUs using floating-point computing power, as well as on other asymmetric cryptography. We will apply these designs to related arithmetic, such as floating-point-based ECC implementation, and exploit the floating-point power of the Tesla P100.
(Table 8 notes: Yang et al. also report an RSA-2048 decryption latency of 6.5 ms (scaled: 6.8 ms) when the Batch Size is 1, at a throughput of 154. [2] Peak 2048-bit RSA throughput: Threads/RSA 4 × 2, window size 6, Max Reg. 127, Batch Size 14 × 64. [3] Peak 4096-bit RSA throughput: Threads/RSA 8 × 2, window size 6, Max Reg. 127, Batch Size 14 × 32.)

Disclosure
A preliminary version of this paper appeared under the title Exploiting the Floating-Point Computing Power of GPUs for RSA, in Proceedings of the 17th International Conference on Information Security (ISC 2014), Hong Kong, October 12-14, 2014 [42] (Best Student Paper Award). Dr. Yuan Zhao participated in this work while studying at the Chinese Academy of Sciences and now works at Huawei Technologies, Beijing, China.

Conflicts of Interest
The authors declare that they have no conflicts of interest.