Recently, various types of postquantum cryptography algorithms have been proposed for the National Institute of Standards and Technology’s Postquantum Cryptography Standardization competition. Many lattice-based schemes rely on the Learning with Errors (LWE) problem, whose core operation is matrix multiplication. Multiplying large matrices requires a long execution time for key generation, encryption, and decryption. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition with matrix transpose using ARM NEON instructions on ARM Cortex-A platforms. The proposed method achieves performance enhancements of 36.93%, 6.95%, 32.92%, and 7.66% for the four evaluated parameter sets. Applied to the Lizard.CCA key generation step, the optimized method enhances performance by 7.04%, 3.66%, 7.57%, and 9.32% over previous state-of-the-art implementations.

These days, with the development of quantum computing technologies, existing block ciphers face security threats from Grover’s algorithm [

The remainder of this paper is organized as follows. Section

In this section, we describe related studies on LWE problems and NIST PQC standardization.

Regev introduced the Learning with Errors (LWE) problem [

The United States National Institute of Standards and Technology (NIST) initiated postquantum cryptography (PQC) standardization in 2016. The submission deadline was November 30, 2017. A total of 69 postquantum cryptographic algorithms were submitted to NIST PQC standardization Round 1, including 26 lattice-based cryptographic algorithms (5 signatures, 21 KEM (key encapsulation mechanism)/encryption), 19 code-based cryptographic algorithms (3 signatures, 16 KEM/encryption), 9 multivariate-based algorithms (7 signatures, 2 KEM/encryption), 3 hash-based signature schemes, and 8 others (2 signatures, 6 KEM/encryption). Four algorithms have been withdrawn. According to the Round 1 submissions, lattice-based cryptography is the most frequently proposed type of postquantum cryptography. Most lattice-based cryptographic algorithms are based on the LWE problem, which offers both security in a quantum computing environment and efficient implementation. The first NIST PQC standardization conference was scheduled to take place on April 11-13, 2018. After this first conference, it will take about five to six years until the final decision for NIST PQC standardization is made. During PQC standardization, efficient implementation of the submitted postquantum cryptographic algorithms is an important issue.

Lizard [

ARM NEON is an advanced single instruction multiple data (SIMD) engine for the ARM Cortex-A series and Cortex-R52 processor [

ARM NEON register bank.

The ARM Cortex-A series is used for smartphones and some IoT devices, such as the Raspberry Pi series. For this reason, ARM NEON SIMD is used for high-performance multimedia processing and big-data processing in the Cortex-A series environment.

There are two ways to use ARM NEON. The first uses ARM NEON intrinsic functions, which map one-to-one to ARM NEON assembly instructions. The other uses ARM NEON assembly code directly. In this study, we used ARM NEON intrinsic functions for efficient development of the proposed method.

In 2012, Bernstein introduced implementation of a cryptographic algorithm using ARM NEON [

There are many research results on efficient implementation of lattice-based cryptography. Pöppelmann [

In this section, we describe our proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD.

First, we describe the problem on matrix multiplication and vector addition for lattice-based cryptography based on the LWE problem. For example, there are Matrix

Matrix multiplication and vector addition (existing method).
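As a point of reference, the element-by-element computation can be sketched in plain C. This is a minimal sketch under assumptions that the text does not state: the function name `matmul_add_ref`, the row-major layout, the 16-bit element type, and arithmetic modulo 2^16 (through `uint16_t` wraparound) are illustrative choices and are not taken from the Lizard reference code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical scalar baseline: B = A * S + E, one element at a time.
 * A is rows x inner, S is inner x cols, E and B are rows x cols,
 * all stored row-major. Results wrap modulo 2^16 (an assumption).
 */
void matmul_add_ref(uint16_t *B, const uint16_t *A, const uint16_t *S,
                    const uint16_t *E, size_t rows, size_t inner, size_t cols)
{
    assert(B != NULL && A != NULL && S != NULL && E != NULL);
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            uint16_t acc = E[i * cols + j];
            for (size_t k = 0; k < inner; k++) {
                /* Widen to 32 bits before multiplying to avoid signed
                   overflow from integer promotion, then truncate. */
                acc = (uint16_t)(acc + (uint32_t)A[i * inner + k] * S[k * cols + j]);
            }
            B[i * cols + j] = acc;
        }
    }
}
```

Every output element costs `inner` scalar multiplications and additions, which is why the running time grows quickly with the matrix dimensions.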

To address this problem, we propose an efficient implementation of matrix multiplication and vector addition using NEON in an ARM Cortex-A environment.

For efficient matrix multiplication and vector addition, we used ARM NEON intrinsic functions as shown in Table

ARM NEON intrinsic functions for the proposed method.

The NEON data load intrinsic function loads 128 bits of data from an array of sixteen 8-bit, eight 16-bit, or four 32-bit elements. Figure

NEON data load operation.

The NEON data store intrinsic function stores 128 bits of data into an array of sixteen 8-bit, eight 16-bit, or four 32-bit elements. Figure

NEON data store operation.

The NEON lane extraction intrinsic function copies the element selected by the lane number from a NEON vector to a register. Figure

NEON extracting lane from a vector to a register.

The NEON lane broadcast intrinsic function sets every lane of a NEON vector to the same value, as in Figure

NEON lane broadcast operation.
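The four operations described so far (load, store, lane extraction, and broadcast) can be combined in a few lines. The sketch below is illustrative only: the helper name `splat_lane3` and the choice of lane 3 are hypothetical, and a scalar fallback is included so the code also builds on machines without NEON. The intrinsics used are `vld1q_u16`, `vgetq_lane_u16`, `vdupq_n_u16`, and `vst1q_u16` from `arm_neon.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/*
 * Hypothetical helper: overwrite all eight 16-bit values in buf with the
 * value held in lane 3, and return that value.
 */
uint16_t splat_lane3(uint16_t buf[8])
{
    assert(buf != NULL);
#if HAVE_NEON
    uint16x8_t v = vld1q_u16(buf);       /* load 128 bits (8 x 16-bit)   */
    uint16_t x = vgetq_lane_u16(v, 3);   /* extract lane 3 to a register */
    uint16x8_t splat = vdupq_n_u16(x);   /* broadcast x to every lane    */
    vst1q_u16(buf, splat);               /* store 128 bits back          */
    return x;
#else
    /* Scalar fallback with the same observable behaviour. */
    uint16_t x = buf[3];
    for (int i = 0; i < 8; i++)
        buf[i] = x;
    return x;
#endif
}
```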

The NEON vector interleave intrinsic function interleaves the elements of two NEON registers, as in Figure

VZIP ARM NEON interleave operation.

Algorithm

After calculating the matrix index, it repeats the data load on NEON registers and the vector interleave between NEON registers until the matrix transpose is done for each
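To make the interleave-based transpose concrete, the sketch below transposes one 8x8 block of 16-bit values using only pairwise interleaves. It is a scalar model and not the authors' algorithm: `vec16x8` stands in for a NEON Q register, `zip16x8` mimics the VZIP interleave (on ARM, `vzipq_u16` would take its place), and the 8x8 block size is an assumption. Each pass interleaves row i with row i+4; three passes rotate the row and column index bits into each other's positions, which yields the transpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for a 128-bit Q register with eight 16-bit lanes. */
typedef struct { uint16_t lane[8]; } vec16x8;

/* Models VZIP: lo = a0 b0 a1 b1 a2 b2 a3 b3, hi = a4 b4 a5 b5 a6 b6 a7 b7. */
static void zip16x8(const vec16x8 *a, const vec16x8 *b,
                    vec16x8 *lo, vec16x8 *hi)
{
    vec16x8 l, h;
    for (int i = 0; i < 4; i++) {
        l.lane[2 * i]     = a->lane[i];
        l.lane[2 * i + 1] = b->lane[i];
        h.lane[2 * i]     = a->lane[i + 4];
        h.lane[2 * i + 1] = b->lane[i + 4];
    }
    *lo = l;
    *hi = h;
}

/* Transposes one 8x8 block in place using three interleave passes. */
void transpose8x8(uint16_t block[8][8])
{
    assert(block != NULL);
    vec16x8 cur[8], next[8];
    memcpy(cur, block, sizeof cur);            /* "load" the eight rows */
    for (int pass = 0; pass < 3; pass++) {
        for (int i = 0; i < 4; i++)
            zip16x8(&cur[i], &cur[i + 4], &next[2 * i], &next[2 * i + 1]);
        memcpy(cur, next, sizeof cur);
    }
    memcpy(block, cur, sizeof cur);            /* "store" the result */
}
```

A full matrix would be handled by applying this to each 8x8 block and writing the result to the mirrored block position, assuming both dimensions are multiples of 8.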

For matrix multiplication and vector addition in plain C, we have to multiply the matrices element by element and then add each element of the product to the corresponding vector element, so the execution time grows with the matrix size. However, if we use NEON vector multiplication and accumulation as in Figure

VMLA ARM NEON multiply accumulation operation.
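The following sketch shows one way a vector multiply-accumulate can replace the scalar inner loop. It is not the authors' algorithm: the names `mla_row` and `matmul_add_mla`, the row-major layout, and the requirement that the column count be a multiple of 8 are assumptions. On NEON targets it uses `vmlaq_u16`, which computes `a + b * c` in each 16-bit lane; elsewhere a scalar loop models the same eight lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* acc[0..n) += s * row[0..n), modulo 2^16; n must be a multiple of 8. */
void mla_row(uint16_t *acc, const uint16_t *row, uint16_t s, size_t n)
{
    assert(n % 8 == 0);
    for (size_t j = 0; j < n; j += 8) {
#if HAVE_NEON
        uint16x8_t a = vld1q_u16(acc + j);
        uint16x8_t r = vld1q_u16(row + j);
        a = vmlaq_u16(a, r, vdupq_n_u16(s));   /* a += r * s in all 8 lanes */
        vst1q_u16(acc + j, a);
#else
        for (size_t t = 0; t < 8; t++)         /* scalar model of 8 lanes */
            acc[j + t] = (uint16_t)(acc[j + t] + (uint32_t)row[j + t] * s);
#endif
    }
}

/* B = A * S + E with A rows x inner, S inner x cols, all row-major. */
void matmul_add_mla(uint16_t *B, const uint16_t *A, const uint16_t *S,
                    const uint16_t *E, size_t rows, size_t inner, size_t cols)
{
    memcpy(B, E, rows * cols * sizeof *B);     /* start from the addend E */
    for (size_t i = 0; i < rows; i++)
        for (size_t k = 0; k < inner; k++)
            mla_row(B + i * cols, S + k * cols, A[i * inner + k], cols);
}
```

Each call to `mla_row` updates eight output elements per multiply-accumulate, so the number of arithmetic instructions drops by roughly a factor of eight compared with the scalar loop.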

We propose an efficient matrix multiplication and accumulation method as in Algorithm

Having described the efficient matrix transpose and the efficient matrix multiplication with vector addition, we now combine them into a single procedure for LWE in lattice-based cryptography as in Algorithm

Figure

Proposed matrix multiplication and vector addition.

If a NEON SIMD register that holds the result of one operation is used as an operand of the very next operation, the two instructions have a data dependency. This dependency causes a Read After Write (RAW) data hazard, and the pipeline stalls for some clock cycles until the result becomes available. To avoid the data hazard and enhance performance, we scheduled the order in which the NEON registers are used. For an efficient NEON SIMD implementation, we made full use of the NEON Q registers (Q0-Q15).
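The idea of breaking a dependency chain can be illustrated with a scalar analogue. In the sketch below, four independent accumulators play the role of separate Q registers, so no multiply-accumulate needs the result of the one issued immediately before it. The function name `dot_u16_4acc`, the unroll factor of four, and the modulo 2^16 result are assumptions for illustration; the authors' actual register schedule is not shown in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: dot product modulo 2^16 with four independent
 * accumulators. n must be a multiple of 4.
 */
uint16_t dot_u16_4acc(const uint16_t *a, const uint16_t *b, size_t n)
{
    assert(n % 4 == 0);
    uint32_t acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    for (size_t i = 0; i < n; i += 4) {
        /* Each line writes a different accumulator, so consecutive
           operations do not read a value that was just written. */
        acc0 += (uint32_t)a[i]     * b[i];
        acc1 += (uint32_t)a[i + 1] * b[i + 1];
        acc2 += (uint32_t)a[i + 2] * b[i + 2];
        acc3 += (uint32_t)a[i + 3] * b[i + 3];
    }
    /* Combine once at the end; unsigned wraparound keeps this exact
       modulo 2^16. */
    return (uint16_t)(acc0 + acc1 + acc2 + acc3);
}
```

With a single accumulator, every iteration would have to wait for the previous sum. Spreading the work across several accumulators gives the pipeline independent operations to overlap.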

In this section, we describe the experimental environment, the performance measurement, and the evaluation of the proposed method. For an objective evaluation, we applied the proposed method to the Lizard.CCA key generation step, which uses the LWE problem for key generation.

Our experimental environment was a Raspberry Pi 3 Model B, which has a Broadcom BCM2837 chipset (1.2 GHz quad-core ARM Cortex-A53) and 1 GB of LPDDR2 memory. The operating system is Raspbian GNU/Linux 8.0 (Jessie). We used GCC compiler version 4.9.2 and the compile options

To evaluate our method, we measured the average execution time over 1,000 runs for each Lizard.CCA parameter set. For the Lizard.CCA CATEGORY5_N1088 and CATEGORY5_N1300 parameters, we could not measure the execution time. First, we measured and compared the performance of the proposed matrix transpose method and the normal C version as in Table

Matrix transpose performance (Unit: ms).

| n | m | ℓ | C version | Proposed (NEON) |
|---|---|---|---|---|
| 536 | 1024 | 256 | 364.2304 | 0.446443 |
| 663 | 1024 | 256 | 630.0066 | 0.707373 |
| 816 | 1024 | 384 | 970.4782 | 1.78282 |
| 952 | 1024 | 384 | 1172.607 | 2.078113 |
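Averaging over repeated runs, as described for these measurements, can be sketched as follows. This is a hypothetical harness, not the one used in the experiments: the names `average_ms` and `count_call` are invented, and the standard `clock()` function reports CPU time, whereas the timer actually used in the paper is not stated.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical harness: average CPU time, in milliseconds, of calling
 * fn(arg) `runs` times.
 */
double average_ms(void (*fn)(void *), void *arg, int runs)
{
    assert(fn != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; i++)
        fn(arg);
    clock_t end = clock();
    return (double)(end - start) * 1000.0 / CLOCKS_PER_SEC / runs;
}

/* Example workload: counts how many times it has been invoked. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

A call such as `average_ms(count_call, &calls, 1000)` mirrors the 1,000-run averaging. In practice the workload would be the transpose or multiplication routine under test.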

After we measured the proposed matrix transpose method, we measured the proposed matrix multiplication and vector addition. For an objective evaluation, we compared the performance of the proposed method with the C version from the matrix multiplication and vector addition part in the Lizard.CCA key generation step [

Matrix multiplication performance (unit: ms).

| n | m | ℓ | C version | Proposed (NEON) |
|---|---|---|---|---|
| 536 | 1024 | 256 | 148.8991 | 93.91285 |
| 663 | 1024 | 256 | 171.0976 | 159.2069 |
| 816 | 1024 | 384 | 334.7499 | 224.5633 |
| 952 | 1024 | 384 | 391.7564 | 361.7326 |
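The reported improvement percentages are consistent with the relative reduction in execution time. The formula below is an inferred definition, since the text does not state one, and the symbols $t_{\mathrm{C}}$ and $t_{\mathrm{proposed}}$ are introduced here for illustration. It is worked through for the first row of the matrix multiplication table (n = 536).

```latex
\text{improvement}
  = \frac{t_{\mathrm{C}} - t_{\mathrm{proposed}}}{t_{\mathrm{C}}} \times 100\%
  = \frac{148.8991 - 93.91285}{148.8991} \times 100\%
  \approx 36.93\%
```

Applying the same calculation to the remaining rows gives approximately 6.95%, 32.92%, and 7.66%.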

Our proposed methods performed better. Next we applied the proposed methods on the Lizard.CCA key generation step [

Lizard.CCA key generation performance (unit: ms).

| n | m | ℓ | Original | Proposed (NEON) |
|---|---|---|---|---|
| 536 | 1024 | 256 | 622.920 | 579.071 |
| 663 | 1024 | 256 | 736.950 | 709.942 |
| 816 | 1024 | 384 | 1075.164 | 993.760 |
| 952 | 1024 | 384 | 1239.633 | 1124.144 |

According to Tables

Nowadays, many postquantum cryptography systems are being developed to counter the security threats that quantum computing technologies pose to existing cryptosystems. NIST is working on postquantum cryptography standardization. A large part of the submissions to NIST’s PQC Standardization competition is lattice-based cryptography, and many lattice-based cryptographic algorithms are based on the LWE problem. LWE-based procedures need multiplication between very large matrices, but ordinary matrix multiplication processes the matrices element by element. For efficient matrix multiplication, we proposed matrix multiplication and vector addition with a matrix transpose using ARM NEON SIMD techniques. The proposed matrix multiplication and vector addition with matrix transpose improved performance for the four parameter sets by 36.93%, 6.95%, 32.92%, and 7.66%, respectively, and applying the proposed method to the Lizard.CCA key generation step improved performance by 7.04%, 3.66%, 7.57%, and 9.32%, respectively, over the original Lizard.CCA key generation step [

The source code of the proposed matrix transpose, multiplication, and vector addition implementation is available in a GitHub repository (https://github.com/pth5804/MatTrans_Mul_NEON_PQC).

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work of Taehwan Park and Howon Kim was supported by the Ministry of Trade, Industry, & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (no. 10073236). This work of Hwajeong Seo was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (no. NRF-2017R1C1B5075742). This work of Junsub Kim and Haeryong Park was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korean government (MSIP) (no. 2017-0-00616, development for lattice-based postquantum public key cryptographic scheme).