Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

1Pusan National University, School of Computer Science and Engineering, San-30, Jangjeon-Dong, Geumjeong-Gu, Busan 609-735, Republic of Korea 2Hansung University, IT Engineering, 116 Samseong-Yoro-16-Gil Seongbuk-gu Seoul 136-792, Republic of Korea 3Cryptographic Technical Team, Security Industry Division, Korea Internet Security Agency, 6F, 9, Jinheung-gil, Naju, Jeollanam-do, 58324, Republic of Korea


Introduction
In recent years, with the development of quantum computing technologies, security threats have emerged against existing cryptographic algorithms: block ciphers are weakened by Grover's algorithm [1], and public key algorithms such as RSA, which is based on the integer factorization and discrete logarithm problems, and ECC, which is based on the elliptic curve discrete logarithm problem, are broken by Shor's algorithm [2]. For this reason, many cryptographers are designing new cryptographic algorithms that remain secure in a quantum computing environment, such as lattice-based, multivariate-based, hash-based, code-based, and supersingular elliptic curve isogeny-based cryptography. At PQCrypto 2016, the National Institute of Standards and Technology (NIST) announced the Post-Quantum Cryptography Standardization competition. The submission deadline was November 30, 2017, the first standardization workshop was held on April 11, 2018, and many post-quantum cryptographic algorithms have been proposed. Lattice-based cryptography built on the Learning with Errors (LWE) problem uses matrix multiplication and vector addition for key generation, encryption, and decryption. However, for large matrices these operations dominate the running time, so a speed-optimized implementation of matrix multiplication and vector addition is needed. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition for LWE-based lattice cryptography using ARM NEON SIMD intrinsic functions.
The remainder of this paper is organized as follows. Section 2 discusses the literature related to the LWE problem, NIST PQC Standardization, the Lizard lattice-based cryptosystem, ARM NEON SIMD, and related studies on efficient implementation of lattice-based cryptography. We propose efficient ARM NEON optimized matrix multiplication and vector addition implementation methods in Section 3. Section 4 gives experimental and evaluation results for the proposed ARM NEON optimized matrix multiplication and vector addition and for Lizard.CCA key generation with the proposed method. Section 5 provides some final conclusions.

Related Studies
In this section, we describe related studies on LWE problems and NIST PQC standardization.
2.1. Learning with Errors (LWE) Problems. Regev introduced the Learning with Errors (LWE) problem [4]. For an n-dimensional secret vector s ∈ Z_q^n and an error distribution χ over Z_q, the LWE distribution A_{s,χ} over Z_q^n × Z_q is obtained by choosing a vector a uniformly at random from Z_q^n and an error e from χ and outputting the sample (a, b = ⟨a, s⟩ + e mod q). The search LWE problem is to find s ∈ Z_q^n given arbitrarily many independent samples (a_i, b_i) from A_{s,χ}. The hardness of the decision LWE problem is guaranteed by the worst-case hardness of standard lattice problems, such as the decision version of the shortest vector problem (GapSVP) and the shortest independent vectors problem (SIVP). Peikert et al. [5, 6] improved the reduction in the classical setting. Brakerski et al. [6] proved that the LWE problem with a binary secret is at least as hard as the original LWE problem, and Cheon et al. [7] proved the hardness of the LWE problem with a sparse secret. As a result of this line of work, the LWE problem is now widely used as a hardness assumption for lattice-based post-quantum cryptography. In lattice-based cryptography, errors (E) are used during the encryption and decryption procedures, and they are generated by random samplers such as the Gaussian sampler. These procedures multiply the matrix A by the secret matrix S and then add the error vector E. For example, Peikert [5] proposed a cryptosystem based on the LWE problem that is secure against any chosen-ciphertext attack, and Lin et al.
[8] proposed a key exchange scheme based on the LWE problem. Many lattice-based cryptosystems provide security in a quantum computing environment based on LWE problems. In the key generation step of Lizard, it first samples a secret vector s ∈ {−1, 0, 1}^n, a random matrix A ∈ Z_q^{m×n}, and an error vector e ← χ^m whose components are expected to be small. The secret key is written as sk ← s, and the public key is written as pk ← (A, b), where b = As + e ∈ Z_q^m. Hence, the public key is an instance of LWE with the secret vector s. There are six parameter sets of Lizard.CCA: CCA_CATEGORY1_N536, CCA_CATEGORY1_N663, CCA_CATEGORY3_N816, CCA_CATEGORY3_N952, CCA_CATEGORY5_N1088, and CCA_CATEGORY5_N1300. The parameter sets of Lizard.KEM are similar to those of Lizard.CCA. However, RLizard.CCA and RLizard.KEM have four parameter sets: RING_CATEGORY1, RING_CATEGORY3_N1024, RING_CATEGORY3_N2048, and RING_CATEGORY5. In this study, we applied the proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD to the Lizard.CCA key generation step and evaluated its performance in a practical application setting.

ARM NEON.
ARM NEON is an advanced single instruction multiple data (SIMD) engine for the ARM Cortex-A series and the Cortex-R52 processor [9]. It was introduced in the ARMv7-A and ARMv7-R profiles and is now also present as an extension in the ARMv8-A and ARMv8-R profiles. ARM NEON supports sixteen 128-bit Q registers (Q0-Q15). A Q register can be viewed as four 32-bit, eight 16-bit, or sixteen 8-bit data lanes. Each Q register can also be split into two 64-bit D registers, as in Figure 1.
The ARM Cortex-A series is used in smartphones and some IoT devices, such as the Raspberry Pi series. For this reason, ARM NEON SIMD is used for high-performance multimedia processing and big-data processing in Cortex-A environments.
There are two ways to use ARM NEON. The first uses ARM NEON intrinsic functions, which map one-to-one to ARM NEON assembly instructions. The other uses ARM NEON assembly code directly. In this study, we used ARM NEON intrinsic functions for efficient development of the proposed method.
In 2012, Bernstein introduced the implementation of cryptographic algorithms using ARM NEON [10]. Since then, there have been many studies on efficient implementation of cryptographic algorithms. Streit [11] proposed an efficient implementation of the NewHope post-quantum key exchange scheme using NEON in an ARMv8-A environment. Seo [12] proposed a high-performance implementation of SGCM in an ARM environment using NEON. Liu et al. [13] proposed an efficient Number Theoretic Transform (NTT) implementation using NEON for Ring-LWE software in a Cortex-A environment. Seo et al. [14] proposed a compact GCM implementation on a 32-bit ARMv7-A processor using NEON.

Related Studies on Efficient Implementation of Lattice-Based Cryptography.
There are many research results on efficient implementation of lattice-based cryptography. Pöppelmann [15] proposed an efficient implementation of Ring-LWE encryption on reconfigurable hardware and an 8-bit microcontroller, a software implementation of GLP on Intel/AMD CPUs, and BLISS in a Cortex-M4F environment. Nejatollahi et al. [16] surveyed trends and challenges for lattice-based cryptography software implementations; since the time complexity of matrix-to-vector multiplication is O(n^2), matrix multiplication must be implemented efficiently. Liu et al. [17] surveyed implementations of lattice-based cryptography on IoT devices and suggested that Ring-LWE-based cryptosystems will play an essential role in post-quantum edge computing and the post-quantum IoT environment. Liu et al. [18] proposed high-performance ideal lattice-based cryptography on an 8-bit AVR microcontroller: an efficient implementation of Ring-LWE encryption that is secure against timing side-channel attacks. Bos et al. [19] proposed CRYSTALS-Kyber, a module-lattice-based KEM that provides CCA security, together with an AVX2 implementation and its performance. McCarthy et al. [20] proposed a practical implementation of identity-based encryption over NTRU lattices on an Intel Core i7-6700 CPU, optimizing DLP-IBE and the Gaussian sampler. Yuan et al. [21] proposed a memory-constrained implementation of lattice-based encryption on a standard Java Card, optimizing Montgomery Modular Multiplication (MMM) and the Fast Fourier Transform (FFT) for the NTT. Oder et al. [22] proposed a practical CCA2-secure and masked Ring-LWE implementation in an ARM Cortex-M4F environment, implementing a masked PRNG (SHAKE-128) as a countermeasure against side-channel attacks. O'Sullivan et al. [23] reviewed the state of the art in efficient hardware and software designs for lattice-based cryptography.

Proposed Method
In this section, we describe our proposed method for efficient matrix multiplication and vector addition using ARM NEON SIMD.

Problem on Matrix Multiplication and Vector Addition
Implementation. First, we describe the problem with matrix multiplication and vector addition for lattice-based cryptography based on the LWE problem. For example, consider Matrix A (elements a_{i,j}), Matrix S (elements s_{i,j}), and Matrix E (elements e_{i,j}), with indices ranging over the matrix dimensions, as in Figure 2. To implement matrix multiplication and vector addition, we must multiply each element of a row of Matrix A with the corresponding element of a column of Matrix S, and after the multiplication, add each element of the product to the corresponding element of Matrix E. Because these procedures multiply and add element by element, the computation takes a long time.
To solve this problem and implement matrix multiplication and vector addition efficiently, we propose a method using NEON in an ARM Cortex-A environment.

Proposed Efficient Matrix Multiplication and Vector Addition.
For efficient matrix multiplication and vector addition, we used the ARM NEON intrinsic functions shown in Table 1. Using ARM NEON SIMD, we can process 128 bits of data with each instruction. ARM NEON supports vector interleave, vector multiply-accumulate, lane broadcast, and extraction of lanes from a vector into a register. For this reason, we propose performing the matrix multiplication after a matrix transpose implemented with the NEON intrinsic functions in Table 1. For an efficient matrix transpose, we used the NEON vector interleave function. We used vector multiply-accumulate, lane extraction into a register, and NEON lane broadcast for efficient matrix multiplication.
The NEON data load operation intrinsic function can load data (128-bit) from an 8/16/32-bit data array with a size of 16, 8, or 4. Figure 3 describes a 128-bit size NEON data load from a 16-bit×8 size data array using only the NEON data load intrinsic function.
The NEON data store operation intrinsic function can store data (128-bit) into an 8/16/32-bit data array with a size of 16, 8, or 4. Figure 4 describes a 128-bit NEON data store into a 16-bit×8 size data array using only the NEON data store intrinsic function. The NEON lane extraction operation extracts data from a NEON vector into a register according to the lane number. Figure 5 describes extracting lane number 2 from NEON vector a (16-bit×8 size) into an unsigned short 16-bit register r. The lane extraction operation can extract 8/16/32-bit data from the NEON vector. This intrinsic function is used to accumulate data and store it into a register during the matrix multiplication procedure; the details of its usage are described in Algorithm 2.
The NEON lane broadcast intrinsic function sets all the lanes of a NEON vector to the same value, as in Figure 6. This intrinsic function is used to initialize the accumulation NEON vector to zero during the matrix multiplication procedure; the details of its usage are described in Algorithm 2.
The NEON vector interleave function interleaves 2 NEON registers, as in Figure 7. The result of the interleave is stored in a NEON register array (of size 2, i.e., two 128-bit values). If we implemented the matrix transpose in C, we would have to exchange individual elements of the matrix. However, using the NEON vector interleave, we can exchange 128 bits of data with each instruction. This intrinsic function is used for element transposition during the matrix transpose procedure in Algorithm 1.
Algorithm 1 describes the matrix transpose method using NEON for efficient matrix multiplication. In Algorithm 1, lines 2 to 5 remap matrix indices that fall outside the BLOCK_TRANSPOSE-aligned region to indices inside it for the NEON SIMD matrix transpose. At that point, the matrix row index is offset by (BLOCK_TRANSPOSE − N % BLOCK_TRANSPOSE) and the matrix column index by (BLOCK_TRANSPOSE − L % BLOCK_TRANSPOSE).
After calculating the matrix indices, lines 7 to 56 repeatedly load data into NEON registers and interleave pairs of registers until the transpose of each BLOCK_TRANSPOSE tile is complete. In Algorithm 1, we assume each matrix element is 16-bit data, so BLOCK_TRANSPOSE is 8, because each NEON register is 128 bits (16-bit × 8 lanes). After transposing each tile, the NEON register data are stored to the transposed matrix array.
For matrix multiplication and vector addition in plain C, we must multiply element by element across the matrices and then add each element of the product and the vector, and the execution time grows quickly with the matrix size. However, using NEON vector multiply-accumulate as in Figure 8, we can process 128 bits of data with each NEON instruction, which accelerates matrix multiplication and vector addition.
We propose an efficient matrix multiplication and accumulation method based on ARM NEON SIMD, as in Algorithm 2, which is executed after the matrix transpose. In Algorithm 2, LANE_SHORT_NUM has the same value as BLOCK_TRANSPOSE in Algorithm 1. Line 3 sets the NEON register sum_vect to 16-bit zeros using the NEON intrinsic (vdupq) that broadcasts the same value to every lane. Lines 4 to 7 load data from matrix A and matrix S into NEON registers according to each matrix index, then multiply and accumulate the NEON registers for matrix multiplication and vector addition over N / LANE_SHORT_NUM iterations. Lines 8 and 9 store the NEON register value into an array (16-bit data, size 8), accumulate the lane values into a register, and add the result to matrix E according to the matrix index. Lines 10 to 12 handle matrix multiplication and vector addition for the elements beyond the largest multiple of the NEON register lane size; if the row and column sizes of the matrix are multiples of the lane size, this part does nothing. Using Algorithm 2, we calculate matrix multiplication and vector addition with NEON, and for element positions beyond the lane-aligned region, we fall back to normal matrix multiplication and vector addition in C.
As described above, we have proposed an efficient matrix transpose, matrix multiplication, and vector addition. We now combine them into an efficient procedure for LWE-based lattice cryptography, as in Algorithm 3: we transpose matrix S using Algorithm 1 and calculate matrix multiplication and vector (matrix E) addition using Algorithm 2.
Figure 9 describes Algorithm 3 as a block diagram. In Figure 9, the dark blue and dark red parts are calculated using NEON SIMD multiply-accumulate for matrix multiplication and vector addition; Matrix S is transposed by the NEON-based matrix transpose of Algorithm 1. The light blue and light red parts lie beyond the largest lane-aligned multiple of the row or column size and are calculated using the normal C method for matrix multiplication and vector addition.
If a NEON SIMD register holding the result of one operation is reused as an operand of the next operation, there is a data dependency, and this dependency causes a read-after-write (RAW) hazard (a pipeline stall) that costs extra clock cycles to reload the just-produced result. To avoid this hazard and enhance performance, we scheduled the order in which NEON registers are used. For an efficient NEON SIMD implementation, we made full use of the NEON Q registers (Q0-Q15).

Evaluation.
To evaluate our method, we measured the average execution time over 1,000 runs for each Lizard.CCA parameter set. For the Lizard.CCA CCA_CATEGORY5_N1088 and CCA_CATEGORY5_N1300 parameters, we could not measure the execution time. First, we measured and compared the performance of the proposed matrix transpose and the plain C version, as in Table 2. The proposed matrix transpose performed better than the C version (with GCC autovectorization), which was slower because it contains conditional branches such as 'while' and 'if' statements. Next, we measured the proposed matrix multiplication and vector addition. For an objective evaluation, we compared its performance against the matrix multiplication and vector addition part of the Lizard.CCA key generation step [3] for each Lizard.CCA parameter set. The C version from Lizard.CCA [3], submitted to NIST PQC Standardization round 1, is a plain C matrix multiplication using pointers. The proposed matrix multiplication and vector addition includes the matrix transpose. Table 3 compares the C version [3] (with GCC autovectorization) and the proposed method: the proposed method improved performance on the measured parameter sets by 36.93%, 6.95%, 32.92%, and 7.66%, respectively. We then applied the proposed methods to the Lizard.CCA key generation step [3] for an objective evaluation. Table 4 compares the original key generation step [3] with the proposed method: key generation with the proposed methods improved performance by 7.04%, 3.66%, 7.57%, and 9.32%, respectively, over the original Lizard.CCA key generation step [3].
According to Tables 3 and 4, the proposed methods for efficient matrix multiplication improved performance. However, in the case of the N663 parameter set, the performance gain was lower than for the others because N = 663 leaves a remainder of 7 (663 = 8 × 82 + 7), so matrix multiplication for the elements at positions 656 to 663 had to be done with the normal method.

Conclusion
Lattice-based cryptography is built on the LWE problem, and LWE-based procedures need matrix multiplication between large matrices, whereas normal matrix multiplication computes element by element. For efficient matrix multiplication, we proposed matrix multiplication and vector addition with a matrix transpose using ARM NEON SIMD techniques. The proposed matrix multiplication and vector addition with matrix transpose improved performance for each parameter set by 36.93%, 6.95%, 32.92%, and 7.66%, respectively, and the Lizard.CCA key generation step with the proposed methods improved by 7.04%, 3.66%, 7.57%, and 9.32%, respectively, over the original Lizard.CCA key generation step [3]. In the future, research on efficient matrix multiplication for elements beyond the NEON register lane boundary is needed for further efficiency and a fully NEON method. We will research efficient implementation of matrix multiplication and vector addition for lattice-based cryptography using full NEON SIMD for arbitrary parameters, mixing ARM NEON and ARM assembly instructions, and AVX2 SIMD in an Intel x64 environment.

4.1. Experiment. Our experimental environment was a Raspberry Pi 3 Model B, which has a Broadcom BCM2837 chipset (1.2 GHz quad-core ARM Cortex-A53).

Figure 9: Proposed matrix multiplication and vector addition.

Table 1: ARM NEON intrinsic functions for the proposed method.