Fuzzy Matching Template Attacks on Multivariate Cryptography: A Case Study

Multivariate cryptography is one of the most promising candidates for post-quantum cryptography. In this paper, we apply machine learning techniques to experimentally investigate the side-channel security of multivariate cryptosystems, since side-channel attacks seriously threaten hardware implementations of cryptographic systems. Generally, registers are required to store the values of monomials and polynomials during the encryption of multivariate cryptosystems. Based on maximum-likelihood and fuzzy matching techniques, we propose a template-based least-square technique to efficiently exploit the side-channel leakage of registers. Using QUAD, a typical multivariate cryptosystem with provable security, as a case study, we perform our attack against both serial and parallel QUAD implementations on a field programmable gate array (FPGA). Experimental results show that our attacks on the serial and parallel implementations require only about 30 and 150 power traces, respectively, to reveal the secret key with a success rate close to 100%. Finally, efficient and low-cost strategies are proposed to resist side-channel attacks.


Introduction
With the advent of quantum computers, traditional cryptosystems face huge challenges. Public-key cryptosystems such as Rivest-Shamir-Adleman (RSA) and elliptic curve cryptography (ECC), whose security relies on the difficulty of certain number theoretic problems, are under great threat of quantum attack. As early as 1994, Peter Shor proposed a quantum algorithm that solves such number theoretic problems in polynomial time. Afterward, Monz et al. [1] presented the realization of a scalable Shor algorithm in 2016, which means that once large-scale quantum computers appear, these public-key cryptosystems will become insecure. Meanwhile, for symmetric-key primitives, larger keys are required to resist quantum attacks to some extent.
Since Shor's discovery, the theory of post-quantum cryptography has developed significantly. Many cryptographic schemes proposed in the literature, such as code-based cryptography [2] and lattice-based cryptography [3], show great potential to resist quantum attacks, while multivariate cryptography is one of the most promising candidates [4]. Afterward, numerous cryptosystems based on multivariate quadratic polynomials have been proposed, such as unbalanced oil-and-vinegar (UOV) and its variant [5], Rainbow [5,6], ZHFE [7], and CHNN-MVC [8].
At Eurocrypt 2006, Berbain et al. [9] presented the first multivariate stream cipher scheme, denoted as QUAD, which is referred to as a practical and provably secure stream cipher as well as a pseudorandom number generator (PRNG). In 2009, Berbain et al. [10] revisited the QUAD stream cipher and proposed provable security arguments supporting its conjectured strength for suitable parameter values.
The provable security of QUAD relies on the hardness of solving systems of multivariate quadratic equations. Bardet et al. [11] presented a cryptanalysis algorithm with a complexity bounded by $O(2^{134.56})$, which means this cryptanalysis method cannot be put into practice.
In recent years, GPUs have been widely used in cloud computing and blockchain, which face huge security challenges in guaranteeing data security and user privacy [12][13][14]. Several GPU acceleration schemes for multivariate systems have been proposed to make them suitable for securing cloud computing and blockchain in the quantum world [15,16]. In 2014, Tanaka et al. [15] proposed two efficient parallelization algorithms and a GPU-based multivariate quadratic polynomial system. Furthermore, they proposed several effective parallel implementations of QUAD on GPU to accelerate the computation of quadratic polynomials. In 2018, Liao et al. [16] proposed a GPU acceleration framework for high-order multivariate cryptography systems, where the GPU acceleration schemes made multivariate cryptosystems feasible for cloud computing and blockchain.
Moreover, multivariate cryptosystems are in general computationally efficient, which supports their use in Internet of Things (IoT) devices. IoT is essentially a network of pervasive devices such as RFID tags, sensors, ASICs, and smart cards, which have rigid cost constraints in terms of area, memory, computing power, and battery supply. Traditional cryptosystems are not entirely applicable to IoT devices since they are too expensive for such pervasive devices. At Fast Software Encryption (FSE) 2010, Billet et al. [17] showed that QUAD can be converted to efficiently construct a privacy-preserving authentication protocol for RFID with provable security. Arditti et al. [18] presented a QUAD implementation and regarded it as the smallest provably secure stream cipher so far. The smallest QUAD implementation requires only 2961 GE, which makes it a competitive candidate for IoT security. Also, in 2015, Hamlet et al. [19] proposed a throughput-optimized parallel implementation of QUAD for more secure application scenarios.
The implementation of cryptography needs to take a wide range of physical attacks into account, especially side-channel attacks and fault attacks. Side-channel attacks exploit the dependency between physical information (e.g., power consumption, electromagnetic leaks, and timing information) and the secret key to enable a divide-and-conquer attack that reveals the key part by part. Typical side-channel attacks include nonprofiled attacks (e.g., correlation power analysis (CPA) [20] and mutual information analysis (MIA) [21]) and profiled attacks (e.g., template attacks (TA) [22][23][24][25][26][27][28] and other machine learning-based side-channel attacks [28][29][30][31][32][33][34]). Profiled side-channel attacks are the most powerful attacks and have received a lot of attention in recent years. Samples of power traces are regarded as features, and feature selection methods are needed to reduce the computational complexity and increase the prediction accuracy [28]. Machine learning techniques including the maximum-likelihood strategy [22][23][24][25][26][27][28], support vector machines (SVM) [28][29][30], random forest (RF) [28,29], k-nearest neighbors (KNN) [31], neural networks (NNs) [32], and deep learning (DL) [33,34] are then widely applied to build the prediction model. Profiled side-channel attacks include a profiling/training phase and a matching/predicting phase. In the profiling/training phase, machine learning algorithms are fed with labelled power traces captured from a reference device to build the prediction model. In the matching/predicting phase, prediction models are used to predict the correct labels for power traces captured from a target device.
The template attack was first proposed at CHES'02 [22]; it efficiently revealed the key by a maximum-likelihood strategy and was rapidly accepted as the strongest form of side-channel attack. The original template attack matched only a single power trace, which sometimes failed in practical attacks. Agrawal et al. [23] proposed the template-based DPA attack to accumulate the matching results of power traces, which significantly improved the success rate. Ozgen et al. [24] combined classification algorithms with template attacks in the matching phase to improve the efficiency of attacks. Choudary and Kuhn [25] tackled some of the practical obstacles of template attacks, such as pooled covariance matrices, compression methods, and incompatibility of templates across different devices. Zhang [27] theoretically analyzed the exact relationship between the success rate of the template attack and the values of different parameters, including the signal-to-noise ratio, the number of interesting points, and the number of power traces. From the viewpoint of machine learning, Picek et al. [28] adopted feature selection techniques to improve the attack efficiency. They concluded that L1 regularization wrapper and linear SVM hybrid methods performed consistently well for all data sets.
Although side-channel attacks have been developed for over 20 years, research on side-channel attacks against multivariate cryptosystems is still in its early stages. Several studies on side-channel attacks on multivariate cryptosystems have been published. As early as 2005, Okeya et al. [35] analyzed the power leakage of addition operations modulo $2^{32}$ of SHA-1 and successfully recovered the secret information of SFLASH, which is the first successful power analysis attack on multivariate cryptography in practice. Later, in 2013, Hashimoto et al. [36] proposed a theoretical method based on fault attacks to reveal part of the key of MPKC systems. Yi and Li [37] proposed a fault attack and DPA on an ASIC implementation of the enTTS scheme in 2017. In 2018, Park et al. [38] presented a correlation power analysis attack against the Rainbow and UOV schemes on an 8-bit AVR microcontroller that yields full secret key recoveries. In 2019, based on the work of Hashimoto et al., Krämer and Loiero [39] complemented the research on fault attacks on multivariate signature schemes. However, their attacks do not lead to complete key recovery on Rainbow and UOV. Recently, Li et al. [40] proposed a CPA attack against a serial implementation of QUAD on FPGA. Their work efficiently revealed the secret key but still requires further work to improve the success rate.
Li et al. proposed a practical CPA cryptanalysis of serial QUAD (2, 160, 160) with a much lower complexity, but the success rate is only around 85%. Because of the low signal-to-noise ratio, the classic template attack and the template-based DPA attack cannot match the templates exactly enough to achieve a satisfactory success rate. To tackle this issue, we propose a template-based least-square power analysis on serial QUAD (2, 160, 160). The main contributions of our paper can be highlighted as follows:
(1) We apply the least-square technique to enable fuzzy matching of the templates, which finds the best match by minimizing the squared sum of errors. As a result, the proposed practical attack achieves a success rate of nearly 100%.
(2) We also extend the template-based least-square power analysis attack to explore the leakage of the parallel implementation of QUAD (2, 160, 160), and it successfully and efficiently reveals the secret key with a success rate also close to 100%.
(3) For multivariate cryptography, all monomials and polynomials can be computed in an arbitrary order to break the link between the power consumption and the secret key. We propose two low-cost hiding countermeasures for serial and parallel implementations, respectively, which show great potential to resist side-channel attacks.
The remainder of the paper is organized as follows: in Section 2, we review the mathematical definition and the serial and parallel FPGA implementations of the QUAD stream cipher; in Section 3, the template-based least-square power analysis attacks on the serial and parallel FPGA implementations of QUAD are presented; experimental results of our attacks are given in Section 4; efficient and low-cost countermeasures to resist side-channel attacks are discussed in Section 5; and Section 6 concludes the paper.

Multivariate Cryptography.
Generally, a multivariate quadratic equation in n variables over GF(q) can be written as follows:

$$Q(x_1, \ldots, x_n) = \sum_{1 \le i \le j \le n} \alpha_{ij} x_i x_j + \sum_{1 \le i \le n} \beta_i x_i + c, \tag{1}$$

where $\alpha_{ij}$, $\beta_i$, and $c$ are all coefficients over GF(q). Note that the degree of the polynomial is at most 2; otherwise, new variables are introduced to keep the polynomial of degree 2. A multivariate quadratic system Q(X) consisting of m multivariate quadratic equations in n variables over GF(q) is defined as

$$Q(X) = (Q_1(X), \ldots, Q_m(X)). \tag{2}$$

Given a multivariate quadratic system Q(X), the MQ problem is to find a value $X = (x_1, \ldots, x_n)$, if any, such that $Q_i(X) = 0$ for all $1 \le i \le m$. The MQ problem is proved to be NP-hard, even over the smallest finite field GF(2) [10].
A particular QUAD stream cipher in n variables over GF(q) is specified as QUAD (q, n, r), which computes n + r polynomials per round. As shown in Figure 1, QUAD (q, n, r) consists of an output function $S_{out}(X) = (Q_{n+1}(X), \ldots, Q_{n+r}(X))$, which produces r outputs as the keystream, and an update function $S_{in}(X) = (Q_1(X), \ldots, Q_n(X))$, which generates n outputs to update X for the next round. The parameters q, n, and r and the coefficients $\alpha_{ij}$, $\beta_i$, and $c$ for $S_{in}$ and $S_{out}$ are public. The QUAD cipher expands a secret initial state $X_0 \in GF(q)^n$ into a sequence of secret states $X_0, X_1, X_2, \ldots \in GF(q)^n$ and a sequence of output vectors. QUAD (2, 160, 160) is a practical version with a security level of at least $2^{80}$, which is strongly recommended in [10]. QUAD (2, 160, 160) has 160 variables over GF(2) and outputs 160 bits per round, resulting in a set of 320 multivariate quadratic equations.
From an implementation perspective, operations over GF(2) are more efficient than those over larger fields. Moreover, the monomial forms $x_i \cdot x_i$ and $x_i$ are equal over GF(2); therefore, $\alpha_{ij} x_i x_j$ and $\beta_i x_i$ can be computed together. In the case of randomly generated $\alpha_{ij}$ and $c$, the equations of QUAD over GF(2) can be simplified as

$$Q(x_1, \ldots, x_n) = \sum_{1 \le i \le j \le n} \alpha_{ij} x_i x_j + c, \tag{3}$$

which brings great benefits in terms of efficiency and security.
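To make the round structure concrete, the following minimal sketch evaluates the simplified quadratic form over GF(2) and iterates a toy QUAD (2, 8, 8). The parameter sizes, the random coefficient generation, and all function names are illustrative assumptions rather than the reference design; addition over GF(2) is written as XOR.

```python
import random

def eval_quadratic(alpha, c, x):
    """Evaluate Q(x) = XOR over i <= j of alpha[i][j]*x[i]*x[j], XOR c, over GF(2)."""
    n = len(x)
    acc = c
    for i in range(n):
        if x[i] == 0:
            continue  # every monomial containing x_i vanishes
        for j in range(i, n):
            acc ^= alpha[i][j] & x[j]
    return acc

def random_system(n, m, rng):
    """Generate m public quadratic polynomials in n variables (toy coefficients)."""
    system = []
    for _ in range(m):
        alpha = [[rng.randint(0, 1) if j >= i else 0 for j in range(n)]
                 for i in range(n)]
        system.append((alpha, rng.randint(0, 1)))
    return system

def quad_round(system, x, n):
    """One round: the first n polynomials update the state, the rest are keystream."""
    values = [eval_quadratic(alpha, c, x) for alpha, c in system]
    return values[:n], values[n:]

rng = random.Random(0)
n, r = 8, 8                      # toy sizes; the paper studies n = r = 160
system = random_system(n, n + r, rng)
state = [rng.randint(0, 1) for _ in range(n)]   # secret initial state X_0
keystream = []
for _ in range(3):
    state, out = quad_round(system, state, n)
    keystream.extend(out)
```

Each round consumes the current secret state and produces both the next state and r keystream bits, which is why only n + r polynomial evaluations share the same key.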

FPGA Serial Implementation of QUAD.
Arditti et al. [18] proposed a compact serial implementation of QUAD, which is believed to be the smallest provably secure stream cipher. As shown in Figure 2, the implementation consists of two main components. The first one is a nonlinear feedback shift register (NFSR), in which the coefficients $\alpha$ and $c$ are randomly generated. Each monomial of the equation is computed by the second component at every clock tick and accumulated into a result register. Multivariate quadratic equations $Q_1(X), Q_2(X), \ldots, Q_{n+r}(X)$ are computed sequentially. At every clock tick, the NFSR generates the coefficient. Once a new monomial $\alpha_{ij} x_i x_j$ of polynomial $Q_k(X)$ is computed, its contribution is accumulated into the temporary register $Q_k$. After $n(n + 1)/2 + 1$ clock cycles, the polynomial $Q_k(X)$ is computed, and the above process is repeated for $Q_{k+1}(X)$.
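The serial accumulation can be sketched in software as follows. This is only a behavioural model, not the hardware design of [18]; in particular, attributing the extra clock cycle to the constant term $c$ is an assumption made here to match the count $n(n + 1)/2 + 1$.

```python
def serial_polynomial(alpha, c, x):
    """Accumulate one polynomial monomial by monomial.

    Returns (value of Q_k, number of modelled clock cycles).
    """
    n = len(x)
    q = 0          # temporary register Q_k
    cycles = 0
    for i in range(n):
        for j in range(i, n):
            q ^= alpha[i][j] & x[i] & x[j]   # one monomial per clock tick
            cycles += 1
    q ^= c         # assumed: the final cycle adds the constant term
    cycles += 1
    return q, cycles

# Toy example with n = 4: 4*5/2 + 1 = 11 cycles
alpha = [[1 if j >= i else 0 for j in range(4)] for i in range(4)]
value, cycles = serial_polynomial(alpha, 0, [1, 1, 0, 1])
```

For n = 160 the same count gives 160 × 161 / 2 + 1 = 12881 cycles per polynomial.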

FPGA Parallel Implementation of QUAD.
Hamlet and Brocato [19] presented two throughput-optimized parallel implementations of QUAD for a much higher throughput. A QUAD (2, 128, 128) version with a security level of approximately $2^{64}$ is considered, which can be easily extended to other versions over GF(2) such as QUAD (2, 160, 160). The coefficients $\alpha$ and $c$ are randomly generated and stored in ROM. Multivariate quadratic equations $Q_1(X), Q_2(X), \ldots, Q_{n+r}(X)$ are still computed sequentially, while n monomials of polynomial $Q_k(X)$ are computed in parallel at a time to achieve a higher throughput.
modules. Compute $Q_k(X) = Q_k(X) \oplus \bigoplus_{1 \le i \le 10} V_i$ and store the value in result register Q.
(6) Rotate the internal state rotated X by one bit, and go to step 3 to compute the next 160 monomials until all monomials of polynomial $Q_k(X)$ are completed, which requires $\lceil (n + 1)/2 \rceil = 81$ loops. Note that, in the last loop, only half of the above modules are enabled to compute the last 80 monomials.
(7) Repeat the above steps until all quadratic equations are computed.
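As a quick consistency check of the loop count for n = 160, under the reading that each full loop processes 160 monomials and the last loop only 80:

$$\left\lceil \frac{n + 1}{2} \right\rceil = \left\lceil \frac{161}{2} \right\rceil = 81, \qquad 80 \times 160 + 80 = 12880 = \frac{n(n + 1)}{2}.$$

Thus 80 full loops plus one half loop cover exactly the $n(n + 1)/2$ monomials of one polynomial.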

Power Leakage Model.
A typical CMOS transistor consumes dynamic power when its output signal switches. Figure 4(a) shows the charging process of a register when the output signal switches from 0 to 1. A charging current from the power supply to the output capacitance $C_L$ and a transient short-circuit current through the CMOS transistor are generated. In contrast, Figure 4(b) shows the discharging process when the output signal switches from 1 to 0. Only the instantaneous short-circuit current is generated through the CMOS transistor. As a result, we focus on transitions of the output signal, since dynamic power is the major power consumption of the digital logic circuits of ASICs and FPGAs. Denote the power consumption of a CMOS transistor by $P_{ij}$ when its signal switches from i to j, where i and j are 0 or 1. $P_{01}$ and $P_{10}$ consume dynamic power, while $P_{00}$ and $P_{11}$ consume only static power. As a result, it generally holds that $P_{00} \approx P_{11} \ll P_{01}, P_{10}$. Therefore, the power consumption when writing data to a register depends on the number of bit-flips. A Hamming distance (HD) model summarizes well the power consumption of a register transition from a previous state to a new state.
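The HD model can be expressed in a few lines. The sketch below is illustrative only; the per-bit costs `p_flip` and `p_static` and the register `width` are hypothetical parameters standing in for $P_{01}, P_{10}$ and $P_{00}, P_{11}$.

```python
def hamming_distance(prev_state, new_state):
    """Number of bits that flip when a register goes from prev_state to new_state."""
    return bin(prev_state ^ new_state).count("1")

def register_write_power(prev_state, new_state, p_flip=1.0, p_static=0.0, width=8):
    """Toy power estimate: flipping bits cost p_flip, unchanged bits cost p_static."""
    flips = hamming_distance(prev_state, new_state)
    return flips * p_flip + (width - flips) * p_static

# Writing 0b0110 over 0b1010 flips two bits
flips = hamming_distance(0b1010, 0b0110)
```

With `p_static` much smaller than `p_flip`, the estimate is dominated by the number of bit-flips, which is the property the attack exploits.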
Multivariate cryptosystems consist of a large number of monomials and polynomials, so registers are required to store monomial and polynomial values during the encryption. In the serial implementation, for instance, monomials are computed sequentially and accumulated into the temporary register $Q_k$, as identified by the rectangle in Figure 2.
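The value of register $Q_k$ changes to $Q_k \oplus \alpha_{ij} x_i x_j$ for all monomials. Under the HD model, the power consumption of register $Q_k$ can be summarized as follows:

$$HD(Q_k, Q_k \oplus \alpha_{ij} x_i x_j) = HW(\alpha_{ij} x_i x_j) = \alpha_{ij} x_i x_j. \tag{4}$$

Consequently, an attacker can predict the secret key bits $x_i$ and $x_j$ by observing the power consumption of the registers $Q_k$.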
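The observation that the register flips exactly when the accumulated monomial equals 1 can be checked exhaustively. The sketch below also generates toy power samples; the linear gain and Gaussian noise are assumptions chosen for illustration, not a measured noise model.

```python
import random

def register_transition_hd(q_old, alpha_ij, x_i, x_j):
    """Hamming distance of the 1-bit register Q_k for one monomial update."""
    q_new = q_old ^ (alpha_ij & x_i & x_j)
    return q_old ^ q_new  # 1 if the register bit flips, else 0

def simulated_sample(hd, rng, gain=1.0, sigma=0.5):
    """Toy power sample: leakage proportional to HD plus Gaussian noise (assumed)."""
    return gain * hd + rng.gauss(0.0, sigma)

# The leakage does not depend on the previous register value
leak_table = {
    (q, a, xi, xj): register_transition_hd(q, a, xi, xj)
    for q in (0, 1) for a in (0, 1) for xi in (0, 1) for xj in (0, 1)
}

rng = random.Random(1)
samples = [simulated_sample(register_transition_hd(0, 1, 1, 1), rng) for _ in range(5)]
```

Because the coefficient is public, a leakage value of 1 observed when the coefficient is 1 implies that both key bits involved are 1.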
Unlike serial implementations, parallel implementations compute 160 monomials simultaneously. Four monomials are accumulated by an AND-XOR module and stored into a temporary register $P_c$. According to the parallel implementation described in Section 2.3, when the first 160 monomials are computed, the values $M_c$, $1 \le c \le 40$, are stored into the temporary registers $P_c$. After the internal state rotated x is rotated by one bit to compute the next 160 monomials, the values $M'_c$, $1 \le c \le 40$, are stored into the registers $P_c$, where the case $c = 40$ is

$$M'_{40} = \alpha_{157,158} x_{157} x_{158} \oplus \alpha_{158,159} x_{158} x_{159} \oplus \alpha_{159,160} x_{159} x_{160} \oplus \alpha_{1,160} x_{160} x_{1}.$$
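As an illustration of the AND-XOR accumulation, the sketch below computes the value stored for $c = 40$ after the first rotation. The dictionary-based coefficient lookup and the function names are hypothetical conveniences, not part of the design in [19].

```python
def and_xor_module(coeffs, left_bits, right_bits):
    """XOR of four coefficient-masked products, as stored into a register P_c."""
    acc = 0
    for a, u, v in zip(coeffs, left_bits, right_bits):
        acc ^= a & u & v
    return acc

def m_prime_40(alpha, x):
    """Value M'_40; alpha maps 1-based index pairs to bits, x[i - 1] holds x_i."""
    pairs = [(157, 158), (158, 159), (159, 160), (1, 160)]
    coeffs = [alpha[p] for p in pairs]
    left = [x[i - 1] for i, _ in pairs]
    right = [x[j - 1] for _, j in pairs]
    return and_xor_module(coeffs, left, right)

alpha = {(157, 158): 1, (158, 159): 1, (159, 160): 1, (1, 160): 0}
x_all_ones = [1] * 160
x_last_zero = [1] * 159 + [0]     # x_160 = 0 removes the last two monomials
```

The stored bit therefore depends on several key bits at once, which is why the attack on the parallel implementation has to test multi-bit key hypotheses.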

Template-Based Least-Square Power Analysis Attack.
Classic side-channel attacks, such as DPA, CPA, and MIA, require a large number of power traces to reveal the key, which means that different plaintexts need to be encrypted with the same key to obtain as many power traces as possible. However, multivariate cryptosystems usually contain a limited number of quadratic equations. Taking QUAD as an example, the key is updated after each round of encryption, so only n + r power traces are generated with the same key. In this case, machine learning-based side-channel attacks such as the template attack have inherent advantages, since they can extract the key with far fewer target power traces.
Machine learning-based side-channel attacks are the most powerful attacks. Based on a maximum-likelihood strategy, template attacks, which consist of a profiling phase and a matching phase, reveal the secret key efficiently. Classic template attacks match only a single power trace and reveal the key by the Bayes theorem in the matching phase. However, a single power trace does not contain enough valuable information to reveal the correct key in practical situations; hence, the classic template attack has little prospect of success.
To solve this problem, the template-based DPA attack was proposed [23], which accumulates the matching degree of each power trace during template matching to improve the success rate. Unfortunately, due to the low signal-to-noise ratio, the exact matching method of the template-based DPA attack is not applicable. For this reason, we propose a template-based least-square (LSQ) power analysis attack, which reveals the key by fuzzy matching. As described in Figure 5, the main idea of template-based LSQ is as follows:
(1) Choose a strategy to build templates according to the power leakage models in equations (4) and (7).
(4) Build templates with interesting points: two templates $h_i = (m_i, C_i)$ corresponding to leakage values 0 and 1 are built, respectively, where $m_i$ is the mean vector and $C_i$ is the covariance matrix of the power traces with leakage value i at the interesting points.
(5) Match templates: power traces $T = \{t_1, t_2, \ldots, t_D\}$ with the same key are captured from the device under attack and matched against the templates, respectively. The template that leads to the highest probability $p(t_j; (m_i, C_i))$ indicates the correct leakage value.
A least-square function $F(i)$, the squared sum of errors, is then applied to compare the hypothetical leakage values with the leakage values revealed by the template attack, where i is the key hypothesis. Finally, the correct key is revealed by $key = \arg\min_i F(i)$.
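The template building and matching steps can be sketched as follows. To stay within the Python standard library, this example assumes a diagonal covariance, that is, independent interesting points with one variance each, which is a simplification of the full covariance matrix $C_i$; the toy traces and all names are illustrative.

```python
import math
import random

def build_template(traces):
    """Return (mean vector, per-point variance) for one leakage class."""
    n = len(traces)
    dims = len(traces[0])
    mean = [sum(t[d] for t in traces) / n for d in range(dims)]
    var = [sum((t[d] - mean[d]) ** 2 for t in traces) / (n - 1)
           for d in range(dims)]
    return mean, var

def log_likelihood(trace, template):
    """Gaussian log-probability of a trace under a diagonal-covariance template."""
    mean, var = template
    return sum(-0.5 * math.log(2 * math.pi * v) - (t - m) ** 2 / (2 * v)
               for t, m, v in zip(trace, mean, var))

def match(trace, templates):
    """Return the leakage value whose template gives the highest likelihood."""
    return max(templates, key=lambda leak: log_likelihood(trace, templates[leak]))

# Toy profiling set: 25 traces per leakage value, 5 interesting points (assumed)
rng = random.Random(0)

def toy_trace(leak):
    return [leak + rng.gauss(0.0, 0.3) for _ in range(5)]

templates = {leak: build_template([toy_trace(leak) for _ in range(25)])
             for leak in (0, 1)}
guess = match(toy_trace(1), templates)
```

The matching step outputs one leakage value per target trace; these values, which may contain errors at low signal-to-noise ratio, are the inputs to the least-square comparison.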

Experimental Results and Discussion
As shown in Figure 6, our experimental setup includes a standard evaluation board SAKURA-G, an oscilloscope, and a computer. SAKURA-G is designed for hardware security evaluation and is equipped with two separate Spartan-6 FPGA chips. One chip serves as the control chip, while the other serves as the cryptographic chip. The cryptographic chip performs encryption operations, while the control chip controls the data flow and communicates with the oscilloscope and computer. During encryption, the power consumption of the cryptographic chip is measured by the oscilloscope, which is triggered by the control chip. Finally, power analysis attacks are performed on the computer. We first perform a side-channel attack on the serial implementation of QUAD (2, 160, 160). In the template building phase, 3000 power traces with different keys and coefficients are captured from a reference device, based on which 25 interesting points are selected by the CPA peak method. Next, we collect two groups of power traces corresponding to leakage values 0 and 1 in equation (4), and each group consists of 25 power traces. Finally, we build two templates, and the result is shown in Figure 7.
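One plausible reading of the CPA peak method is to correlate every sample point of the profiling traces with the known leakage value and keep the points with the largest absolute correlation. The sketch below follows that reading; it is an assumption about the selection procedure, and the tiny data set is purely illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_points(traces, leakages, k):
    """Indices of the k sample points most correlated with the leakage values."""
    n_samples = len(traces[0])
    scores = [abs(pearson([t[s] for t in traces], leakages))
              for s in range(n_samples)]
    best = sorted(range(n_samples), key=lambda s: scores[s], reverse=True)[:k]
    return sorted(best)

# Four toy traces with four sample points; only point 2 follows the leakage
leakages = [0, 1, 0, 1]
traces = [[5, 1, 0, 3], [5, 2, 1, 3], [5, 2, 0, 3], [5, 1, 1, 3]]
points = select_points(traces, leakages, 1)
```

In the experiments above, the same idea would be applied to 3000 profiling traces to keep 25 points.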
In the template matching phase, 320 power traces with the same key are captured from the target device. The template that leads to the highest probability indicates the correct leakage value, from which the key is recovered. The success rate of the template-based LSQ attack on the serial implementation is shown in Figure 8. When the number of power traces approaches 30, the success rate tends to 100%. Therefore, a successful attack requires only 30 power traces for the serial implementation.
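The least-square step for a single key bit can be sketched as follows. The sketch assumes that, for a diagonal monomial, the leakage of trace k is modelled as the public coefficient times the key bit, and that the template matching has already produced one (possibly wrong) leakage value per trace; both the model and the names are illustrative.

```python
def lsq_cost(coeffs, observed, hypothesis):
    """Squared sum of errors between hypothetical and observed leakage values."""
    return sum((a * hypothesis - o) ** 2 for a, o in zip(coeffs, observed))

def lsq_recover_bit(coeffs, observed):
    """Return the key-bit hypothesis with the smallest squared error."""
    return min((0, 1), key=lambda h: lsq_cost(coeffs, observed, h))

# Public coefficients of the same monomial in eight different polynomials
coeffs = [1, 0, 1, 1, 0, 1, 1, 0]
# Leakage values from template matching for key bit 1, with one matching error
observed = [1, 0, 1, 0, 0, 1, 1, 0]
recovered = lsq_recover_bit(coeffs, observed)
```

A single wrong template match raises the cost of the correct hypothesis by one but does not change the minimum, which is the sense in which the matching is fuzzy rather than exact.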
To further illustrate the effectiveness of our attack, the time required for our practical attack on serial QUAD is discussed here. Our attack is performed on a personal computer with an Intel i5-7500 CPU and 12 GB of RAM. The time for template building and template matching depends on the number of power traces used for building and matching, respectively. Figure 9 shows the time required for template building with the number of power traces ranging from 1 to 3000. Figure 10 shows the time required for template matching with the number of power traces ranging from 1 to 30. As our successful attack on the serial implementation of QUAD (2, 160, 160) requires fewer than 3000 power traces for template building and 30 power traces for template matching, the total time required for a successful attack is less than 1010 seconds, according to Figures 9 and 10.
According to the leakage model of the parallel implementation in equation (7), 4 bits of the key are simultaneously accumulated into a temporary register. Consequently, we need to guess 4 bits at a time. In the template building phase, 10000 power traces with different keys and coefficients are captured from a reference device, based on which 35 interesting points are selected by the CPA peak method. Next, we collect two groups of power traces corresponding to the leakage values in equation (7), and each group consists of 35 power traces. Finally, we build two templates, and the result is shown in Figure 11.
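Guessing 4 bits at a time means testing 16 hypotheses per register. The sketch below illustrates this under a simplifying assumption that is not taken from equation (7) itself: the leakage value of trace k is modelled as the XOR of four key bits masked by public coefficients.

```python
from itertools import product

def predicted_leakage(coeffs, bits):
    """XOR of four coefficient-masked key bits (assumed leakage model)."""
    acc = 0
    for a, b in zip(coeffs, bits):
        acc ^= a & b
    return acc

def lsq_recover_nibble(coeff_rows, observed):
    """Return the 4-bit hypothesis minimizing the squared sum of errors."""
    def cost(bits):
        return sum((predicted_leakage(row, bits) - o) ** 2
                   for row, o in zip(coeff_rows, observed))
    return min(product((0, 1), repeat=4), key=cost)

# Toy data: 16 traces whose coefficient vectors cover all 4-bit patterns
coeff_rows = list(product((0, 1), repeat=4))
true_bits = (1, 0, 1, 1)
observed = [predicted_leakage(row, true_bits) for row in coeff_rows]
observed[3] ^= 1                      # one erroneous template match
recovered = lsq_recover_nibble(coeff_rows, observed)
```

Any wrong hypothesis disagrees with the true one on half of these coefficient patterns, so a small number of matching errors does not change the minimizer.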
In the template matching phase, 320 power traces with the same key are captured from the target device. The template that leads to the highest probability indicates the correct leakage value, from which the key is recovered. The success rate of the template-based LSQ attack on the parallel implementation is shown in Figure 12. When the number of power traces approaches 150, the success rate tends to 100%. Therefore, 150 power traces are sufficient for a successful attack on the parallel implementation. Figure 13 shows the time required for template building with the number of power traces ranging from 1 to 10000. Figure 14 shows the time required for template matching with the number of power traces ranging from 1 to 180. As our successful attack on the parallel implementation of QUAD (2, 160, 160) requires fewer than 10000 power traces for template building and 150 power traces for template matching, the total time required for a successful attack is less than 1977 seconds, according to Figures 13 and 14.
To compare the success probability of the attacks, we performed our attack, the template attack, and the template-based DPA attack dozens of times each. We compare the typical results in Table 1, which shows that our proposed attack has the highest accuracy, greatly outperforming the template attack and the template-based DPA attack.

Suggested Countermeasures
Side-channel countermeasures aim at reducing the data dependency between physical information and the secret key. Usually, masking and hiding techniques are adopted. For multivariate cryptography, all monomials and polynomials can be computed in an arbitrary order. Therefore, the basic idea of countermeasures for multivariate cryptography is to randomly change the sequence of these operations.
A QUAD (q, n, r) has $(n + r) \times n(n + 1)/2$ monomials, which can be computed in $((n + r) \times n(n + 1)/2)!$ different orders. However, it is too expensive to implement such an algorithm. We propose a low-cost shuffling countermeasure that partially changes the order of the monomials of each polynomial equation $Q_k(X)$. Each polynomial is computed in an order that starts from two randomly generated indices $i_s$ and $j_s$, $1 < i_s \le j_s \le n$. A random index generator is required to generate such an order index, as shown in Figure 15, whose implementation requires only 556 GE.
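One simple way to realize such a partial shuffle is to keep the natural monomial order but start at the random index pair and wrap around. This cyclic order is an assumption made for illustration; it is not necessarily the exact order used by the proposed countermeasure.

```python
import random

def shuffled_order(n, i_s, j_s):
    """Monomial index pairs (i, j), i <= j, starting at (i_s, j_s) and wrapping around."""
    pairs = [(i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    start = pairs.index((i_s, j_s))
    return pairs[start:] + pairs[:start]

def random_start(n, rng):
    """Draw a random start pair with 1 < i_s <= j_s <= n."""
    i_s = rng.randint(2, n)
    j_s = rng.randint(i_s, n)
    return i_s, j_s

rng = random.Random(7)
n = 4                                   # toy size; the paper uses n = 160
i_s, j_s = random_start(n, rng)
order = shuffled_order(n, i_s, j_s)
```

Since XOR accumulation is commutative, the polynomial value is unchanged, while the clock cycle at which a given monomial updates the register varies from one polynomial to the next.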
For the parallel implementations, we propose a low-cost hiding countermeasure that partially randomizes the initial value of rotated x to shuffle the computation orders of

Conclusions
Multivariate cryptosystems consist of a large number of monomials and polynomials, where registers are required to store monomial and polynomial values during the encryption. Therefore, under a Hamming distance (HD) model, the registers leak the secret of the implementation. By applying the least-square technique to enable fuzzy matching of the templates, we propose a practical template-based least-square power analysis, which achieves a success rate close to 100% on both the serial and parallel implementations of QUAD (2, 160, 160). Our proposed attacks require only 30 and 150 power traces, respectively, to successfully reveal the secret key. The two proposed low-cost hiding countermeasures for serial and parallel implementations are also validated to be effective; they exploit the fact that all monomials and polynomials in multivariate cryptography can be computed in an arbitrary order to break the link between the power consumption and the secret key. Future work will focus on low-cost countermeasures for multivariate cryptography on IoT devices to resist side-channel attacks.

Data Availability
The mat data used to support the findings of this study are available from the corresponding author upon request.