Gap between Prediction and Truth: A Case Study of False-Positives in Leakage Detection

Since leakage detection was introduced as a popular side-channel security assessment, it has been plagued by false-positives (a.k.a. type I errors). To control this error, previous solutions set detection thresholds based on an assumption-based prediction of the false-positive rate (FPR). However, this study points out that such a prediction of the FPR may be inaccurate. We notice that the prediction in EuroCrypt 2016 is much smaller than the true FPR (approximately 1/779 of it). The gap between prediction and truth, called underpredicted false-positives (UFP), leads to severe false-positives in leakage detection. We then examine the statistical distribution of the test statistics to analyze the cause of the UFP. Our analysis indicates that the overlap between cross-validation (CV) blocks gives rise to an assumption error in the distribution of the CV-based estimates of the ρ-statistic, which is the root cause of the UFP. Therefore, we tackle the UFP by eliminating the overlap between blocks. Specifically, we propose a profiling-shared validation (PSV) and utilize this validation to improve the detection of any-variate, any-order leakages. Our experiments show that the PSV solves the UFP and saves more than 75% of the test time costs. In summary, this article reports a potential flaw in leakage detection and provides a complete analysis of the flaw for the first time.


Introduction
Side-channel attack (SCA) utilizes the physical leakages (such as execution time [1], power consumption [2], and electromagnetic radiation [3]) of a running device to retrieve some secrets (e.g., the private key) inside the device. Since it was proposed by Kocher [1], such an attack has seriously threatened the security of cryptographic modules, including smart cards [2,3] and FPGAs [4,5].
Thus, side-channel security assessments have been developed to evaluate the security of these modules against SCA [6].
Leakage detection is a side-channel security assessment that determines whether the leakages depend on the data (e.g., plaintext or ciphertext) accessible to an attacker [7]. In most cases, it uses a hypothesis test to identify the data-dependent leakages by comparing the test statistics with a certain threshold [8][9][10]. If no statistics exceed the threshold, the module is allowed to pass the test; otherwise, the module is rejected. In contrast to the attack-based assessment [6], leakage detection exploits the dependency between leakage and data rather than the key recovery. Hence, it has advantages in computational complexity and works well in black-box environments [10]. A typical example is the test vector leakage assessment (TVLA) [8].
This assessment utilizes Welch's t-test to compare the t-statistic with a threshold of 4.5 and identifies the leakages dependent on the plaintext. In EuroCrypt 2016, Durvaux et al. found that TVLA failed to detect some plaintext-dependent leakages [10]. Then, they put forward a correlation-based leakage detection (ρ_CV-test) to identify the leakages that are hard to detect for TVLA. The ρ_CV-test takes advantage of cross-validation (CV) to obtain a well-estimated ρ-statistic r_z,CV and then compares r_z,CV with a threshold of 5.0 to assess any-variate, any-order leakages. Based on the ρ_CV-test, more assessments have emerged, such as the χ²-test [11], T²-test (and its simplification, the D-test) [12], KS-test [13], ANOVA [14], and DL-based test [15].
However, as discussed in [10], some of the identified plaintext-dependent leakages are in fact not related to the plaintext. In other words, leakage detection is challenged by false-positives (a.k.a. type I errors). To address this challenge, the frequently used solutions carefully set the threshold for leakage detection according to an acceptable false-positive rate (FPR) [16,17]. Such solutions rely on an assumed distribution of the test statistics to make a fast prediction of the FPR. For example, Durvaux et al. at EuroCrypt 2016 chose 5.0 as the threshold of the ρ_CV-test based on the assumption that r_z,CV follows the standard normal distribution (when the ρ-statistic r_z = 0) [10]. This choice corresponds to a theoretical p value (i.e., the predicted FPR of a single test) below 10^-6 and reduces the predicted FPR (PFPR) of 10^4 tests to 0.0057 [17].
Obviously, the above solutions demand that the assumption must be close enough to the true distribution of the test statistics; otherwise, an assumption error (AE) in the distribution may lead to an inaccurate prediction of FPR. As far as we know, the potential risk of AE has not been studied in previous work.
This work studies the false-positives in the ρ_CV-test and focuses on the potential AE in [10]. We first propose an estimation algorithm that statistically approximates the true FPR of the ρ_CV-test. After running the algorithm, we notice the underpredicted false-positives (UFP): the true FPR of the ρ_CV-test is about 779 times the prediction in [10] at the threshold of 5.0. The UFP implies that the previous prediction of the FPR may be unreliable. Second, we discover an AE in the statistical distribution of r_z,CV. Due to the overlap between the training and the test blocks, there is a nonnegligible error between the assumed distribution and the true distribution of r_z,CV. This error explains well why the PFPR of the ρ_CV-test deviates from the truth.
Third, we present a new time-efficient validation, named profiling-shared validation (PSV), to tackle the UFP (and AE). The PSV splits the samples into m + k nonoverlapping subsets and assigns these subsets to different blocks: m subsets are allocated to the training block, and the other k subsets serve as k mutually exclusive test blocks. The experiments show that our PSV not only solves the UFP (and AE) but also reduces the time cost by more than 75%. The rest of this study is organized as follows. In Section 2, we introduce the underpredicted false-positives. Then, the UFP and AE are analyzed in Section 3. Next, we elaborate on the PSV-based leakage detection in Section 4. Section 5 applies the PSV to higher-order leakage detection, and Section 6 summarizes the whole study and draws our conclusions.

Underpredicted False-Positives
In [10], the FPR of the ρ_CV-test was predicted based on an assumed standard normal distribution. This section shows that the prediction in [10] is not accurate. We first review the details of the ρ_CV-test. Then, we introduce our estimation algorithm and describe the underpredicted false-positives.

Correlation-Based Leakage Detection.
The correlation-based leakage detection (ρ-test) takes advantage of a ρ-statistic r_z to identify plaintext-dependent leakage by comparing the absolute value |r_z| to a detection threshold T_det. In [10], Durvaux et al. utilized the k-fold CV to estimate the statistic r_z(τ) at time point τ as follows.
First, an assessor randomly splits the acquired traces L into k nonoverlapping subsets L^(i) of approximately the same size.
Then, for the j-th (1 ≤ j ≤ k) fold of the CV, the assessor defines the profiling block L_p^(j) := L \ L^(j) and the test block L_t^(j) := L^(j). Thus, a CV-based model is profiled from the profiling block:

  model^(j)(x) = E[ L_p^(j) | X = x ],    (1)

where X is the associated data.
Next, compute the Pearson correlation coefficient between the model outputs and the test leakages:

  r_CV^(j)(τ) = ρ̂( model^(j)(X_t^(j)), L_t^(j)(τ) ),    (2)

where X_t^(j) is the data associated with the test leakages L_t^(j). An unbiased estimate r_CV(τ) of the correlation coefficient is obtained by combining the k coefficients r_CV^(1), ..., r_CV^(k):

  r_CV(τ) = (1/k) Σ_{j=1}^{k} r_CV^(j)(τ).    (3)

Finally, after Fisher's z-transformation and normalization, the CV-based estimate of r_z(τ) is

  r_z,CV(τ) = √(N − 3) · (1/2) ln( (1 + r_CV(τ)) / (1 − r_CV(τ)) ),    (4)

where N is the size of the set L. Based on the assumption that r_z,CV can be interpreted by a normal distribution N(r_z, 1), the FPR of a single ρ_CV-test can be predicted by

  α = 2 · ( 1 − CDF_N(0,1)(T_det) ),    (5)

where CDF_N(0,1)(·) is the cumulative distribution function of the standard normal distribution N(0, 1).
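The CV-based estimation above can be made concrete with a minimal Python sketch. It assumes a mean-per-value leakage model (the sample mean of the profiling leakages for each data value), which is one common choice; the function name rho_cv_test and its arguments are illustrative, not from the original work.

```python
import numpy as np

def rho_cv_test(leakage, data, k=5, rng=None):
    """Sketch of the CV-based estimate r_z,CV of the rho-statistic at one
    time point: k-fold split, mean-per-value model, Fisher z-transform."""
    rng = np.random.default_rng(rng)
    n = len(leakage)
    folds = np.array_split(rng.permutation(n), k)
    r = []
    for j in range(k):
        test = folds[j]
        prof = np.concatenate([folds[i] for i in range(k) if i != j])
        # Profile: mean leakage per data value on the profiling block.
        grand = leakage[prof].mean()
        model = {x: leakage[prof][data[prof] == x].mean()
                 for x in np.unique(data[prof])}
        pred = np.array([model.get(x, grand) for x in data[test]])
        r.append(np.corrcoef(pred, leakage[test])[0, 1])
    r_cv = np.mean(r)
    # Fisher's z-transformation and normalization by the trace count.
    return np.sqrt(n - 3) * 0.5 * np.log((1 + r_cv) / (1 - r_cv))
```

Note that the training and test blocks overlap across folds (every trace profiles k − 1 models), which is exactly the property examined in Section 3.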

Statistical Estimation of FPR.
In statistical estimation, FPR is calculated as the ratio of the number N_FP of false-positives to the number N_AN of actual negatives:

  FPR = N_FP / N_AN.    (6)

However, neither N_FP nor N_AN is known to the evaluators in leakage detection. Therefore, we define a data-independent leakage set (DILSET) to statistically approximate the true FPR of the ρ_CV-test. A DILSET is a leakage set where each leakage L is independent of the associated data X′, written L ╨ X′. Thus, r_z(τ) ≡ 0, and all the leakages are negatives for the ρ_CV-test. By rewriting (6), the FPR of the ρ_CV-test can be estimated as

  FPR ≈ N_{|r_z|>T_det} / N_ρ,    (7)

where N_ρ is the number of ρ_CV-tests on the DILSET, and N_{|r_z|>T_det} is the number of (false) positives with |r_z| > T_det among the N_ρ test results. The statistical estimation is formally described in Algorithm 1. Taking the trace set L and the data vector X as input, the algorithm first creates a DILSET ⟨L, X′⟩ where the ρ-statistic r_z ≡ 0. In each loop of Algorithm 1, the function GenLeak(L) generates a leakage vector L based on L, and GFRnd(X) produces a random plaintext vector X′ on the Galois field GF(X), so that X′ ╨ L. Then, the function CorrTest(·) performs the ρ_CV-test on ⟨L, X′⟩ and returns an estimate r_z,CV to the vector r_z. After N_ρ loops, the algorithm counts the number of results with |r_z,CV| > T_det in r_z and outputs the ratio α as the estimated FPR (EFPR) of the ρ_CV-test. According to the law of large numbers, α converges to the true FPR of the ρ_CV-test as N_ρ increases.
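The estimation loop of Algorithm 1 can be sketched as follows, assuming a simulated Gaussian DILSET (leakages are pure noise, data are random 8-bit values, so independence holds by construction) and a caller-supplied corr_test function standing in for CorrTest(·); the names and the small default parameters are illustrative.

```python
import numpy as np

def estimate_fpr(corr_test, n_traces=1000, n_tests=200, t_det=5.0, rng=None):
    """Sketch of Algorithm 1: approximate the true FPR of a rho-test on a
    simulated DILSET. Leakage and data are generated independently, so
    every statistic exceeding t_det is a false positive."""
    rng = np.random.default_rng(rng)
    n_fp = 0
    for _ in range(n_tests):
        leak = rng.standard_normal(n_traces)    # GenLeak: Gaussian noise
        data = rng.integers(0, 256, n_traces)   # GFRnd: random 8-bit data
        if abs(corr_test(leak, data)) > t_det:
            n_fp += 1
    return n_fp / n_tests                       # ratio of false positives
```

With N_ρ = 10^7 as in the paper, the loop simply runs longer; the estimate converges to the true FPR by the law of large numbers.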

Assumption-Based Prediction vs. Statistical Estimation.
In this subsection, we run Algorithm 1 on three different DILSETs to check the prediction of the FPR in [10] (i.e., equation (5)).

Data-Independent Leakage Set.
The three DILSETs used to check the prediction are constructed as follows.
DILSET_1 is a simulated DILSET where all the leakages are Gaussian noise. Accordingly, in each loop of Algorithm 1, the function GenLeak(·) generates a leakage vector L composed of 5,000 random leakages l ∼ N(0, 1), and GFRnd(·) randomly produces 5,000 8-bit variables to form the associated data vector X′, with X′ ╨ L.
DILSET_2 and DILSET_3 are two measured DILSETs where all the leakages come from the DPA contests [18]. In each loop of Algorithm 1, the function GenLeak(L) randomly selects a time point τ from L and returns the leakages L(τ) at τ as L, and GFRnd(·) produces 5,000 8-bit random variables to form X′. The two DPA contests are detailed below.
DPA contest v2 targets an unprotected FPGA implementation of the Advanced Encryption Standard (AES) algorithm and provides three databases for side-channel evaluation [19]. In this study, we use the first 5,000 traces in the template base to build DILSET_2.
DPA contest v4.2 uses a rotating SBOX masking to protect the AES implementation on a smart card and collects 5,000 traces in each zip file [20]. This study decompresses the first zip file, corresponding to the SHA1 sum f711206b413b8d02f595d5861996ff61a1711f3d, to build DILSET_3.

Parameter Setting.
The number of folds k has an impact on the bias and variance of the CV-based estimation. A larger value of k gives a less biased estimate but a higher variance of the procedure. In [21], James et al. showed that k = 5 and k = 10 make a good trade-off between bias and variance. We choose 5, 10, 15, and 20 as the candidates for k to support the universality of our work. The detection threshold T_det determines the false-positives in leakage detection. Goodwill et al. suggested a threshold of 4.5 for TVLA [22]. However, as the number of points (or tests) increases, the test statistics may become so large that even a leak-free device cannot pass the test with T_det = 4.5.
Therefore, T_det = 5.0 was recommended for longer traces in [9] and adopted as the threshold of the ρ_CV-test in [10]. In our experiments, we set T_det to the empirical values of 4.5 and 5.0, respectively.
A larger number of ρ_CV-tests N_ρ improves the estimation accuracy of the FPR but means a longer running time of Algorithm 1. Because α_4.5 = 6.7953 × 10^-6 and α_5.0 = 5.733 × 10^-7, we set N_ρ = 10^7 to make a trade-off between the estimation accuracy and the time overhead.

Root Cause Analysis
Root cause analysis (RCA) identifies the root cause of UFP, which helps prevent the underprediction from recurring.
This section uses hypothesis test tools to determine the root cause of the UFP. We first introduce the test tools and then analyze the UFP and the AE in [10].

Hypothesis Test Tools.
The tests used for our RCA are given in Table 2.

One-Sample t-Test.
The one-sample t-test compares the sample mean to a prespecified value to test for a deviation from this value. The null hypothesis assumes that there is no difference between the true mean μ and the comparison value μ₀, while the alternative assumes that a difference exists. The test statistic, denoted by t, is calculated using the following formula:

  t = (x̄ − μ₀) / ( σ̂ / √n ),    (8)

where x̄ is the sample mean, σ̂ is the sample standard deviation, and n is the sample size. Then, the p value of the t-test can be computed:

  p = 2 · ( 1 − CDF_t(|t|; v) ),    (9)

Security and Communication Networks
where CDF_t(·) is the cumulative distribution function of Student's t-distribution, and v = n − 1 is the degree of freedom. As v increases, Student's t-distribution approaches the standard normal distribution N(0, 1) [23].
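The one-sample t-test above can be sketched in a few lines; the helper name is ours, and the two-sided p value follows equation form used here.

```python
import numpy as np
from scipy import stats

def one_sample_t(x, mu0=0.0):
    """One-sample t-test: t = (mean - mu0) / (s / sqrt(n)), with the
    two-sided p value from Student's t with v = n - 1 degrees of freedom."""
    x = np.asarray(x, float)
    n = len(x)
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    p = 2 * (1 - stats.t.cdf(abs(t), df=n - 1))
    return t, p
```

The result agrees with scipy.stats.ttest_1samp, which implements the same statistic.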

Chi-Squared Test.
The chi-squared test (χ²-test) can be used to test whether the true variance σ² is equal to a specified value σ₀². Under the null hypothesis σ² = σ₀², the test statistic of the χ²-test is

  χ² = (n − 1) · σ̂² / σ₀²,    (10)

where σ̂² is the sample variance. The p value of the null hypothesis is

  p = 2 · min( CDF_χ²(χ²; v), 1 − CDF_χ²(χ²; v) ),    (11)

where CDF_χ²(·) is the cumulative distribution function of a χ² distribution with v = n − 1 degrees of freedom:

  CDF_χ²(x; v) = ( 1 / (2^{v/2} Γ(v/2)) ) ∫₀ˣ t^{v/2 − 1} e^{−t/2} dt,    (12)

where Γ(·) denotes the gamma function [23].
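A short sketch of the χ²-test for a specified variance, with the two-sided p value taken as twice the smaller tail probability; the helper name is ours.

```python
import numpy as np
from scipy import stats

def chi2_variance_test(x, sigma0_sq=1.0):
    """Chi-squared test of H0: sigma^2 == sigma0_sq.
    Statistic: chi2 = (n - 1) * s^2 / sigma0_sq, with v = n - 1 dof;
    two-sided p value = twice the smaller tail probability."""
    x = np.asarray(x, float)
    n = len(x)
    chi2 = (n - 1) * x.var(ddof=1) / sigma0_sq
    cdf = stats.chi2.cdf(chi2, df=n - 1)
    p = 2 * min(cdf, 1 - cdf)
    return chi2, p
```

The test is scale-consistent: doubling the sample and quadrupling σ₀² leaves the statistic and p value unchanged.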

Kolmogorov-Smirnov Test.
The Kolmogorov-Smirnov test (KS-test) is a nonparametric test that compares a sample with a reference probability distribution. Its null hypothesis assumes that the samples come from a specified distribution Θ with cumulative distribution function F. In this test, an empirical cumulative distribution function F_n for n independent and identically distributed ordered observations X_i is defined as

  F_n(x) = (1/n) Σ_{i=1}^{n} 1_{X_i ≤ x},    (13)

and the test statistic is

  D_n = sup_x | F_n(x) − F(x) |,    (14)

where sup_x denotes the supremum over x of the set of distances. If the sample comes from Θ, D_n converges to zero as n goes to infinity. Its p value, that is, the probability that the null hypothesis holds, can be calculated following [24].
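The KS statistic against N(0, 1), as used in our RCA, can be computed with the standard order-statistic formula; the helper name is ours.

```python
import numpy as np
from scipy import stats

def ks_statistic(x):
    """Kolmogorov-Smirnov statistic against N(0, 1):
    D_n = sup_t |F_n(t) - F(t)|, evaluated at the sorted observations."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    cdf = stats.norm.cdf(x)
    d_plus = np.max(np.arange(1, n + 1) / n - cdf)   # F_n above F
    d_minus = np.max(cdf - np.arange(0, n) / n)      # F above F_n
    return max(d_plus, d_minus)
```

This matches the statistic returned by scipy.stats.kstest with the standard normal as reference.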

Pearson's Correlation Test.
Pearson's correlation test is used to test whether there is a relationship between two variables. Its null hypothesis assumes no correlation between the observed phenomena, ρ = 0. Given an estimate ρ̂ of ρ, the test statistic can be estimated:

  t = ρ̂ · √( (n − 2) / (1 − ρ̂²) ).    (15)

The p value of the test is determined by Student's t-distribution with n − 2 degrees of freedom [25]. Note that Pearson's correlation test is different in definition from the ρ-test in Section 2.1.

ALGORITHM 1: Statistical estimation of FPR.
Input: Trace set L, data vector X, number of CV folds k, detection threshold T_det, number of ρ-tests N_ρ.
Output: Estimate of FPR α.
  for i ← 1 : N_ρ
    L ← GenLeak(L);
    X′ ← GFRnd(X);
    r_z[i] ← CorrTest(L, X′, k);
  end
  α ← #{ i : |r_z[i]| > T_det } / N_ρ;
  return α.

Error in Assumed Distribution.
As shown in Figure 1, the UFP implies a nonnegligible error between the true distribution of r_z,CV and the assumed distribution N(0, 1). Following the description in [26], we name this error the assumption error in leakage detection and use the p values of the tests on r_z,CV to quantify the AE: the smaller the p value, the more strongly the null hypothesis should be rejected and the more significant the AE in the ρ_CV-test. The results of the hypothesis tests are given in Table 3. At the significance level of 0.01 (an accepted threshold in [27]), the p values of the t-tests exceed 0.01, which means the null hypothesis μ = 0 is accepted. In contrast, the p values of the χ²-tests and KS-tests are less than 0.01, that is, the null hypotheses σ² = 1 and Θ ∼ N(0, 1) are rejected. Hence, it is confirmed that the assumed N(0, 1) is not the true distribution of r_z,CV.
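For completeness, Pearson's correlation test used in our RCA can be sketched as follows; the helper name is ours, and the two-sided p value comes from Student's t-distribution with n − 2 degrees of freedom.

```python
import numpy as np
from scipy import stats

def pearson_corr_test(x, y):
    """Pearson's correlation test of H0: rho == 0.
    t = r * sqrt((n - 2) / (1 - r^2)); two-sided p value from
    Student's t with n - 2 degrees of freedom."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r * r))
    p = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))
    return r, p
```

The result agrees with scipy.stats.pearsonr, which tests the same null hypothesis.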

Correlation between Cross-Validation Blocks.
To determine the root cause of a problem, RCA establishes an event timeline from the normal situation up to the time the problem occurs. Based on equations (1)-(4), we create the timeline from the input of L to the occurrence of the UFP (or AE), as shown in Figure 2. To facilitate the RCA, an intrablock estimate r_z,CV^(j) is obtained by performing Fisher's z-transformation and normalization on r_CV^(j):

  r_z,CV^(j)(τ) = √(N/k − 3) · (1/2) ln( (1 + r_CV^(j)(τ)) / (1 − r_CV^(j)(τ)) ),    (16)

where N/k is the number of traces in the test block L_t^(j). The intrablock estimates r_z,CV^(j) pass the tests for N(0, 1), which excludes steps (1) and (2) from the possible sources of the AE. In [28], Bengio et al. proved that between-block correlation leads to a biased variance of the CV-based estimation (in equation (3)). Therefore, we utilize Pearson's correlation test to examine the correlation between the pairs (r_z,CV^(j1), r_z,CV^(j2)) with j1 ≠ j2. In the case of DILSET_1 and k = 5, the results of the correlation tests are given in Table 5. Since p ≡ 0, the null hypothesis ρ = 0 is rejected. The correlation between the CV blocks is confirmed and may be the cause of the AE in the ρ_CV-test.

(Figure 1: the empirical distributions of r_z,CV for k = 5, 10, 15, and 20 roughly coincide but deviate from the assumed distribution N(0, 1).)

Nonoverlapping Validation.
As shown in [28], the between-block correlation in cross-validation stems from the overlap between the training and the test blocks. In this subsection, we present a nonoverlapping validation (NOV) and compare the NOV-based ρ-test (ρ_NOV-test) with the ρ_CV-test to check the impact of the overlap on the FPR. The ρ_NOV-test works as follows. First, the traces L are split into k nonoverlapping blocks L^(j) of approximately the same size. For each block L^(j), 1/m of the traces are randomly selected as the test subset L_t^(j), and the others are left as the profiling subset L_p^(j). Then, a model is profiled from each profiling subset:

  model^(j)(x) = E[ L_p^(j) | X = x ].    (17)

Finally, an NOV-based estimate r_z,NOV of the ρ-statistic can be calculated by equations (18)-(20):

  r_NOV^(j)(τ) = ρ̂( model^(j)(X_t^(j)), L_t^(j)(τ) ),    (18)

  r_NOV(τ) = (1/k) Σ_{j=1}^{k} r_NOV^(j)(τ),    (19)

  r_z,NOV(τ) = √(N/m − 3) · (1/2) ln( (1 + r_NOV(τ)) / (1 − r_NOV(τ)) ),    (20)

where N/m is the total number of test traces. We run Algorithm 1 on an extended DILSET_1 (ExDILSET_1) to approximate the true FPR of the ρ_NOV-test.
Specifically, in each loop of the algorithm, the function GenLeak(·) produces 50,000 random leakages l ∼ N(0, 1) to form the leakage vector L, and GFRnd(·) randomly generates a vector X′ composed of 50,000 8-bit variables, so that X′ ╨ L. Then, the function CorrTest(·) performs the ρ_NOV-test on ⟨L, X′⟩ and returns an NOV-based estimate r_z,NOV. After 10,000,000 loops, 10,000,000 estimates r_z,NOV^(j) (for each j) and 10,000,000 estimates r_z,NOV are obtained. Finally, Algorithm 1 counts the number of false-positives and outputs the EFPR of the ρ_NOV-test.
The results of Pearson's correlation tests between the pairs (r_z,NOV^(j1), r_z,NOV^(j2)) (m = 10, k = 5) are given in Table 6. At the significance level of 0.01, all the pairs pass Pearson's correlation test, which means that there is no correlation between the NOV blocks. Table 7 provides the p values of the distribution tests on r_z,NOV and the EFPRs of the ρ_NOV-test. On the one hand, since p > 0.01, r_z,NOV passes the test for the distribution N(0, 1). On the other hand, compared with the EFPRs of the ρ_CV-test (Table 1), the EFPRs of the ρ_NOV-test are much closer to the PFPRs α_4.5 = 6.7953 × 10^-6 and α_5.0 = 5.733 × 10^-7. Therefore, both the AE and the UFP are solved by eliminating the between-block overlap. It is concluded that the overlap between the training and test blocks is the root cause of the AE and UFP in the ρ_CV-test.
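The NOV procedure can be sketched as follows. As with the earlier sketches, a mean-per-value model is assumed and the function name is illustrative; the final estimate is normalized by the total number of test traces.

```python
import numpy as np

def rho_nov_test(leakage, data, k=5, m=10, rng=None):
    """Sketch of the NOV-based estimate r_z,NOV: split the traces into k
    nonoverlapping blocks; inside each block, 1/m of the traces form the
    test subset and the rest the profiling subset, so no trace is shared
    between the profiling and test subsets of any folds."""
    rng = np.random.default_rng(rng)
    blocks = np.array_split(rng.permutation(len(leakage)), k)
    r, n_test = [], 0
    for blk in blocks:
        cut = len(blk) // m                  # 1/m of the block for testing
        test, prof = blk[:cut], blk[cut:]
        n_test += len(test)
        grand = leakage[prof].mean()
        model = {x: leakage[prof][data[prof] == x].mean()
                 for x in np.unique(data[prof])}
        pred = np.array([model.get(x, grand) for x in data[test]])
        r.append(np.corrcoef(pred, leakage[test])[0, 1])
    r_nov = np.mean(r)
    # Fisher's z; normalize by the total number of test traces.
    return np.sqrt(n_test - 3) * 0.5 * np.log((1 + r_nov) / (1 - r_nov))
```

The price of the independence is sample size: only 1/m of the traces contribute to each test subset, which motivates the PSV in Section 4.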

Improved ρ-Test
Section 3 has shown that the NOV is an effective solution to the UFP in the ρ_CV-test but requires more samples to construct the k nonoverlapping blocks. In this section, we introduce the profiling-shared validation to solve the UFP efficiently.

Profiling-Shared Validation.
Profiling-shared validation shares the same profiling samples and the same profiled model across all the test blocks. In a PSV-based ρ-test (ρ_PSV-test), the evaluator randomly splits the traces into m + k nonoverlapping subsets L^(i) of approximately the same size. Choosing the first m subsets as the profiling block L_p := { L^(i) | 1 ≤ i ≤ m }, the model at time point τ can be profiled:

  model(x) = E[ L_p | X = x ].    (21)

Then, for the j-th test block L_t^(j) := L^(j+m), 1 ≤ j ≤ k, the Pearson correlation coefficient between the model outputs and the test leakages L_t^(j) is computed:

  r_PSV^(j)(τ) = ρ̂( model(X_t^(j)), L_t^(j)(τ) ).    (22)

Next, after Fisher's z-transformation and normalization, the intrablock estimates are obtained:

  r_z,PSV^(j)(τ) = √(N/(m + k) − 3) · (1/2) ln( (1 + r_PSV^(j)(τ)) / (1 − r_PSV^(j)(τ)) ),    (23)

where N/(m + k) is the number of traces in each test block. Finally, averaging the intrablock estimates r_z,PSV^(j) and then normalizing the mean yields a PSV-based estimate r_z,PSV(τ) of the ρ-statistic:

  r_z,PSV(τ) = (1/√k) Σ_{j=1}^{k} r_z,PSV^(j)(τ).    (24)

The results of Pearson's correlation tests between the pairs (r_z,PSV^(j1), r_z,PSV^(j2)) are given in Table 8. At the significance level of 0.01, the null hypothesis ρ = 0 holds. Table 9 provides the p values of the distribution tests on r_z,PSV and the EFPRs of the ρ_PSV-test. Since all the p values exceed the acceptable level of 0.01, the PSV-based estimates r_z,PSV pass the test for N(0, 1). In addition, compared to the EFPRs of the ρ_CV-test (Table 1), the EFPRs of the ρ_PSV-tests are much closer to the PFPRs. Hence, the PSV is an effective solution to the AE and UFP.
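The PSV can be sketched as follows, again assuming a mean-per-value model; the function name and argument defaults are illustrative. Because the k test blocks share one profiled model and are mutually exclusive, the intrablock estimates are approximately independent, so their renormalized mean stays close to N(0, 1) under the null hypothesis.

```python
import numpy as np

def rho_psv_test(leakage, data, m=10, k=5, rng=None):
    """Sketch of the PSV-based estimate r_z,PSV: m + k nonoverlapping
    subsets; the first m form one shared profiling block, the other k are
    mutually exclusive test blocks validated against the single model."""
    rng = np.random.default_rng(rng)
    subsets = np.array_split(rng.permutation(len(leakage)), m + k)
    prof = np.concatenate(subsets[:m])
    # One shared model: mean leakage per data value on the profiling block.
    grand = leakage[prof].mean()
    model = {x: leakage[prof][data[prof] == x].mean()
             for x in np.unique(data[prof])}
    z = []
    for blk in subsets[m:]:
        pred = np.array([model.get(x, grand) for x in data[blk]])
        r = np.corrcoef(pred, leakage[blk])[0, 1]
        # Intrablock Fisher z-transformation and normalization.
        z.append(np.sqrt(len(blk) - 3) * 0.5 * np.log((1 + r) / (1 - r)))
    # Average the k intrablock estimates and renormalize the mean.
    return np.sqrt(k) * np.mean(z)
```

Profiling happens once instead of k times, which is where the time savings reported below come from.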

Efficiency.
Different from the CV, which requires k profiled models for the validation, the PSV shares the same model across all test blocks, which means the ρ_PSV-test spends less time in the profiling phase than the ρ_CV-test. We compare the execution time of the ρ_PSV-test and the ρ_CV-test on an HP EliteBook 735 G6 (AMD Ryzen 5 PRO 3500U CPU @ 2.1 GHz, 8 GB RAM) running Windows 10. The time overhead of 1,000,000 tests on DILSET_1 is shown in Figure 3. Compared with the time cost of 1,000,000 ρ_CV-tests of at least 676.94 seconds when k = 5, the maximum time cost of 1,000,000 ρ_PSV-tests is 144.98 seconds when (m, k) = (10, 5), saving about 79% of the time costs. In short, the ρ_PSV-test is more time-efficient than the ρ_CV-test.

Measured Experiments.
We verify whether the ρ_PSV-test can identify the plaintext-dependent leakages in the captured power traces. In our measured experiments, the ρ_PSV-test assesses the dependency between the leakages and the first 4 plaintext bytes in DPA contest v4.2.

Higher-Order Leakage Detection
Higher-order leakage detection evaluates the security of the protected implementation against higher-order side-channel analysis. In this section, we analyze the performance of the PSV method in detecting higher-order leakages.

Combining Function.
In higher-order side-channel analysis, a combining function maps the leakages of multiple shares to a single univariate leakage. The work of Prouff et al. demonstrated the central product combining function

  C(τ_1, ..., τ_d) = Π_{i=1}^{d} ( L_i(τ_i) − E[L_i(τ_i)] ),    (26)

where L_i(τ_j) is the i-th leakage at time point τ_j in the trace set L, and showed that it is optimal for the Hamming weight leakage scenario represented by the smart card platform [29]. In our experiments, the central product function is selected to preprocess the leakages of the masking. According to the classification in [30], we analyze the performance of the PSV under the following combining functions:

(1) The central product function with d = 1 and o = 2,
(2) The central product function with d = 1 and o = 3,
(3) The central product function with d = 2 and o = 2,
(4) The central product function with d = 3 and o = 3.

(Figure 4: results on ExDILSET_1 with m = 10 and k = 5.)

The simulated experiments are constructed as follows. In addition to the vector X′, two sets of leakages l ∼ N(0, 1) are randomly generated, so that L(τ_1) ╨ L(τ_2) ╨ X′. Then, the leakages L(τ_1) and L(τ_2) are combined by the central product function (26). Finally, we perform the ρ_CV-test or ρ_PSV-test on the combined leakages to obtain r_z,CV or r_z,PSV. For each (o, d), Table 10 provides the results of the distribution tests on r_z,CV and r_z,PSV. At the significance level of 0.01, r_z,CV is rejected by the tests of σ² = 1 and N(0, 1), while r_z,PSV passes. It is thus shown that the AE and UFP in the higher-order ρ_CV-test can be solved by the PSV method.
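The central product combining step used in these experiments can be sketched as a pointwise product of mean-centered leakages; the helper name is ours.

```python
import numpy as np

def central_product(*leakages):
    """Central product combining function (sketch): multiply the
    mean-centered leakages of the shares, C = prod_i (L_i - mean(L_i))."""
    out = np.ones_like(leakages[0], dtype=float)
    for l in leakages:
        out = out * (np.asarray(l, float) - np.mean(l))
    return out
```

The combined trace is then fed to the ρ_CV-test or ρ_PSV-test exactly as a first-order leakage would be.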

Efficiency.
The time costs of 1,000,000 2nd-order ρ-tests are shown in Figure 5. As shown in Figure 5(a), compared with the time cost of 1,000,000 univariate 2nd-order ρ_CV-tests of at least 743.85 seconds, the maximum time cost of the 1,000,000 ρ_PSV-tests is 161.31 seconds, saving about 78% of the time overhead of the detections. In Figure 5(b), compared with the minimum time cost of the bivariate 2nd-order ρ_CV-tests of 767.80 seconds, the ρ_PSV-test takes at most 187.23 seconds, saving about 76% of the time overhead. In other words, the higher-order ρ_PSV-test is more time-efficient than the higher-order ρ_CV-test.

Measured Experiments.
We use (26) to preprocess DPA contest v4.2 and run the ρ_PSV-test on the preprocessed traces. Figure 6 shows the results of the ρ_PSV-test between the traces and the first 4 plaintext bytes. When T_det = 5.0, the estimates exceed the threshold within the interval [382.49 μs, 383.59 μs], which means that the univariate 2nd-order leakages in the measured power consumption are identified by the ρ_PSV-test.

Conclusions
Assumption error (AE) invalidates side-channel security assessment. This study finds that the false-positive rate of leakage detection might be mispredicted due to potential errors between the assumed distribution and the true distribution of the estimated test statistics. We notice underpredicted false-positives (UFP) in [10].
This underprediction, interpreted as the AE in the statistical distribution of the estimates of the ρ-statistic, is caused by the overlap between the training and the test blocks in cross-validation. In addition, we propose the profiling-shared validation (PSV) to improve the detection of any-variate, any-order leakages and show that the UFP and AE can be addressed by eliminating the between-block overlap. Compared with the ρ_CV-test, our ρ_PSV-test overcomes the UFP and takes less than 25% of the time cost. To the best of our knowledge, this article is the first empirical study of the false-positives in leakage detection. In future work, we will refine our tools, including the estimation algorithm and the distribution tests, to preevaluate other security assessments.
Data Availability

The datasets and codes used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest.