High-Efficiency Min-Entropy Estimation Based on Neural Network for Random Number Generators



Introduction
The random number generator (RNG) is a fundamental element of modern cryptography and underpins the security of network and communication systems [1][2][3][4]. The outputs of RNGs, called random numbers, are widely used in security and cryptographic applications, including the generation of cryptographic keys, initialization vectors for cryptographic algorithms, digital signatures, nonces, and padding values. If the output of an RNG does not provide the expected unpredictability, the cryptographic applications that depend on it become vulnerable [5][6][7]. Security analysis of RNGs is therefore necessary, and it is especially important to evaluate the quality of the entropy source, which is the main source of randomness for an RNG.
To guide designers, users, and assessors in analyzing the security of RNGs, research organizations and individuals have proposed a number of approaches for testing and evaluating RNGs. These approaches can be roughly divided into two classes: statistical property tests and entropy estimation. Statistical property tests were proposed first, such as NIST Special Publication 800-22 [8], AIS 31 [9], the Diehard battery [10], and TestU01 [11]; they detect whether the output sequence has obvious statistical defects. Because such tests consider only the statistical properties of the outputs rather than the internal structure and generation principle of the RNG, they are a universal (black-box) testing method for various types of generators and are easy to apply. As the understanding of randomness has deepened in recent years, the concept of "entropy" has been adopted to evaluate the security of RNGs. Entropy is a measure of uncertainty and is therefore appropriate for reflecting the amount of randomness. Accordingly, the criteria of several major standardization organizations, such as ISO/IEC 18031 [12] and AIS 31 [9], recommend entropy for quantifying the randomness (unpredictability) of RNG outputs. There are several entropy measures, including Shannon entropy, Rényi entropy, and min-entropy. Min-entropy is a very conservative measure that captures the difficulty of guessing the most likely output of an entropy source [13].
However, estimating the entropy of an entropy source is very challenging, because common assumptions may not match real conditions and the distribution of the outputs is unknown. There are currently two ways to perform entropy estimation: theoretical entropy estimation (stochastic modeling) and statistical entropy estimation. A theoretical proof of the security of an RNG can be obtained from a suitable stochastic model [14][15][16][17]. But modeling is difficult and complex, because it depends on the specific structure of the RNG and on appropriate assumptions about the behavior of the entropy source, and some RNG structures still have no suitable model [18][19][20][21]. Statistical entropy estimation, by contrast, is implemented as statistical black-box testing and is therefore applicable to various types of RNGs. It can thus partly address the problem that the entropy of some RNGs cannot be quantified by modeling. NIST Special Publication 800-90B [22] (called 90B in the text below) is a typical representative of statistical entropy estimation; it is based on min-entropy and specifies how to design and test entropy sources. The final version of the 90B was officially published in January 2018 [23] and replaces the second draft published in 2016 [22]. The predictors proposed by Kelsey et al. [24] perform better than the other estimators in this standard. Each predictor is a machine learning algorithm that attempts to predict each sample in a sequence and updates its internal state based on all observed samples. However, the 90B's predictors have some problems. On the one hand, every predictor is designed to perform well only for sources with a certain statistical property, as stated in [24], which constrains the application scope of the predictors. On the other hand, the execution efficiency of these predictors degrades significantly when the selected sample space is large. Our analysis shows that the time complexity of the 90B's predictors has a high-order relationship with the size of the sample space. In the C++ code of the 90B's estimators released in 2018, the number of bits per symbol is still limited to between 1 and 8, inclusive, to prevent excessively low execution efficiency. Therefore, these predictors are unlikely to be well suited to evaluating entropy sources with unknown statistical behavior, multivariate outputs, or long-range correlation.
The output sequences of RNGs are a type of time series. There has been increasing attention to using neural networks to model and forecast time series, and they have been found to be an alternative to various traditional time series models [25][26][27]. Some neural networks can predict time series by approximating the probability distribution function (PDF), and their time complexity varies linearly with the sample space. Feedforward neural networks (FNNs) and recurrent neural networks (RNNs) are typical representatives. FNNs are the quintessential deep learning models. In 1991, de Groot and Würtz [28] presented a detailed analysis of univariate time series forecasting using FNNs for two nonlinear time series. FNNs have also been used to approximate PDFs [29]. RNNs are a family of neural networks for processing sequential data and can also be used for time series forecasting [30][31][32]. Therefore, it is worthwhile and feasible to study new entropy estimation methods for RNGs based on neural networks.

Motivation.
In this paper, we aim to propose suitable and efficient predictors for the security evaluation of RNGs, especially for the min-entropy estimation of sequences generated by entropy sources. Since both the FNN and the RNN are suitable for time series prediction, we design two predictors based on these neural network models. The advantages of the proposed predictors are as follows. Because the selected neural network models are, in principle, broadly applicable to the prediction of various types of time series, the designed predictors are widely applicable to the entropy estimation of entropy source outputs. Moreover, neural network-based predictors have high execution efficiency, so they can handle random numbers with long-range dependence and multivariate outputs, which are otherwise difficult to evaluate because of the high time complexity.

Contributions.
In summary, we make the following contributions:
(i) We are the first to adopt neural network models to design predictors that estimate min-entropy for RNGs, and we propose a suitable execution strategy that makes our approach applicable to both stationary and nonstationary sequences generated by different entropy sources.
(ii) We conduct a series of experiments to verify the accuracy of our predictors using many typical simulated sources whose theoretical entropy can be obtained from the known probability distribution. Additionally, the computational complexity is evaluated theoretically. The results show that our approach estimates the entropy with an error of less than 6.55%, whereas the error of the 90B's predictors is up to 14.65%. The time complexity of our estimation is linear in the size of the sample space, whereas that of the 90B has a high-order relationship with it.
(iii) We experimentally compare our predictors with the 90B's predictors in terms of accuracy, application scope, and execution efficiency. The experimental datasets include several typical real-world data sources and various simulated datasets. The results indicate that our predictors have higher accuracy, higher execution efficiency, and a wider scope of applicability than the 90B's predictors. Furthermore, as the test sample space and sample size grow, the execution efficiency of the 90B's predictors becomes too low to estimate the entropy within an acceptable time, while our proposed predictors can still compute the estimates efficiently.
The rest of the paper is organized as follows. In Section 2, we introduce fundamental definitions of min-entropy, the evolution of the 90B, the estimators (especially the predictors) defined in this criterion, and the two typical neural networks that support our research. In Section 3, we propose two neural network-based predictors for min-entropy estimation, design an execution strategy, and give the accuracy verification and complexity analysis of our predictors. In Section 4, we apply our predictors to different types of simulated and real-world data sources and compare them with the 90B's predictors in terms of accuracy, application scope, and execution efficiency. We conclude our work in Section 5.

Preliminaries
In this section, we first introduce the fundamental concept of min-entropy, which is the core mathematical idea and assessment method in our work. Then, we introduce the evolution of the 90B and relevant research on this criterion. After that, we introduce the estimators defined in the 90B, especially the predictive model-based estimators, which are the focus of this paper. Finally, we describe two predictive models based on neural networks, which apply to time series forecasting and contribute to the design of the new predictors in our work.

Min-Entropy of Entropy Source.
The concept of entropy is the core mathematical idea of the 90B, and min-entropy is its assessment method. Min-entropy is a conservative measure that ensures the quality of random numbers in the worst case for high-security applications, such as seeding PRNGs. The 90B [22] defines min-entropy as follows. Let $X$ be an independent discrete random variable that takes values from the set $A = \{x_1, x_2, \ldots, x_k\}$ ($k \in \mathbb{Z}^*$ denotes the size of the sample space) with probability $\Pr(X = x_i) = p_i$ for $i = 1, 2, \ldots, k$. The min-entropy of the output is
$$H_{\min} = \min_{1 \le i \le k}\left(-\log_2 p_i\right) = -\log_2\left(\max_{1 \le i \le k} p_i\right). \quad (1)$$
If $X$ has min-entropy $H$, then the probability of observing any particular value of $X$ is no greater than $2^{-H}$. The maximum possible min-entropy of a random variable with $k$ distinct values is $\log_2(k)$, which is attained when the random variable has a uniform probability distribution, namely, $p_1 = p_2 = \cdots = p_k = 1/k$.
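As a small illustration of this definition, the following sketch computes the min-entropy of a known discrete distribution. The function name `min_entropy` and the two example distributions are chosen here for illustration and do not come from the 90B.

```python
import math

def min_entropy(probs):
    """Min-entropy in bits: -log2 of the largest probability."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -math.log2(max(probs))

# Illustrative distributions (assumed, not taken from the paper).
uniform = [1 / 8] * 8                 # k = 8, so the maximum log2(8) is reached
biased = [0.5, 0.25, 0.125, 0.125]    # most likely value has probability 1/2

print(min_entropy(uniform))
print(min_entropy(biased))
```

The uniform case reaches the upper limit $\log_2(k)$, while the biased case is governed only by its most likely value, which is what makes min-entropy a worst-case measure.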
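For non-IID sources, such as Markov processes, Turan et al. provide a method for calculating min-entropy in [22]. A stochastic process $\{X_i\}_{i \in \mathbb{N}}$ that takes values from the finite set $A$ defined above is known as a first-order Markov chain if
$$\Pr(X_{m+1} = x_{m+1} \mid X_m = x_m, X_{m-1} = x_{m-1}, \ldots, X_0 = x_0) = \Pr(X_{m+1} = x_{m+1} \mid X_m = x_m) \quad (2)$$
for any $m \in \mathbb{Z}^*$ and all $x_0, x_1, \ldots, x_m, x_{m+1} \in A$. In a $d$th-order Markov process, the transition probabilities have the property that
$$\Pr(X_{m+1} = x_{m+1} \mid X_m = x_m, \ldots, X_0 = x_0) = \Pr(X_{m+1} = x_{m+1} \mid X_m = x_m, \ldots, X_{m-d+1} = x_{m-d+1}). \quad (3)$$
The initial probabilities of the process are $p_i = \Pr(X_0 = i)$, and the transition probabilities are $p_{ij} = \Pr(X_{m+1} = j \mid X_m = i)$. The min-entropy of a Markov process of length $L$ is defined as
$$H_{\min} = -\log_2\left(\max_{x_0, \ldots, x_{L-1}} p_{x_0} \prod_{j=1}^{L-1} p_{x_{j-1} x_j}\right). \quad (4)$$
The approximate min-entropy per sample is obtained by dividing $H_{\min}$ by $L$.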
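The maximization over all length-$L$ sequences need not be done by enumeration. A minimal sketch, assuming a first-order chain given by an initial vector and a transition matrix, tracks the most likely path ending in each state with dynamic programming. The function name `markov_min_entropy` and the two-state example are hypothetical.

```python
import math

def _log2(p):
    """log2 that maps probability 0 to -infinity instead of raising."""
    return math.log2(p) if p > 0 else -math.inf

def markov_min_entropy(init, trans, length):
    """-log2 of the probability of the most likely path of `length` samples."""
    k = len(init)
    # best[j] = log2-probability of the most likely path that ends in state j
    best = [_log2(p) for p in init]
    for _ in range(length - 1):
        best = [max(best[i] + _log2(trans[i][j]) for i in range(k))
                for j in range(k)]
    return -max(best)

# Hypothetical "sticky" two-state chain.
init = [0.5, 0.5]
trans = [[0.9, 0.1],
         [0.1, 0.9]]
L = 128
h_total = markov_min_entropy(init, trans, L)
print(h_total / L)  # approximate min-entropy per sample
```

Working in the log domain avoids underflow when $L$ is large, since the raw path probability shrinks geometrically with the length.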

NIST SP 800-90B and Its Entropy Estimation.
The 90B is a typical standard that evaluates the quality of an entropy source from the perspective of min-entropy. Its evolution mainly includes the following three stages. The final version, published in January 2018, makes some corrections to the second draft.
(i) The first draft of the 90B [13] was published in August 2012 and included five estimators proposed by Hagerty and Draper [33]: the collision test, partial collection test, Markov test, compression test, and frequency test. These estimators are suitable for sources that do not necessarily satisfy the IID assumption. However, Kelsey et al. [24] found through experiments that they give significant underestimates.
(ii) The 90B was subsequently updated to the second draft [22] in January 2016, which adopted for the first time the predictor-based estimators proposed by Kelsey et al. Compared with the first draft, the second draft has the following main changes. (1) The partial collection test was deleted, the frequency test was replaced by the most common value estimator, and two new estimators, the t-tuple estimator and the longest repeated substring estimator, were added. (2) Four predictors for entropy estimation were added. However, the underestimation problem was not solved in the second draft. Zhu et al. [34] demonstrated the underestimation problem for non-IID data through theoretical analysis and experimental validation and proposed an improved method.
(iii) The final official version of the 90B [23] was published in January 2018. In the final 90B, estimators with significant underestimates, such as the collision estimator and the compression estimator, are restricted to binary inputs, which may reduce the overall execution efficiency of min-entropy estimation for nonbinary inputs. In addition, the calculation of key variables for min-entropy estimation, such as $P'_{global}$ and the min-entropy of each predictor, is also corrected.

Execution Strategy of Min-Entropy Estimation in 90B.
The 90B takes the following strategy to estimate the min-entropy. It first checks whether the tested datasets are IID. If the tested datasets are non-IID, the ten estimators mentioned in Section 2.2.2 are used for entropy estimation. Each estimator calculates its own estimation independently, and the minimum of all estimations is selected as the final estimation result for the entropy source. If the tested datasets are considered IID, only the most common value estimator is employed. Finally, the 90B applies restart tests and gives the entropy estimation.
Note that this article focuses only on the analysis of, and comparison with, the 90B's four predictors; the other parts of the 90B are not considered in this study.

Estimators in 90B.
In the final NIST SP 800-90B, there are ten estimators, and each estimator has its own specific characteristics. According to the underlying methods they employ, we divide these estimators into three classes: frequency-based, entropy statistic-based, and predictor-based. The following is a brief introduction to the ten estimators; the details can be found in [23].
(1) Predictor-Based Type Estimators. The following four predictors were first proposed for entropy estimation by Kelsey et al. [24], who used several machine learning models as predictors to improve the accuracy of entropy estimation. However, each of these predictors performs well only for specific distributions:
(i) Multi Most Common in Window (MultiMCW) Predictor. This predictor performs well in cases where there is a clear most common value, but that value varies over time.
(ii) Lag Predictor. The Lag subpredictor predicts the value that occurred $N$ samples back in the sequence. This predictor performs well on sources with strong periodic behavior, if $N$ is close to the period.
(iii) Multi Markov Model with Counting (MultiMMC) Predictor. The MultiMMC subpredictor predicts the most common value that followed the previous $N$-sample string. The parameter $N$ ranges from 1 to 16. This predictor performs well on data from any process that can be accurately modeled by an $N$th-order Markov model.
(iv) LZ78Y Predictor. This predictor performs well on the sort of data that would be efficiently compressed by LZ78-like compression algorithms.
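To make the behavior of a Lag subpredictor concrete, here is a minimal sketch of a single subpredictor with a fixed lag. It is not the 90B reference implementation, which runs a set of subpredictors and follows the one that has performed best so far; the function name `lag_predictor_hits` and the test sequence are illustrative assumptions.

```python
def lag_predictor_hits(samples, lag):
    """Predict each sample as the value seen `lag` positions earlier.

    Returns (number of correct predictions, number of predictions made).
    """
    hits = 0
    for t in range(lag, len(samples)):
        if samples[t - lag] == samples[t]:
            hits += 1
    return hits, len(samples) - lag

# Hypothetical strongly periodic source with period 4.
periodic = [0, 1, 2, 3] * 25

print(lag_predictor_hits(periodic, 4))  # lag equal to the period
print(lag_predictor_hits(periodic, 3))  # lag that does not match the period
```

When the lag equals the period, every prediction is correct, and when it does not, the subpredictor fails completely on this source, which is why the predictor is only suited to sources with periodic structure.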
(2) Other Estimators. Four frequency-based estimators and two entropy statistic-based estimators are described as follows. The former calculate the min-entropy from the probability of the most likely output value, and the latter are based on the entropic statistics presented by Hagerty and Draper [33]. Among them, three estimators (the Markov estimator, the Collision estimator, and the Compression estimator) are explicitly stated to apply only to binary inputs in the final 90B published in 2018, which may reduce the execution efficiency for nonbinary inputs, as shown by the experiments in Section 4.3.
(i) Most Common Value Estimate. This estimator calculates entropy based on the number of occurrences of the most common value in the input dataset and then constructs a confidence interval for this proportion. The upper bound of the confidence interval is used to estimate the min-entropy per sample of the source.
(ii) Markov Estimate. This estimator computes entropy by modeling the noise source outputs as a first-order Markov model. The Markov estimate provides a min-entropy estimate by measuring the dependencies between consecutive values from the input dataset. This method is only applied to binary inputs.
(iii) T-Tuple Estimate. This method examines the frequency of t-tuples (pairs, triples, etc.) that appear in the input dataset and produces an estimate of the entropy per sample based on the frequency of those t-tuples.
(iv) Longest Repeated Substring Estimate (LRS Estimate). This method estimates the collision entropy (sampling without replacement) of the source based on the number of repeated substrings (tuples) within the input dataset.
(v) Collision Estimate. This estimator calculates entropy based on the mean number of samples needed to see the first collision in a dataset, where a collision is any repeated value. This method is only applied to binary inputs.
(vi) Compression Estimate. This estimator calculates entropy based on how much the tested data can be compressed. This method is also only applied to binary inputs.

Min-Entropy Estimation of 90B's Predictors.
Each predictor in the 90B attempts to predict the next sample in a sequence according to a certain statistical property of previous samples and provides an estimated result based on the probability of successful prediction. Every predictor consists of a set of subpredictors and chooses the subpredictor with the highest rate of successful predictions to predict the subsequent output. As for each predictor, it calculates the global predictability and local predictability with the upper bound of the 99% confidence interval and then derives the global and the local entropy estimations, respectively. Finally, the final entropy estimation for this predictor is the minimum of the global and the local entropy estimations.
For estimating the entropy of a given entropy source, each predictor offers a predicted result after testing the outputs produced by the source and provides an entropy estimation based on the probability of successful predictions. After obtaining the estimations from the predictors, the minimum estimation of all the predictors is taken as the final entropy estimation of the entropy source.
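The following sketch illustrates how a prediction hit rate can be turned into a global entropy estimate and how the per-predictor results are combined. It rests on several assumptions: the constant 2.576 for the 99% bound and the special case for zero correct predictions reflect our reading of the final 90B, the local estimate (based on the longest run of correct predictions) is taken as a given number rather than computed, and the function names and numeric values are hypothetical.

```python
import math

def global_entropy(correct, total):
    """Entropy estimate from the upper 99% bound on the global hit rate."""
    p = correct / total
    if correct == 0:
        # Assumed special case when no prediction was correct.
        p_upper = 1 - 0.01 ** (1 / total)
    else:
        p_upper = min(1.0, p + 2.576 * math.sqrt(p * (1 - p) / (total - 1)))
    return -math.log2(p_upper)

def source_estimate(per_predictor):
    """Minimum over predictors of min(global estimate, local estimate)."""
    return min(min(g, l) for g, l in per_predictor.values())

# Hypothetical results: name -> (global estimate, local estimate).
results = {
    "lag": (global_entropy(500, 1000), 0.95),
    "multimmc": (global_entropy(620, 1000), 0.90),
}
print(source_estimate(results))
```

Taking the minimum means that the predictor with the highest success rate sets the final estimate, which matches the conservative intent of min-entropy.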
The entropy estimation will be too loose if no predictor detects the predictable behavior of the source. But if a set of predictors with different approaches is applied, the predictor that is most effective at predicting the entropy source's outputs determines the entropy estimation.

Two Predictive Models Based on Neural Networks.
Next, we introduce the two neural network models that we use to design predictors for entropy estimation: feedforward neural networks (FNNs) and recurrent neural networks (RNNs).

FNNs.
The goal of a feedforward network is to approximate some function $f^*$. For instance, a classifier $Y = f^*(X)$ maps an input $X$ to a category $Y$. A feedforward network describes a mapping $Y = f(X; \theta)$ and learns the value of the parameters $\theta$ that results in the best function approximation. The principle of the FNN is depicted in Figure 1.
For each time step from $t = 1$ to $t = n$ ($n \in \mathbb{Z}^*$ denotes the sample size), the FNN applies the following forward propagation equations:
$$H_t = f(b + W X_t), \quad Y_t = c + V H_t, \quad \hat{Y}_t = g(Y_t). \quad (5)$$
The parameters and functions that govern the computation in an FNN are as follows:
(i) $X_t$ is the input at time step $t$ and is a vector composed of the previous inputs (i.e., $X_t = [x_{t-k}, \ldots, x_{t-1}]$, where $k$ refers to the step of memory).
(ii) $H_t$ is the hidden state at time step $t$, where the bias vector $b$ and the input-to-hidden weights $W$ are derived via training. The number of hidden layers and the number of hidden nodes per layer are defined before training; they are called hyperparameters in neural networks.
(iii) $Y_t$ is the output at step $t$, where the bias vector $c$ and the hidden-to-output weights $V$ are derived via training.
(iv) $\hat{Y}_t$ is our predictive output at time step $t$, which is a vector of probabilities across the sample space.
(v) The function $f(\cdot)$ is a fixed nonlinear function called the activation function, and $g(\cdot)$ is an output function used in the final layer of the network. Both functions are hyperparameters defined before training (Section 3.2).
The models are called feedforward because information flows from the input $X_t$, through the intermediate computations that define $f(\cdot)$, and finally to the output $\hat{Y}_t$. There are no feedback connections, that is, the outputs of the model are not fed back into itself.
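A minimal sketch of one FNN forward pass may help fix the notation. It assumes a single hidden layer of the form $H_t = f(b + W X_t)$, $Y_t = c + V H_t$, $\hat{Y}_t = g(Y_t)$, with tanh as $f$ and softmax as $g$; the weights below are arbitrary illustrative numbers rather than trained values, and the helper names are hypothetical.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def fnn_forward(x, W, b, V, c):
    h = [math.tanh(bi + a) for bi, a in zip(b, matvec(W, x))]  # hidden state
    y = [ci + a for ci, a in zip(c, matvec(V, h))]             # output scores
    return softmax(y)                                          # probabilities

# Illustrative setup: 3 previous binary samples, 2 hidden nodes, 2 classes.
x = [0, 1, 1]
W = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]
b = [0.0, 0.1]
V = [[0.5, -0.5], [-0.5, 0.5]]
c = [0.0, 0.0]
print(fnn_forward(x, W, b, V, c))
```

The returned vector holds one probability per value in the sample space, and the predicted next sample is the index with the largest probability.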

RNNs.
If feedback connections are added to the network, it is called an RNN. In particular, the RNN records the information that has been calculated so far and uses it to calculate the present output. The principle of RNNs is depicted in Figure 2.
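[Figure: network diagram showing the input layer, hidden layer, and output layer.]
For each time step from $t = 1$ to $t = n$, the RNN applies the following forward propagation equations:
$$H_t = f(b + W H_{t-1} + U X_t), \quad Y_t = c + V H_t, \quad \hat{Y}_t = g(Y_t). \quad (6)$$
The parameters and functions that govern the computation in an RNN are as follows: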

Security and Communication Networks
(i) $X_t$ is the input at time step $t$ and is a one-hot vector. For example, if the sample at step $t$ takes the value 1, the entry of $X_t$ that corresponds to the value 1 is set to one and all other entries are zero.
(ii) $H_t$ is the hidden state at time step $t$. It is the "memory" of the network. $H_t$ is calculated from the previous hidden state $H_{t-1}$ and the input at the current step $X_t$. $b$, $U$, and $W$ denote the bias vector, the input-to-hidden weights, and the hidden-to-hidden weights of the RNN cell, respectively.
(iii) $Y_t$ is the output at step $t$. $c$ and $V$ denote the bias vector and the hidden-to-output weights, respectively.
(iv) $\hat{Y}_t$ is our predictive output at time step $t$, which is a vector of probabilities across the sample space.
(v) Similarly, the function $f(\cdot)$ is an activation function and $g(\cdot)$ is an output function, both defined before training (Section 3.2).
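The recurrence can be sketched in a few lines. The example below assumes the update $H_t = f(b + W H_{t-1} + U X_t)$ with relu as $f$ and softmax as $g$; the weights, the two-value sample space, and the helper names `one_hot` and `rnn_step` are illustrative assumptions, not trained parameters or the authors' code.

```python
import math

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(value, size):
    """One-hot encoding of `value` in a sample space {0, ..., size - 1}."""
    return [1.0 if i == value else 0.0 for i in range(size)]

def rnn_step(x, h_prev, U, W, b, V, c):
    pre = [bi + wh + ux
           for bi, wh, ux in zip(b, matvec(W, h_prev), matvec(U, x))]
    h = [max(0.0, a) for a in pre]                    # relu activation
    y = [ci + vh for ci, vh in zip(c, matvec(V, h))]  # output scores
    return h, softmax(y)

U = [[0.5, -0.5], [-0.5, 0.5]]
W = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
V = [[1.0, 0.0], [0.0, 1.0]]
c = [0.0, 0.0]

h = [0.0, 0.0]                      # initial hidden state
for sample in [0, 1, 1, 0]:
    h, probs = rnn_step(one_hot(sample, 2), h, U, W, b, V, c)
print(h, probs)
```

The hidden state `h` is the only thing carried from one step to the next, which is what lets the network use information from samples earlier than the current input.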

Predictors for Min-Entropy Estimation Based on Neural Network
Neural networks are able to approximate various PDFs, and their complexity grows slowly (linearly) as the sample space increases. Motivated by [24], we propose two predictive models based on neural networks for min-entropy estimation. Next, we present the execution strategy of our min-entropy estimators, describe the choices of the important hyperparameters, and analyze the accuracy and complexity of our predictive models to show that our design is feasible.

Strategy of Our Predictors for Min-Entropy Estimation.
The execution strategy of our min-entropy estimator is depicted in Figure 3 and consists of model training and entropy estimation. Both of our proposed predictive models (namely, predictors), which are based on the FNN and the RNN, respectively, follow the same strategy. The benefit of this strategy is that it applies not only to stationary sequences generated by entropy sources but also to nonstationary sequences whose probability distribution is time-varying. On the one hand, so that the model matches the statistical behavior of the data source well, we use the whole input dataset to train and continuously update the model. On the other hand, to estimate the entropy of the data source effectively, we use the predictive model to compute the min-entropy only once it has been updated enough to characterize the statistical behavior of the tested dataset. Specifically, the testing dataset used for computing the entropy estimation is preset as a part of the whole set of observations, and a proportion parameter $c \in [0, 1]$ determines its size: the last fraction $c$ of the inputs is used for computing entropy while the model continues to be updated. The workflow of the neural network-based min-entropy estimator is summarized in Figure 3.

Combining this with the calculation principle of min-entropy in Section 2.1, we can see that a lower bound on the probability of making a correct prediction gives an upper bound on the entropy of the source. In other words, the more predictable a source is, the larger the probability of making correct predictions and the less entropy it has. Therefore, a model that is a bad fit for the source or is not fully trained will produce inaccurate predictions, a low probability of correct prediction, and a too-high entropy estimation. Such models can give large overestimates but not underestimates.
Further, adding one more predictor does no harm and can make the entropy estimation more accurate. From the execution strategy, we can see that if predictors whose models do not match the noise source are used alongside a predictor whose underlying model matches the source's behavior well, then the predictor that matches the source well determines the final entropy estimation.
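The predict-then-update loop with a scored tail can be sketched as follows. To keep the example self-contained, a simple running most-common-value model stands in for the FNN or RNN, and the entropy is a plain point estimate from the hit rate without any confidence bound. Both are simplifying assumptions, and the function name `estimate_with_tail` is hypothetical.

```python
import math
from collections import Counter

def estimate_with_tail(samples, c):
    """Update the model on every sample, but score only the last fraction c."""
    counts = Counter()                              # stand-in predictive model
    start = len(samples) - round(c * len(samples))  # first scored index
    hits = scored = 0
    for t, x in enumerate(samples):
        if t >= start and counts:
            guess = counts.most_common(1)[0][0]     # predict before seeing x
            hits += guess == x
            scored += 1
        counts[x] += 1                              # the model keeps updating
    if scored == 0:
        raise ValueError("no samples were scored; increase c")
    p = hits / scored
    return -math.log2(p) if p > 0 else math.inf

# Hypothetical stationary source in which 0 appears three times out of four.
data = [0, 0, 0, 1] * 250
print(estimate_with_tail(data, 0.2))   # score only the last 20%
print(estimate_with_tail(data, 1.0))   # score every sample
```

With `c = 0.2` the early, poorly trained phase of the model is excluded from scoring, whereas `c = 1.0` counts every prediction, which is the setting the text uses for nonstationary sources.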

Hyperparameters for FNN and RNN.
In neural networks, the choice of hyperparameters has a significant influence on the performance of the model and on the computational resources required for training and testing. The choice of hyperparameters is therefore crucial. Next, we describe the choices of some key hyperparameters.
(1) Hidden Layers and Nodes. To balance the accuracy and efficiency of our predictors, for the FNN model we set the number of hidden layers to 2 and the number of hidden nodes per layer to 10 and 5, respectively, for all sources except the multivariate M-sequences. For the multivariate M-sequences, extensive tests showed that more hidden nodes per layer are needed to give better results. Based on the observed results, we finally set the numbers to 35 and 30, respectively.
(2) Step of Memory. The step of memory determines the number of previous samples used for predicting the current output. Generally speaking, the larger the value, the better the performance. However, the computational resources (memory and runtime) increase as the step of memory grows. In this paper, we set the step of memory to 20 as a trade-off between performance and resources. That is to say, for the FNN, the input at time step $t$ is the previous 20 observed values, and for the RNN, the hidden layer contains 20 unfolded hidden units.
(3) Loss Function. The loss over a training sequence is the negative log-likelihood of the observed outputs,
$$L = -\sum_{t} \log p_{model}(y_t \mid x_1, \ldots, x_t), \quad (7)$$
where $p_{model}(y_t \mid x_1, \ldots, x_t)$ is given by reading the entry for $y_t$ from the model's output vector $\hat{y}_t$. The models are trained to minimize the cross-entropy between the training data and the models' predictions (i.e., equation (7)), which is equivalent to maximizing the likelihood of the training data under the model.
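As a concrete reading of the cross-entropy objective, the sketch below sums $-\log p_{model}(y_t)$ over a sequence, where each probability is read from the model's output vector at the index of the observed sample. The function name `cross_entropy_loss` and the numbers are illustrative assumptions.

```python
import math

def cross_entropy_loss(pred_dists, targets):
    """Sum over time steps of -log(probability assigned to the true sample)."""
    return -sum(math.log(dist[y]) for dist, y in zip(pred_dists, targets))

# Hypothetical model outputs for two time steps over a binary sample space.
pred_dists = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]          # observed samples at those steps
print(cross_entropy_loss(pred_dists, targets))

# A more confident correct prediction yields a smaller loss.
print(cross_entropy_loss([[0.9, 0.1]], [0]), cross_entropy_loss([[0.6, 0.4]], [0]))
```

Only the probability assigned to the sample that actually occurred contributes at each step, so lowering the loss directly raises the model's chance of predicting the next sample.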
(4) Learning Rate. The learning rate is a positive scalar that determines the step size. To control the effective capacity of the model, the learning rate must be set in an appropriate range. It determines how fast the parameter $\theta$ moves toward its optimum value. If the learning rate is too large, gradient descent can inadvertently increase rather than decrease the training error, that is, the parameters are likely to overshoot the optimal value. If the learning rate is too small, training is not only slower but may become permanently stuck with a high training error. The learning rate is therefore crucial to the performance of the model. Based on this analysis, we pick the learning rate approximately on a logarithmic scale, i.e., from the set $\{0.1, 0.01, 10^{-3}, 10^{-4}, 10^{-5}\}$. At the beginning of model training, we set a larger learning rate to approach the optimum faster. Then, as the number of training iterations increases, we set smaller values to avoid overshooting the optimal value. The detailed settings are described in Algorithm 1.
(5) Activation Function. In general, a nonlinear function is needed to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. The activation function plays an important role in neural networks. Commonly used activation functions include $\tanh(\cdot)$, $\mathrm{relu}(\cdot)$, and the sigmoid function, denoted $\sigma(\cdot)$ in this paper (equation (8)). Because the sigmoid function saturates easily, which causes the gradient to change slowly during training, it is generally no longer used as an activation function except in RNN-LSTM (long short-term memory):
$$\sigma(x) = \frac{1}{1 + e^{-x}}. \quad (8)$$
After many attempts (i.e., we manually compared efficiency and performance by exhaustive search), we finally chose $\tanh(\cdot)$ and $\mathrm{relu}(\cdot)$ as the activation functions for the FNN and the RNN, respectively. They can be expressed as
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \quad (9)$$
Compared with $\sigma(\cdot)$, $\tanh(\cdot)$ is symmetric about the origin. In some cases, this symmetry gives better performance. It compresses the real-valued input to the range from $-1$ to 1, and its output is zero-centered, which makes it converge faster than $\sigma(\cdot)$ and reduces the number of iterations. It is therefore suitable as an activation function, and a zero-centered training dataset further contributes to the convergence speed of model training:
$$\mathrm{relu}(x) = \max(0, x), \quad (10)$$
where $\mathrm{relu}(\cdot)$ is currently a popular activation function. It is piecewise linear and computes the activation value with only one threshold. We chose this function based on two considerations. On the one hand, it mitigates the vanishing gradient problem of the back propagation through time (BPTT) algorithm, because the derivative of $\mathrm{relu}(\cdot)$ is 1 for positive inputs. On the other hand, it greatly speeds up the calculation, because it only needs to judge whether the input is greater than 0.
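The three activation functions discussed here are short enough to write out directly. The sketch below uses their standard definitions and also checks the identity $\tanh(x) = 2\sigma(2x) - 1$, which shows that tanh is a rescaled, zero-centered sigmoid. The sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))

# tanh is a shifted and rescaled sigmoid, hence symmetric about the origin.
x = 0.3
print(abs(tanh(x) - (2 * sigmoid(2 * x) - 1)) < 1e-12)
```

The sigmoid maps 0 to 0.5 while tanh maps 0 to 0, which is the zero-centering property the text refers to, and relu needs only a comparison with 0.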
(6) Output Function. The output function is used in the final layer of a neural network model. Prediction of a time series is treated as a multiclass classification problem, so we take $\mathrm{softmax}(\cdot)$ as the output function, which can be expressed as
$$\hat{y}_i = \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{s} e^{z_j}}, \quad (11)$$
where $s$ is the size of the sample space and $\mathrm{softmax}(z_i)$ denotes the probability that the output is the $i$th value of the sample space. It satisfies $\sum_{i=1}^{s} \hat{y}_i = 1$, i.e., the probabilities of all outputs sum to 1. Such networks are commonly trained under a cross-entropy regime (i.e., the loss function mentioned above).

Selection of Testing Dataset Length.
To better estimate the entropy of the data source, the length of the testing dataset is very important for min-entropy estimation of random numbers generated by different types of sources. In reality, most entropy sources are time-varying (namely, nonstationary), which means the probability distribution of the output sequences changes over time. The length of the testing dataset should therefore adapt to the type of the source, and, as described in Section 3.1, we use $c$ to determine its size. Specifically, for a stationary entropy source, whose output probability distribution does not change over time, the parameter $c$ is preset to 20%. For a nonstationary entropy source, by contrast, all observation points (namely, $c$ is 100%) need to serve as the testing dataset.
To verify the reasonableness of the c value, we compute the root-mean-squared error (RMSE) of the lowest estimations of our predictors over 80 sequences from the following simulated datasets generated by a nonstationary source: (i) Time-Varying Normal Distribution Rounded to Integers. The samples are subject to a normal distribution and rounded to integer values, but the mean of the distribution moves along a sine curve to simulate a time-varying signal.
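A source of this kind can be simulated as in the following sketch. The standard deviation, the amplitude and period of the sine curve, and the function name are hypothetical values chosen for illustration; the text does not specify the parameters actually used.

```python
import math
import random

def time_varying_normal(n, sigma=2.0, amplitude=3.0, period=1000, seed=0):
    """Draw n integer samples from a normal distribution whose mean follows a sine curve.

    All default parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        mean = amplitude * math.sin(2.0 * math.pi * i / period)  # drifting mean
        samples.append(round(rng.gauss(mean, sigma)))            # round to an integer
    return samples

data = time_varying_normal(5000)
print(data[:10])
```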

The RMSE, i.e., $\sqrt{(1/N)\sum_{i=1}^{N}(\hat{H}_{\min} - H_{\min})^{2}}$, refers to the arithmetic square root of the mean of the squares of the errors or deviations for each class of simulated sources. Note that here N indicates the number of test samples, $\hat{H}_{\min}$ indicates the estimated result for each sample, and $H_{\min}$ means the theoretical result for each sample. In other words, the smaller the RMSE is, the closer the estimated result is to the theoretical entropy, which indicates that the predictor has better accuracy.
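The RMSE computation can be written out directly from its definition. The sketch below is a minimal illustration with hypothetical names and made-up input values; it is not tied to the data in Table 1.

```python
import math

def rmse(estimated, theoretical):
    """Root-mean-squared error between estimated and theoretical min-entropy values."""
    if len(estimated) != len(theoretical) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(e - t) ** 2 for e, t in zip(estimated, theoretical)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up example: two test samples, each estimate off by one bit.
print(rmse([1.0, 3.0], [0.0, 4.0]))
```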
As shown in Table 1, for the time-varying data source, the predictors give the most accurate results only when c is 100% (namely, the entire dataset is used for min-entropy estimation). This means that when the probability distribution of a data source varies with time, a part of the input dataset cannot represent the overall distribution of the input dataset, so an estimate obtained from that part cannot accurately reflect the entire input dataset. Besides, for stationary sources, it is reasonable to preset c to 20% because the estimated results obtained by our method are very close to the correct (theoretical) entropy of the selected entropy source, as presented in Section 4.1.

Evaluation on Our Predictors.
In this section, we conduct experiments on simulated datasets to verify the accuracy of our proposed predictors for min-entropy estimation and compare the experimental results with the theoretical results. In addition, we give a theoretical analysis of the complexity of our predictors. Note that, in Section 4, we will apply our predictors to different data sources and compare them with 90B's predictors.

Accuracy Verification.
We train our predictive models FNN and RNN on a number of representative simulated data sources (including stationary and nonstationary entropy sources), whose theoretical entropy can be obtained from the known probability distribution of the outputs. Simulated datasets are produced using the following distribution families adopted in [24]: (1) Simulated Datasets Generated by Stationary Sources. (i) Discrete Uniform Distribution. All samples are equally likely and come from an IID source. (ii) Discrete Near-Uniform Distribution. The samples come from an IID source in which all samples are equally likely except one, which has a higher probability than the rest. (iii) Normal Distribution Rounded to Integers. The samples are subject to a normal distribution and rounded to integer values, and they come from an IID source. (iv) Markov Model. The samples are generated using a $d$th-order Markov model and come from a non-IID source.
(2) Simulated Datasets Generated by Nonstationary Sources. These datasets are the same as those used in Section 3.2.2. For every class listed above, we generate a set of 80 simulated datasets, each of which contains $10^6$ samples, and estimate the min-entropy by using the predictive models FNN and RNN, respectively. For each dataset, the theoretical min-entropy $H_{\min}$ is derived from the known probability distribution.
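For an IID source, the theoretical min-entropy per sample follows from the most likely symbol, $H_{\min} = -\log_2(\max_i p_i)$. The sketch below applies this to a discrete near-uniform distribution; the helper names and the chosen probabilities are hypothetical and serve only as a worked example.

```python
import math

def min_entropy(probabilities):
    """Min-entropy per sample of an IID source: -log2 of the largest probability."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -math.log2(max(probabilities))

def near_uniform(s, p_high):
    """One symbol has probability p_high; the other s - 1 symbols share the rest equally."""
    rest = (1.0 - p_high) / (s - 1)
    return [p_high] + [rest] * (s - 1)

print(min_entropy([0.25] * 4))             # uniform over 4 symbols
print(min_entropy(near_uniform(8, 0.5)))   # near-uniform over 8 symbols
```

For non-IID sources such as the Markov model, the theoretical value must instead be derived from the transition probabilities, which this sketch does not cover.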
In Figures 4-9, the abscissa represents the theoretical entropy of the test sample, and the ordinate represents the estimated entropy of the test sample. Figure 4 shows the estimated entropy results for the 80 simulated datasets with uniform and near-uniform distributions, respectively. From Figures 4(a) and 4(b), we see that the estimated results given by our two proposed predictive models (FNN and RNN) are almost consistent with the theoretical entropy for both uniform and near-uniform distributions. So, the final estimated result, which is the minimum of the results of the two predictive models, is also basically consistent with the theoretical entropy. Figure 5 shows the estimated entropy results for the 80 simulated datasets with normal distributions and time-varying normal distributions, respectively. From Figures 5(a) and 5(b), we can see that the estimated results given by our two proposed predictive models are close to the theoretical entropy for both normal distributions and time-varying normal distributions. According to our execution strategy, here we calculate the min-entropy estimations using the whole input dataset for time-varying normal distributions. Figure 6 shows the estimated results for Markov distributions; we can see that both of our predictive models give a number of overestimates when applied to the Markov sources, particularly as the theoretical entropy increases. Table 2 shows the relative errors (namely, $|(\hat{H}_{\min} - H_{\min})/H_{\min}| \times 100\%$) between the theoretical results and the estimated results of FNN and RNN to further reflect the accuracy of the models. $\hat{H}_{\min}$ and $H_{\min}$ have the same meaning as in Section 3.2.2. We see that the entropy is estimated with an error of less than 6.02% for FNN and 7% for RNN over the simulated classes.
Based on the above accuracy verification of our predictors with simulated datasets from different distributions, we can be sure that our predictors give almost accurate results except for Markov distributions.

Complexity Analysis.
To analyze the usability of our predictors in terms of execution efficiency, we derive the following computational complexity through the analysis of theory and principle of implementation.
We believe that the computational complexity of entropy estimators used for RNG evaluation mainly comes from the sample space and sample size. For ease of analysis, we define the following parameters: n is the sample size, which indicates the length of the sample; s is the sample space, which means the number of distinct symbols in the sample (e.g., $s = 8$ means there are 8 symbols in the sample, and the bit width of each symbol is $\log_2(8) = 3$, such as 010, 110, 111, ...); and k denotes the maximum step of correlation, which is set as a constant in 90B's predictors ($k = 16$) and our predictors ($k = 20$).
Through the analysis of the implementation, the computational complexity of the final 90B's predictors [23] mainly comes from the MultiMMC predictor and is of order $O(s^{k} \cdot n + 2^{k} \cdot n \cdot \log_2(s))$, which is linear in n and a k-th-order polynomial in s. By contrast, the computational complexity of our predictors is of order $O(s \cdot n)$, which is linear in both s and n. It can be seen that the computational complexity of our predictors is much lower than that of the 90B's predictors.
It is important to note that the MultiMMC predictor requires $s^{k} \ll n$; otherwise, this predictor cannot give statistically accurate estimated results. That is to say, as s increases, the MultiMMC predictor requires a larger sample size in order to estimate the entropy accurately.
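To give a feeling for the gap between the two orders of growth, the following sketch simply evaluates the expressions $s^{k} \cdot n + 2^{k} \cdot n \cdot \log_2(s)$ and $s \cdot n$ for a few sample spaces. The function names are hypothetical, constant factors are ignored, and the numbers are operation-count orders only, not predictions of actual running time.

```python
import math

def multimmc_order(n, s, k=16):
    """Order of growth stated for the MultiMMC predictor: s^k * n + 2^k * n * log2(s)."""
    return s ** k * n + 2 ** k * n * math.log2(s)

def nn_order(n, s):
    """Order of growth stated for the neural-network predictors: s * n."""
    return s * n

n = 10 ** 6
for bits in (1, 2, 4, 8):
    s = 2 ** bits
    ratio = multimmc_order(n, s) / nn_order(n, s)
    print(f"s = 2^{bits}: ratio of orders = {ratio:.3e}")
```

Even for binary samples, $s^{k} = 2^{16} = 65536$ is already a noticeable fraction of $n = 10^6$, which illustrates why the condition $s^{k} \ll n$ becomes restrictive as s grows.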
From the above analysis, we can see our predictors have lower computational complexity. We will give the experimental proof in Section 4.3.

Comparison of Our Predictors with 90B's Predictors
In this section, a large number of experiments are carried out to evaluate our proposed predictors for entropy estimation in terms of accuracy, applicability, and efficiency by applying them to different simulated data and real-world data. In these experiments, we compare the results with those of the final 90B's predictors [23] to highlight the advantages of our work. Similarly, our predictors in these experiments compute an upper bound of the min-entropy estimation at the significance level $\alpha = 0.01$, which is the same as for 90B's predictors.
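As a rough illustration of how a predictor's hit count is turned into a conservative entropy estimate, the sketch below follows the global predictability bound as we understand it from SP 800-90B: the observed success rate is inflated by a confidence term with the constant 2.576, and the min-entropy is the negative base-2 logarithm of the inflated rate. The function names are hypothetical, and the local (longest-run) bound that 90B also applies is deliberately omitted, so this is a simplified sketch rather than the full procedure.

```python
import math

Z_ALPHA = 2.576  # constant used in the 90B global bound (assumed here for alpha = 0.01)

def global_predictability_upper_bound(correct, total):
    """Upper bound on the predictor's success probability from `correct` hits in `total` tries."""
    p = correct / total
    if p == 0.0:
        return 1.0 - 0.01 ** (1.0 / total)
    return min(1.0, p + Z_ALPHA * math.sqrt(p * (1.0 - p) / (total - 1)))

def min_entropy_estimate(correct, total):
    """Per-sample min-entropy implied by the bounded success probability (global bound only)."""
    return 0.0 - math.log2(global_predictability_upper_bound(correct, total))

print(min_entropy_estimate(500_000, 1_000_000))  # slightly below 1 bit per sample
```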

Simulated Data.
The simulated datasets are produced using the same distribution families as described in Section 3.3.1. Further, we append the following two new distribution families, namely a pseudorandom sequence and a postprocessing sequence, which are representative and commonly used in reality: (i) M-Sequence. (ii) Nonuniform Distribution by Postprocessing Using LFSR. The samples are processed using a linear feedback shift register (LFSR) and come from an IID source [35].
For every distribution mentioned above, we also generate a set of 80 simulated datasets, each of which contains $10^6$ samples, and estimate the min-entropy by using our proposed predictors and the final 90B's predictors [23]. Figure 7 shows the estimated min-entropy results for the 80 simulated datasets with uniform distributions and near-uniform distributions, respectively. From Figures 7(a) and 7(b), we see that several points of the results obtained from the 90B's predictors are apparently underestimated, which may result from overfitting. Compared with 90B's predictors, our predictors provide more accurate results. Figure 8 shows the estimated min-entropy results for normal distributions and time-varying normal distributions, respectively. From Figures 8(a) and 8(b), we can see that the estimated results given by our predictors are close to the theoretical entropy for both normal distributions and time-varying normal distributions. However, the lowest entropy estimation results obtained from the 90B's predictors are significant underestimates. Figure 9 shows the estimated min-entropy results for Markov distributions. We can see that the 90B's predictors almost always give underestimates compared with the theoretical entropy, while the estimated results given by our predictors are much closer to the theoretical entropy than those obtained from 90B's predictors.
To compare the accuracy of our predictors and 90B's predictors more clearly, we apply the predictors to the M-sequence and to the nonuniform distribution sequence postprocessed using LFSR, whose theoretical entropy is a known and fixed value.
It is further confirmed that the higher-stage (the stage is the maximum step of correlation) M-sequence and nonuniform distribution sequence postprocessed using LFSR are able to pass the NIST SP 800-22 statistical tests [8]. The estimated results are listed in Tables 3 and 4, and the lowest entropy estimations from 90B's predictors and our predictors for each stage are shown in bold font.
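For readers unfamiliar with M-sequences, the following sketch generates one with a small Fibonacci-type LFSR. The 4-stage register with feedback taken from stages 4 and 3 is an illustrative assumption chosen because its period ($2^4 - 1 = 15$) is easy to check by hand; it is not one of the stages used in Tables 3 and 4, and the function name is hypothetical.

```python
def lfsr_m_sequence(taps, state, length):
    """Generate `length` output bits from a Fibonacci LFSR.

    taps  : 1-based stage indices whose bits are XORed to form the feedback bit
    state : initial register contents (list of 0/1, not all zero)
    """
    state = list(state)
    bits = []
    for _ in range(length):
        bits.append(state[-1])          # output the last stage
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]    # XOR of the tapped stages
        state = [feedback] + state[:-1] # shift by one stage and insert the feedback bit
    return bits

sequence = lfsr_m_sequence(taps=[4, 3], state=[1, 0, 0, 0], length=30)
print(sequence[:15])
```

Because every output bit is a fixed function of the preceding register contents, a predictor that captures a correlation step at least as long as the stage can in principle predict the sequence, which is what the stage-dependent results discussed next reflect.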
For the M-sequence and the nonuniform distribution postprocessed using LFSR, the MultiMMC predictor presented in the final 90B gives the most accurate entropy estimation results when the stage is at most 16. However, when the stage of the M-sequence and of the nonuniform distribution postprocessed using LFSR is greater than 16, the MultiMMC predictor cannot give an accurate entropy estimation result because this predictor is parameterized by $k \in \{1, 2, \ldots, 16\}$ (k is the maximum step of correlation). We could set the parameter of the MultiMMC predictor to a greater range to achieve a more accurate estimated result for the higher stages, but the time complexity grows exponentially with the parameter k, as we analyzed in Section 3.3.2. Moreover, the FNN model can also give accurate estimated results, even when the stages of the M-sequence and LFSR are greater than 16. However, the RNN model can give accurate estimated results only when the stage is 8. Therefore, the FNN model is better matched to the M-sequence and the nonuniform distribution postprocessed using LFSR than the RNN model. We also compute the relative errors of the estimated results from 90B's predictors and our predictors over 80 sequences from each class of simulated sources. We calculate the relative errors using the min-entropy obtained from 90B's predictors (the lowest estimation result of 90B's four predictors) and our predictors (the lowest estimation result of FNN and RNN), respectively. As illustrated in Table 5, for all five classes of simulated sources, the errors of our predictors are lower than those of the 90B's predictors. In particular, our approaches enable the entropy to be estimated with an error of less than 6.55%, whereas the error is up to 14.65% for 90B's predictors. Overall, this indicates that our proposed predictors are more accurate than 90B's predictors for both stationary and nonstationary sequences, which is consistent with the conclusion drawn from the figures above.
From Tables 2-4, we also find that the accuracy of the RNN predictive model is slightly higher than that of the FNN predictive model, except for the cases of the Markov sources, M-sequence, and nonuniform distribution by postprocessing using LFSR.
We will further verify the applicability for time-varying sources in Section 4.2. Therefore, through the evaluation of the entropy estimation results on the above simulated datasets, we see that our proposed predictors are superior in accuracy to the 90B's predictors.

Real-World Data.
We further apply our predictors to datasets generated by RNGs deployed in the real world. In fact, the theoretical entropy per sample is unknown for these real-world sources, so the errors of the predictors cannot be compared as they were for the simulated datasets. However, the estimated results from the predictors presented here can still be compared to those of the 90B's predictors, based on the knowledge that underestimates from the predictors have theoretical bounds.
Datasets of real-world data are produced using the following approaches. The first two are adopted in [24], and the others are commonly used typical RNGs. The estimations for the real-world sources are presented in Table 6.
(i) RANDOM.ORG. This is a service that provides random numbers based on atmospheric noise and is used in [24]. It allows the user to specify the ...
(ii) Ubld.it TrueRNGpro. This USB random number generator, produced by Ubld.it, provides a stream of random numbers through a USB CDC serial port. This entropy source is also used in [24]. The sequence used here consists of bits.
(iii) Linux kernel entropy source. The Linux kernel random generator is used for the generation of a real-world sequence without any processing. The sequence used here is the last bit of each symbol.
(iv) Linux /dev/urandom. The /dev/urandom [6] of Linux is used for the generation of a real-world sequence with strict processing. The sequence used here consists of bits.
(v) Windows RNG. Windows RNG [5] is used for the generation of a real-world sequence by calling a Crypto API. The sequence used here consists of bits.
As illustrated in Table 6, the lowest entropy estimation for each source is shown in bold font. We see that our predictors perform better than 90B's predictors, because the lowest entropy estimation for each real-world source is always obtained from our work. Furthermore, for the Linux kernel entropy source, we find that both the Lag and MultiMMC predictors are able to give lower estimation results. This indicates that the Linux kernel entropy source has periodicity and conforms to the Markov model, which is understandable because the randomness of the Linux kernel entropy source comes from human behaviors, such as manipulating the mouse and keyboard. In our work, compared with the entropy estimations for the other real-world sources, FNN fits much better than RNN for the Linux kernel entropy source, which is consistent with the previous view that FNN performs well in testing Markov sources.

Comparison on the Scope of Applicability.
After evaluating the accuracy, we further validate the scope of applicability of our proposed predictors and compare it with that of the 90B's predictors. Kelsey et al. [24] stated that each of the 90B's predictors performs well only for a special distribution, as described in Section 2.2.1. To show that our predictors have better applicability, the following four simulated datasets are generated, each of which is suitable for one predictor employed in the final 90B: (i) Time-Varying Sources. The probability distribution of the data source varies with time. The MCW predictor predicts the current output according to previous outputs in a short period of time, and thus the MCW predictor performs well on these data sources. (ii) Periodic Sources. The data source changes periodically. The lag predictor predicts the value that occurred N samples back in the sequence as the current output, and thus the lag predictor performs well on sources with strong periodic behavior. (iii) Markov Sources. The data sources can be modeled by the Markov model. The MultiMMC predictor predicts the current output according to the Markov model, and thus the MultiMMC predictor performs well on data from any process that can be accurately modeled by a Markov model. (iv) LZ78Y Sources. The data sources can be efficiently compressed by LZ78-like compression algorithms, to which the LZ78Y predictor applies well.
For each simulated source above, we generate a set of 10 simulated datasets, each of which contains $10^6$ samples, and the min-entropy is estimated by our predictors and 90B's predictors. The final result for a predictor is the average value of the 10 estimated results corresponding to the 10 simulated datasets for one simulated source.
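To make the idea of a source-specific predictor concrete, the sketch below implements a simplified lag predictor with a single fixed lag and applies it to a strictly periodic sequence. The function names, the example pattern, and the omission of any confidence bound are our assumptions; the lag predictor in 90B is more elaborate than this.

```python
import math

def lag_hit_rate(samples, lag):
    """Fraction of positions at which the sample equals the one `lag` positions back."""
    if not 0 < lag < len(samples):
        raise ValueError("lag must be between 1 and len(samples) - 1")
    hits = sum(1 for i in range(lag, len(samples)) if samples[i] == samples[i - lag])
    return hits / (len(samples) - lag)

def entropy_from_hit_rate(rate):
    """Per-sample min-entropy implied by a prediction hit rate (no confidence bound)."""
    return math.inf if rate == 0 else 0.0 - math.log2(rate)

periodic = [0, 1, 1, 0, 1] * 200           # hypothetical source with period 5
for lag in (3, 5):
    rate = lag_hit_rate(periodic, lag)
    print(lag, rate, entropy_from_hit_rate(rate))
```

When the lag matches the period, every prediction is correct and the implied entropy is zero, which is the behavior expected for strongly periodic sources.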

Time-Varying Sources.
Firstly, we generate time-varying binary data that suit the statistical behaviors of the MCW predictor presented in the 90B. Table 7 shows the entropy estimation results for time-varying data.
As shown in Table 7, the symbol gradual(x) (x ∈ [0, 1], the same below) is defined as a simulated source in which the probability of output "0" changes gradually from x to 1 − x with time. The symbol period(x) is defined as a simulated source in which the probability of output "0" changes periodically with time, and the probability varies from x to 1 − x in one period. The period length is set to 20% of the entire input dataset. The symbol sudden(x) is defined as a simulated source in which the probability of output "0" changes suddenly with time, namely, the probability is set to x for the first half of the input dataset and 1 − x for the last half.
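The three time-varying sources can be simulated as in the sketch below. The text fixes only the endpoints x and 1 − x and the period length; the linear ramp used for gradual(x) and inside each period of period(x), together with the function names, is our assumption for illustration.

```python
import random

def prob_zero(kind, x, i, n):
    """Probability of output 0 at position i of an n-sample sequence (n >= 10 assumed)."""
    if kind == "gradual":
        return x + (1 - 2 * x) * i / (n - 1)                    # assumed linear drift
    if kind == "period":
        period = n // 5                                         # 20% of the dataset
        return x + (1 - 2 * x) * (i % period) / (period - 1)    # assumed linear ramp per period
    if kind == "sudden":
        return x if i < n // 2 else 1 - x                       # switch at the midpoint
    raise ValueError(f"unknown source type: {kind}")

def generate(kind, x, n, seed=0):
    """Draw n bits whose probability of being 0 follows the chosen time-varying rule."""
    rng = random.Random(seed)
    return [0 if rng.random() < prob_zero(kind, x, i, n) else 1 for i in range(n)]

bits = generate("sudden", 0.2, 1000, seed=1)
print(sum(bits[:500]), sum(bits[500:]))   # number of ones in each half
```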
From Table 7, the estimation results for the MCW predictor and our work are shown in bold font. We see that the MCW predictor gives the lowest and most accurate entropy estimations for the three types of time-varying data mentioned above, but it gives slight underestimates at gradual(0.2) and period(0.2). It is confirmed that the time-varying sources mentioned above match the statistical behaviors of the MCW predictor. By comparison, we find that our proposed predictive models are all capable of obtaining satisfactory entropy estimations that are close to the correct values. Therefore, our proposed predictive models are shown to be suitable for the time-varying data mentioned above. Note that we calculate the min-entropy estimate according to the entire dataset rather than the last 20% of the input dataset for these time-varying sources, because the probability distribution varies with time and a part of the input dataset cannot represent the overall distribution of the input dataset.

Periodic Sources.
From Table 8, the estimation results for the lag predictor and our work are shown in bold font. According to the correct entropy (equal to 0) of the simulated periodic sources, we confirm that the lag predictor is suitable for the entropy estimation of this type of source, as expected. By comparison, the RNN can also give accurate min-entropy estimates, i.e., the estimated results are zeros. Thus, our proposed predictive models are suitable for the entropy estimation of the (strong) periodic data. In addition, the MultiMMC predictor can also give accurate min-entropy estimations. This is reasonable because periodicity is also a form of correlation.

Markov Sources.
Next, we generate multivariate M-sequences as Markov sources that fit the statistical behaviors of the MultiMMC predictor. Specifically, the multivariate M-sequences are composed of multiple M-sequences with different initial states. Because this type of sequence is deterministic, the correct entropy is zero. The bit width of the samples is also traversed from 2 to 8. The maximum step of correlation used here is set to 8. Table 9 shows the estimated results for multivariate M-sequences.
From Table 9, the estimation results for the MultiMMC predictor and our work are shown in bold font. According to the correct entropy (equal to 0) of the simulated Markov sources, we confirm that the MultiMMC predictor is suitable for the entropy estimation of this type of source, as expected. By comparison, the RNN can also give accurate min-entropy estimations, i.e., the estimated results are zeros. Thus, our proposed predictive models are suitable for the Markov sources.

LZ78Y Sources.
Finally, we verify the applicability to the LZ78Y sources. This type of entropy source is difficult to generate by simulation. However, we can still conclude that our proposed predictive models can be applied to the LZ78Y sources according to the results shown in italic font in Tables 8 and 9, because the periodic data and Markov sequences are compressible.

Summary on Applicability Scope of Our Predictors.
By analyzing the experimental results of the above four specific simulated sources, each of which is oriented towards a certain predictor in the 90B, we conclude that our predictors can provide accurate entropy estimates. So, the proposed predictors apply to these entropy sources as well as the 90B's predictors do. In addition, compared with 90B's predictors, our predictors have a better performance on the scope of applicability for testing the ...

Table 10 shows the mean execution time of our predictors in comparison with that of the final 90B's predictors and the second draft of 90B's predictors. Each experimental result in Table 10 is the average value obtained from 50 repeated experiments. Note that the definitions of the parameters n, s, and k are the same as in Section 3.3.2.
From the mean execution times listed for different scales ({n, s}) in Table 10, it can be seen that when $n = 10^6$, the mean execution time of our predictors is much lower and increases more slowly with s than that of the final 90B's predictors. In other words, the average execution efficiency of our predictors is about 7 to 10 times higher than that of the final 90B's predictors for different sample spaces s when the sample size n is $10^6$. In particular, when $n = 10^8$, the mean execution time of the final 90B's predictors far exceeds that of our predictors regardless of the size of the sample space and is too long (over three days) to obtain the estimated results in the case $s \geq 2^2$. In terms of the execution efficiency of 90B's predictors, we also find that the mean execution time of the final 90B's predictors is much higher than that of the second draft of 90B's predictors. Actually, the final 90B's mean execution time is about twice that of the second draft of 90B. This could be caused by the characteristics of some estimators that are limited to binary inputs: the collision estimator, Markov estimator, and compression estimator are only suitable for binary input (0 or 1), as stated in [23]. So, for nonbinary inputs, the 90B's estimators will not only calculate the original symbol entropy but also convert the input into binary form to calculate the bit entropy and finally obtain the min-entropy. This will greatly increase the mean execution time.

General Discussion.
For most of the entropy sources that have been tested, the RNN gives more accurate estimations than the FNN. The better accuracy of the RNN predictive model may be due to the following reasons. On the one hand, RNN adds feedback connections to the network, i.e., it considers not only the relationship between the current output and the previous observations but also the relationship among the previous observations. On the other hand, RNN one-hot-encodes the training dataset, which is better suited to forecasting categorical data. By contrast, for Markov sources, the M-sequence, and the nonuniform distribution postprocessed using LFSR, the current output is only related to the previous observations, which fits the FNN predictive model well, and thus the FNN provides more accurate estimated results.

Conclusions and Future Work
Entropy estimation provides a crucial evaluation for the security of RNGs. The predictor serves as a universal sanity check for entropy estimation. In this work, we provide several new approaches to estimate the min-entropy of entropy sources using predictors based on neural networks (i.e., FNN and RNN) for the first time. In particular, we design a novel scheme for the proposed entropy estimation based on neural network models, including the execution strategy and parameter settings. In order to evaluate the quality of the proposed predictors, we collect various types of simulated sources, stationary or nonstationary, whose correct entropy can be derived from the known probability distribution, and the theoretical result is further verified by the experiments on the real-world sources. We also compare our method with the predictors defined in NIST SP 800-90B (published in 2018), which is a commonly used standard for the validation of entropy sources. Our assessment experiments are carried out in three aspects, namely, accuracy, scope of applicability, and computational complexity. The experimental results demonstrate that the entropy estimates obtained from our proposed predictors are more accurate than those of the 90B's predictors, and our predictors have a remarkably wide scope of applicability. In addition, the computational complexity of our predictors is clearly lower than that of the 90B's predictors as the sample space and sample size grow, according to the theoretical analysis. The average execution efficiency of our predictors is about 7 to 10 times higher than that of the 90B's predictors for different sample spaces when the sample size is $10^6$. In particular, the 90B's predictors cannot produce a result because of the huge time complexity when the sample space s is $2^2$ or larger with the maximum step $k = 16$ and sample size $n = 10^8$; by contrast, our method is able to provide a satisfactory result for entropy sources with a large sample space and long dependence. Future work aims at designing specific neural network predictive models for min-entropy estimation for specific entropy sources. Our future work will also focus on applying this new method to estimate entropy in more application areas, such as the randomness sources (sensors and other sources) in mobile terminals.

Table 10: Mean execution time of the final 90B's predictors, the second draft of 90B's predictors, and our predictors for different scales (n, s). "--" marks the cases in which no result was obtained.

(n, s)         Final 90B   Second draft of 90B   Ours
(10^6, 2^2)    1,058       525                   138
(10^6, 2^3)    1,109       574                   149
(10^6, 2^4)    1,235       598                   174
(10^6, 2^5)    1,394       630                   190
(10^6, 2^6)    1,683       785                   186
(10^6, 2^7)    2,077       938                   264
(10^6, 2^8)    2,618       1,298                 272
(10^8, 2^1)    52,274      47,936                9,184
(10^8, 2^2)    --          --                    9,309
(10^8, 2^3)    --          --                    9,385
(10^8, 2^4)    --          --                    9,836
(10^8, 2^5)    --          --                    10,986
(10^8, 2^6)    --          --                    13,303
(10^8, 2^7)    --          --                    17,649
(10^8, 2^8)    --          --                    20,759

Data Availability
RANDOM.ORG data used to support the findings of this study can be accessed from https://www.random.org. The Ubld.it TrueRNGpro, Linux kernel entropy source, Linux /dev/urandom, and Windows RNG data used to support the findings of this study can be obtained from the relevant listed references.