Comprehensive Uncertainty Quantification in Nuclear Safeguards

Nuclear safeguards aim to confirm that nuclear materials and activities are used for peaceful purposes. To ensure that States are honoring their safeguards obligations, quantitative conclusions regarding nuclear material inventories and transfers are needed. Statistical analyses used to support these conclusions require uncertainty quantification (UQ), usually by estimating the relative standard deviation (RSD) in random and systematic errors associated with each measurement method. This paper has two main components. First, it reviews why UQ is needed in nuclear safeguards and examines recent efforts to improve both top-down (empirical) UQ and bottom-up (first-principles) UQ for calibration data. Second, simulation is used to evaluate the impact of uncertainty in measurement error RSDs on estimated nuclear material loss detection probabilities in sequences of measured material balances.


Introduction
Nuclear material accounting (NMA) provides a quantitative basis to detect nuclear material loss or diversion at declared nuclear facilities. NMA involves periodically measuring facility input transfers T_in, output transfers T_out, and physical inventory I to compute a material balance (MB), defined for balance period j as MB_j = (I_{j-1} + T_{in,j} − T_{out,j}) − I_j, where (I_{j-1} + T_{in,j} − T_{out,j}) is the book inventory. In NMA, one MB or a collection of MBs is tested for the presence of any statistically significant large differences and/or for trends, while allowing for random and systematic errors in variance propagation to estimate the measurement error standard deviation of MB_j, denoted σ_MB. Similarly, in verification activities performed by an inspector, paired operator and inspector data are tested for any large differences and/or for trends [1,2]. Therefore, both material balance evaluation and verification activities require statistical analyses, which in turn require UQ.
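The material balance computation above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names and the kilogram values are hypothetical, and the propagated SD here simply adds independent per-term SDs in quadrature (a simplification; as discussed later, real variance propagation must track random and systematic components separately).

```python
# Sketch: MB_j = (I_{j-1} + T_in - T_out) - I_j and a simple propagated SD.
# Values and names are hypothetical; SDs are treated as independent here.
import math

def material_balance(begin_inventory, transfers_in, transfers_out, end_inventory):
    """Return MB = book inventory minus measured ending physical inventory."""
    book = begin_inventory + transfers_in - transfers_out
    return book - end_inventory

def sigma_mb(sds):
    """Propagate independent per-term measurement SDs in quadrature."""
    return math.sqrt(sum(s * s for s in sds))

mb = material_balance(100.0, 40.0, 35.0, 104.0)   # kg, hypothetical values
s = sigma_mb([0.5, 0.2, 0.2, 0.5])                # per-term SDs, hypothetical
print(mb, round(s, 3))                            # -> 1.0 0.762
```

An MB of 1.0 kg with σ_MB of about 0.76 kg would not, by itself, signal a statistically significant loss.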
In metrology for nuclear safeguards, the term "uncertainty" characterizes the dispersion of estimates of a quantity known as the measurand, which is typically the amount of NM (such as U or Pu) in an item. To measure the amount of NM, both destructive analysis (DA, in which a sample of item material is analyzed by mass spectrometry in an analytical chemistry laboratory) and nondestructive assay (NDA, in which an item is assayed by using a neutron or gamma detector) are used. NDA uses calibration and modeling to infer NM mass on the basis of radiation particles, such as neutrons and gammas emitted by the item and registered by detectors. For any measurement technique, one can use a first-principles, physics-based, or "bottom-up" approach to UQ by considering each key step and assumption of the particular method. Alternatively, one can take an empirical or "top-down" approach to UQ, for example, by comparing assay results on the same or similar items by multiple laboratories and/or calibration periods.
A well-known guide for bottom-up UQ is the Guide to the Expression of Uncertainty in Measurement (GUM, [3]). The GUM also briefly mentions top-down UQ in the context of applying analysis of variance (ANOVA, [4]) to data from interlaboratory studies. Although the GUM is useful, it is being revised because it has known limitations [5][6][7]. For example, the GUM provides little technical guidance regarding calibration as a type of bottom-up UQ or regarding top-down UQ [5][6][7][8]. The GUM also mixes Bayesian with non-Bayesian concepts. In a Bayesian approach, all quantities, including the true measurand value, are regarded as random. In a non-Bayesian (frequentist) approach, some quantities are regarded as random and other quantities, such as the true value of the measurand, are regarded as unknown constants. This paper uses both Bayesian and non-Bayesian concepts but specifies when each is in effect. For example, in the Bayesian approach to top-down UQ in Section 3, the true RSD values are regarded as being random.
In NDA safeguards applications, the facility operator declares the NM mass of each item. Then, some of those items are randomly selected for NDA verification measurement by inspectors. This is a challenging NDA application because often the detector is brought to the facility where ambient conditions can vary over time and because the items to be assayed are often heterogeneous in some way and/or are different from the items that were used to calibrate/validate and assess uncertainty in the NDA method. Because of such challenges, "dark uncertainty" [9] can be large, as is evident whenever bottom-up UQ predicts smaller measurement error RSDs than are observed in top-down UQ [1]. The RSD of an assay method is often defined as the reproducibility standard deviation as estimated in an interlaboratory comparison. As shown in Section 3, comparing NDA verification measurements to the operator's DA measurements can be regarded as a special case of an interlaboratory evaluation [10][11][12].
For top-down UQ applied to NM measurements of the same item by both the operator (often using DA) and the inspector (often using NDA), this paper describes an existing and a new approach to separately estimate operator and inspector systematic and random error variance components. Systematic and random error components must be separated because their modes of propagation are different (Section 4). Currently, random error variance estimates (from paired data) are based on Grubbs' estimator or variations of Grubbs' estimator, which was originally developed by Grubbs to estimate random error variance separately for each of the two methods applied to each of several items, without repetition of measurement by either method [13,14]. In Section 3, Grubbs' estimator, constrained versions of Grubbs' estimator, and a Bayesian alternative [7] are described; the Bayesian option easily allows for parameter constraints and prior information regarding the relative magnitudes of variance components to be exploited to improve top-down UQ.
This paper is organized as follows. Section 2 provides a background on bottom-up UQ for NDA, describes a gamma-based NDA example and a neutron-based NDA example, and illustrates why simulation is necessary for improved UQ for calibration data. Section 3 reviews currently used top-down UQ and describes a new Bayesian option [7] that applies approximate Bayesian computation. Section 4 provides a new simulation study assessing the sensitivity of estimated NM loss detection probabilities to estimation errors in measurement error RSDs. Section 5 concludes with a summary.

Bottom-Up UQ
For bottom-up UQ, the GUM [3] assumes that the measured value can be expressed using a measurand equation that relates input quantities (data collected during the measurement process and relevant fundamental nuclear data such as attenuation corrections) to the output (the final measurement value). The GUM's main technical tool is a first-order Taylor approximation applied to the measurand equation

Y = f(X_1, X_2, ..., X_N),    (1)

which relates the input quantities X_1, X_2, ..., X_N (regarded as random) to the measurand Y (also regarded as random). The input quantities can include estimates of other measurands or of calibration parameters, so (1) is quite general. The variance σ_{X_i}^2 of each X_i and any covariances σ_{X_i,X_j} between pairs of X's are then propagated using the Taylor approximation to obtain

σ_Y^2 ≈ Σ_{i=1}^N (∂f/∂X_i)^2 σ_{X_i}^2 + 2 Σ_{i<j} (∂f/∂X_i)(∂f/∂X_j) σ_{X_i,X_j}

(or using simulation if the Taylor approximation is not sufficiently accurate) to estimate the variance σ_Y^2 in Y.
According to (1), the estimated value of the measurand is a random variable, regardless of whether the left side of (1) is expressed as Y (as in a typical Bayesian approach) or as Ŷ (as in a typical non-Bayesian setting). The hat notation is a frequentist convention for denoting an estimator, so Ŷ is an estimate of Y, and μ (which is also denoted μ_T, where "T" denotes the true value) denotes the unknown true value of the measurand.
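The first-order ("delta method") propagation described above can be sketched numerically. This is a minimal illustration under stated assumptions: inputs are taken as uncorrelated (so only the diagonal terms of the propagation formula appear), the gradients are computed by forward differences, and the measurand equation Y = X_1 · X_2 is a hypothetical example, not one from the paper.

```python
# Sketch of GUM-style first-order variance propagation for Y = f(X1, ..., XN),
# assuming uncorrelated inputs; partial derivatives are taken numerically.
import math

def propagate(f, x, sd, h=1e-6):
    """Return (y, sigma_y) with sigma_y^2 ~ sum_i (df/dx_i)^2 * sd_i^2."""
    y = f(x)
    var = 0.0
    for i in range(len(x)):
        xp = list(x); xp[i] += h
        dfdx = (f(xp) - y) / h          # forward-difference partial derivative
        var += (dfdx * sd[i]) ** 2
    return y, math.sqrt(var)

f = lambda x: x[0] * x[1]               # hypothetical measurand equation
y, sy = propagate(f, [10.0, 2.0], [0.1, 0.05])
print(y, round(sy, 4))                  # -> 20.0 0.5385
```

For this product, the Taylor result is exact to first order: σ_Y^2 ≈ (2 × 0.1)^2 + (10 × 0.05)^2 = 0.29.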

Calibration UQ as an Example of Bottom-Up UQ.
Typically, calibration is performed using reference materials having nominal measurand values (known to within a relatively small uncertainty), and then, in the case of linear calibration, (1) can be reexpressed as Ŷ = β̂_0 + β̂_1 X, where Ŷ is the estimated measurand value, β̂_0 and β̂_1 are parameters estimated from calibration data, X is the net count rate (usually the net gamma or net neutron count rate in NDA; see Section 2.3), and the three inputs in mapping to (1) are X_1 = β̂_0, X_2 = β̂_1, and X_3 = X. The estimates β̂_0 and β̂_1 will vary in predictable ways (Sections 2.2 and 2.3) across repeats of the calibration.
The convention in the statistical literature to reverse the roles of X and Y from that in GUM's equation (1) will be followed here, so X denotes the quantity to be inferred (the measurand value) and Y denotes the detected net radiation count rate. Then, in the case of reverse regression (see below), (1) can be expressed as

X̂ = β̂_0 + β̂_1 Y,    (2)

identifying X_1 = β̂_0, X_2 = β̂_1, and X_3 = Y. Following calibration on data consisting of n (x, y) pairs (lowercase denotes the observed value of a random variable), the three "input quantities" X_1 = β̂_0, X_2 = β̂_1, and X_3 = Y have variances and covariances that can be estimated. However, in most applications of calibration in NDA, accurate estimation of these variances and covariances requires simulation, because analytical approximations as described in Section 2.2 have been shown to be inadequate (see Section 2.3).
Expressing (2) as X̂ = f(X_1, X_2, ..., X_N) = β̂_0 + β̂_1 Y indicates how the estimate X̂ is computed and how to assign systematic and random error variances to X̂. For example, and to introduce notation used in top-down UQ (Section 3), one could express the estimate as X̂ = μ + S + R, where μ denotes the true value of the measurand, S denotes systematic error due to estimation error in the fitted slope and intercept and/or due to correlations among the inputs, and R denotes random error. If there are no correlations among the inputs but only estimation error in the fitted slope and intercept during calibration, then expressions for the variances of S and R, denoted σ_S^2 and σ_R^2, respectively, can be given (Sections 2.3 and 3), which allow comparison between bottom-up UQ and top-down UQ. The GUM does not discuss calibration in much detail; instead, the GUM applies propagation of variance to the steps modeled in (1), which sometimes leads to a defensible estimate of the combined variances of S and R. The GUM does not attempt to separately estimate the variances of S and R, but such separation is needed in some applications, such as assigning an uncertainty to a sum of measurand estimates ([15] and Section 4).

Extension of Standard Regression Results to Calibration.
One way to express that the net count rate Y depends on the true measurand value X is

Y = β_0 + β_1 X + ε,    (3)

which is a typical model used in regression when there is negligible error in X. If errors in predictors cannot be ignored, (3) should be modified; however, one can still regress measured Y on measured X, so, in effect, (3) can be reexpressed as Y = β̃_0 + β̃_1 X + ε̃, where the tildes denote that the parameter values and the random error are different from those in (3). In inverse calibration, (3) is used, and one inverts the fitted model using β̂_0 and β̂_1 so that a future measured test value Y_test predicts enrichment using

X̂_test = (Y_test − β̂_0)/β̂_1,    (4)

which is regression followed by inversion. An alternative model to (3) is reverse calibration:

X = γ_0 + γ_1 Y + ε.    (5)

In reverse calibration, (5) expresses the measurand X as a function of the true net count rate Y, but in practice one must regress X on Y_measured = Y_true + e, where e is a random error. As an aside, this paper does not consider models with additional systematic errors such as Y = β_0 + β_1 X_true + c + ε or X = γ_0 + γ_1 Y_true + d + ε. Cheng and Van Ness [16] point out that any additive systematic errors in Y or X could be absorbed into β_0 and γ_0, respectively; however, any systematic errors in the X values used for calibration would remain a part of the total uncertainty.
Both inverse and reverse calibrations involve ratios of random variables, which can be problematic [7,17]. In inverse calibration, the solution in (4) involves division by the random variable β̂_1, which has a normal distribution under typical modeling assumptions. Williams [18] notes that X̂_test in (4) has infinite variance even if the expected value of β̂_1 is nonzero, due to division by a normal random variable [19], and hence has infinite mean squared error, while the reverse estimator has finite variance and mean squared error. In reverse calibration, the least squares solution γ̂_1 = y_cal^T x_cal / y_cal^T y_cal also involves division of random variables (x_cal is the vector of X values used in calibration and y_cal is the vector of mean-centered Y values used in calibration). Experience suggests that one can develop adequate approximations for the ratio of random variables when the ratio is almost certain to be far from infinity or zero [19]. Ignoring errors in predictors, [20] uses the following common approximation for the variance of the ratio of random variables u and v,

var(u/v) ≈ (E(u)/E(v))^2 [var(u)/E(u)^2 + var(v)/E(v)^2 − 2 cov(u, v)/(E(u)E(v))],    (6)

where E denotes expectation, to derive the approximation (for inverse calibration) for the variance due to uncertainty in the estimated calibration coefficients β̂_0 and β̂_1 and in the test measurement Y_test:

σ^2_{X̂_test} ≈ (σ_ε^2/β̂_1^2) [1 + 1/n + ỹ_test^2 / Σ_{i=1}^n ỹ_i^2],    (7)

where ỹ = y − ȳ and ȳ is the mean of the y values in the calibration data. To apply (7), β̂_1 and σ_ε^2 are estimated from the calibration data (assuming (3) in forward calibration, or the alternate version of (3), Y = β̃_0 + β̃_1 X + ε̃). Equation (7) is almost the same as the corresponding well-known result for prediction in regression; the only differences are the swapping of the roles of x and y and the appearance of β̂_1^2 in the denominator.
For reverse regression, [20] derives a corresponding approximation for var(X̂). Reference [20] also examined the long-term bias, approximately bias_inverse ≈ (μ − x̄) σ_ε^2 / (β_1^2 Σ_i (x_i − x̄)^2) for inverse calibration and bias_reverse ≈ −(μ − x̄)/(1 + β_1^2 σ_X^2/σ_ε^2) for reverse calibration, where μ is the true value of the test item, x̄ is the mean of the calibration x values, and σ_X^2 is the variance of the calibration x values. Notice that bias_inverse decreases as n increases (because Σ_i (x_i − x̄)^2 increases as n increases), but bias_reverse does not decrease as n increases; however, recall that, in NDA applications, n is small, usually 3 to 10.
A common summary performance measure of an estimator combines squared bias and repeatability variance, defined as RMSE = sqrt{E[(X̂ − μ)^2]} = sqrt{(repeatability variance) + (bias)^2}, where E denotes the expected value (i.e., the first moment of the underlying probability distribution) [7]. Some technical details arise regarding the best model fitting approach if the predictor is measured with nonnegligible error. In addition, there is controversy regarding the relative merits of inverse and reverse calibration [7,17,21,22]. Simulation can be used to choose between inverse and reverse calibration, because simulation provides accurate UQ (such as RMSE estimation) for both options. In simulations for NDA calibration, errors in the standard reference materials' nominal values (the X's) are usually small compared to errors in the instrument responses (the Y's), which are possibly adjusted by using adjustment factors that have uncertainty (see Section 2.3).
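The kind of simulation comparison described above can be sketched as follows. This is an illustrative sketch only: the true line (β_0 = 0, β_1 = 2), the error SD, the n = 5 reference values, and the test value are all hypothetical choices, and errors in the reference values themselves are neglected, consistent with the remark that they are usually small.

```python
# Simulation sketch comparing inverse calibration (regress Y on X, then invert)
# with reverse calibration (regress X on Y directly). All values hypothetical;
# small n mimics typical NDA calibration data sets.
import random, statistics

random.seed(1)
b0, b1, sd_e = 0.0, 2.0, 0.5       # true calibration line and error SD
xcal = [1.0, 2.0, 3.0, 4.0, 5.0]   # reference (standard) values, n = 5
mu_test = 3.5                      # true measurand value of the test item

def fit(xs, ys):
    """Ordinary least squares intercept and slope of ys regressed on xs."""
    xbar, ybar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

err_inv, err_rev = [], []
for _ in range(2000):
    ycal = [b0 + b1 * x + random.gauss(0, sd_e) for x in xcal]
    ytest = b0 + b1 * mu_test + random.gauss(0, sd_e)
    a0, a1 = fit(xcal, ycal)        # inverse: fit Y = a0 + a1*X, then invert
    err_inv.append((ytest - a0) / a1 - mu_test)
    c0, c1 = fit(ycal, xcal)        # reverse: fit X = c0 + c1*Y
    err_rev.append(c0 + c1 * ytest - mu_test)

rmse = lambda e: (sum(v * v for v in e) / len(e)) ** 0.5
print("inverse RMSE:", round(rmse(err_inv), 3))
print("reverse RMSE:", round(rmse(err_rev), 3))
```

Repeating such a simulation over the measurand range (as in Figure 1) is how the observed bias and RMSE curves for the two estimators are obtained.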

Summary of Recent NDA Examples.
Recent publications have used simulation to assess the adequacy of (7) in the context of NM measurements by gamma detection [7,17] and neutron detection [1,7,23,24].

Enrichment Meter Principle (EMP).
The EMP aims to infer the fraction of 235 U in U (enrichment, defined as atom percent of 235 U in an item) by measuring the count rate of the strongest-intensity direct (full-energy) gamma from decay of 235 U, which is emitted at 185.7 keV [7,25,26]. The EMP makes three key assumptions: (1) the detector field of view into each item is the same as that in the calibration items, (2) the item is homogeneous with respect to both 235 U enrichment and chemical composition, and (3) the container attenuation of gamma-rays is the same as or similar to that in the calibration items, so empirical correction factors have modest impact and are reasonably effective. If these three assumptions are approximately met, the enrichment of 235 U in the U is directly proportional to the count rate of the 185.7 keV gamma-rays emitted from the item. It has been shown empirically that, under good measurement conditions, the EMP can have a random error RSD of less than 0.5% and long-term bias of less than 1% relative to the true value, depending on the specific implementation of the EMP. Implementation details include features such as the detector resolution, stability, and extent of corrections needed to adjust items to calibration conditions. However, in some EMP applications, the random error RSD can be larger than bottom-up UQ predicts and larger than the 0.5% goal. For example, assay of the 235 U mass in a stratum of UO 2 drums suggests that there is a larger-than-anticipated random RSD [17].
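Under the three EMP assumptions, calibration reduces to a single proportionality constant, which the following toy sketch illustrates. The count rates and enrichments are hypothetical, and a real implementation would use several standards, net-rate extraction, and container attenuation corrections.

```python
# Toy enrichment meter principle: if the EMP assumptions hold, enrichment is
# proportional to the net 185.7 keV count rate, so one calibration factor
# suffices. All numbers below are hypothetical.
cal_rate, cal_enrich = 2500.0, 4.95   # net counts/s and wt% 235U for a standard
k = cal_enrich / cal_rate             # proportionality (calibration) constant
test_rate = 1515.0                    # measured net rate for an unknown item
print(round(k * test_rate, 3))        # inferred enrichment (wt% 235U) -> 3.0
```

Departures from any of the three assumptions (field of view, homogeneity, container attenuation) show up as item-specific biases that a single factor cannot capture.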

Uranium Neutron Coincidence Collar (UNCL).
The UNCL uses an active neutron source to induce fission in 235 U in fresh fuel assemblies [27]. Neutrons from fission are emitted in short bursts of time and so exhibit non-Poisson bursts in detected count rates. Neutron coincidence counting is used to measure the "doubles" neutron coincidence rate Y, which can be used to estimate the linear density X of 235 U in a fuel assembly (g 235 U/cm) using calibration parameters a_1 and a_2. The coincidence rate is the observed rate of detecting two neutrons in very short time gates, each of approximately 10^-6 sec, and is attributable to fission events. The equation commonly used to convert the measured doubles rate to an estimate of X (grams 235 U per cm) is X = kY/(a_1 − a_2 kY), where a_1 and a_2 are calibration parameters and k = k_0 k_1 k_2 k_3 k_4 k_5 is a product of correction factors that adjust to item-, detector-, and source-specific conditions in the calibration [27]. Therefore, X = kY/(a_1 − a_2 kY) is a special case of GUM's equation (1) (with X and Y reversed), where the two calibration parameters a_1 and a_2 and the six correction factors k_0, k_1, k_2, k_3, k_4, and k_5 are among the X's in (1). Reference [23] showed that calibration is most effective (leading to smallest RMSE in X̂) if there is no adjustment for errors in the predictor and that errors in k_0, k_1, k_2, k_3, k_4, and k_5 in k = k_0 k_1 k_2 k_3 k_4 k_5 should be included in synthetic calibration data. Note that, by working with 1/X and 1/Y, one can convert X = kY/(a_1 − a_2 kY) to a model that is linear in the transformed predictor 1/Y.

Main Results

(1) If possible, both classical (see (3)) and reverse (see (5)) regression methods should be compared; however, reverse regression tends to do either as well as or better than classical regression. Analytical approximations such as (7) have been shown not to be sufficiently adequate in some settings, so simulation is recommended to compare classical and reverse regression and to estimate variance components in X̂ = μ + S + R (Section 3).

(2) Error sources that are expected to be present in test measurements, such as container thickness measurements, can be simulated in synthetic calibration data. Such error sources often lead to item-specific biases.
(3) If reverse regression is used, then there is no need to adjust for errors in the predictors in (3). If inverse regression is used, then it is better to adjust for errors in predictors.
(4) Figure 1 plots (a) the observed and predicted bias and (b) the observed and predicted RMSE in a generic NDA example involving either gamma or neutron counting. It is not well known that calibration applications lead to bias, and [7,17] showed that the bias cannot be easily removed, because measurement errors obscure the true measurand value and hence the true bias. Note in Figure 1(a) that the observed bias (in simulated data) is not in close agreement with the predicted bias, which is obtained from the expressions in Section 2.2. Therefore, long-term bias should be estimated using simulation rather than by relying on the approximate analytical expressions. Figure 1(b) illustrates that the observed RMSE is not well predicted by the expressions in Section 2.2, so, again, simulation is needed for adequate estimation of the RMSE. Note that the smallest RMSE is for reverse regression. Burr et al. [7,17] show that reverse regression tends to have smaller RMSE than inverse regression but that if inverse regression is used, then methods to adjust for errors in predictors should be used.

Top-Down UQ to Estimate Variance Components
In facilities under international safeguards agreements, inspectors measure randomly selected items to monitor for possible data falsification by the operator that could mask NM diversion [1,28]. These paired (O, I) data (O denotes an operator measurement; I denotes an inspector measurement) are assessed using one-item-at-a-time testing to detect significant differences and also by using an average of the operator-inspector values to detect trends in the context of material balance evaluation [1]. Conclusions from such an assessment depend on the assumed measurement error model and associated random and systematic uncertainty components, so it is important to perform effective UQ [1,7,17,28]. The paired (O, I) data are collected during relatively short (one week) inspections that occur once or a few times per year, and then several years of paired (O, I) inspection data are included in top-down UQ. The measurement error model must account for variation within and between groups, where, in this context, a group is an inspection period. A typical top-down model used for additive errors for the inspector I (and similarly for the operator O) is

I_ij = μ_ij + R_ij + S_i,    (8)

where I_ij is the inspector's measured value of item j (from 1 to n) in group i (from 1 to g), μ_ij is the true but unknown value of item j from group i, R_ij ∼ N(0, σ_R^2) is a random error on item j from group i, and S_i ∼ N(0, σ_S^2) is a short-term systematic error in group i [28]. The error variance components σ_R^2 and σ_S^2 can be estimated using a specialized version of random-effects one-way ANOVA described in Section 3.1. NDA measurements often have larger uncertainty at larger true values, which implies a multiplicative rather than an additive error model. However, provided that the individual RSDs are fairly small, resulting in a total RSD of approximately 10% or less, a multiplicative error model such as I_ij = μ_ij (1 + S_i + R_ij) can be analyzed in the same manner as an additive error model, by analyzing on the log scale [1,2].
Therefore, for brevity, only an additive error model such as in (8) is presented here. Bonner et al. [1] provide new expressions for a multiplicative error model that should be used if the total RSD is approximately 10% or larger. Note that one could write (8) in more cluttered notation as I_ij = μ_ij + R_ij + S_i with the right side expanded in terms of inputs; that is, I_ij is the inspector's measured value of item j, which is obtained using various inputs, denoted with X's on the right side of (1). And one could also consider other error models, such as error models that allow for nonconstant absolute or relative random and/or systematic SD [28,29].
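The claim that a small-RSD multiplicative model behaves additively on the log scale can be checked numerically. This sketch uses hypothetical SDs (total RSD about 4.5%): it verifies that log Y − log μ differs from S + R only by a second-order term, since log(1 + x) ≈ x for small x.

```python
# Quick check: for Y = mu*(1 + S + R) with small RSDs, log(Y) - log(mu) is
# close to S + R, so the log scale recovers an additive error model.
# Parameter values are hypothetical.
import math, random

random.seed(7)
mu, sd_s, sd_r = 100.0, 0.02, 0.04
logdev = []
for _ in range(5000):
    s, r = random.gauss(0, sd_s), random.gauss(0, sd_r)
    y = mu * (1 + s + r)
    logdev.append(math.log(y) - math.log(mu) - (s + r))  # ~ -(s+r)^2/2, tiny
print(max(abs(d) for d in logdev) < 0.05)
```

At a total RSD near 10% the neglected second-order term grows, which is why [1] gives exact multiplicative-model expressions for that regime.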
The GUM [3] briefly describes ANOVA in the context of top-down UQ using measurement results from multiple laboratories and/or assay methods to measure the same measurand; however, the GUM is mostly concerned with bottom-up UQ. The GUM does not explicitly present any measurement error models such as (8) but only considers the model for the measurand, (1). However, the GUM implicitly endorses the notion of a measurement error model (or "observation equation," [14]) such as (8) in its top-down UQ. Note that if total measurement error is partitioned into random and systematic components, then the variance of a sum of n measured NM values (which is often needed in safeguards assessments; see Section 4) is n^2 σ_S^2 + n σ_R^2 for an additive model such as (8).
To illustrate, Figure 2 plots d = O − I data simulated from (8) with n = 10 items in each of g = 5 groups, operator SDs σ_R = 1 and σ_S = 0.1, inspector SDs σ_R = 3 and σ_S = 1, true value μ_True = 100, and standard deviation of the true values σ_μ = 1. The horizontal lines depict the five group means. Equation (8) implies that there is a need to partition the error variance of d into "between" (B) and "within" (W) group components.

Grubbs' Estimator as an Example of Top-Down UQ to Estimate σ_RO^2, σ_SO^2, σ_RI^2, σ_SI^2.

Standard random-effects ANOVA [4] requires repeated measurements on some items in order to estimate σ_R^2 and then σ_S^2 in data assumed to be produced by a model such as (8). However, for most (O, I) data, repeated measurements of the same item are not available, so this section describes Grubbs' estimator. Grubbs' estimator was developed for situations in which more than one measurement method is applied to each of multiple test items (which may contain different material amounts), but there is no replication of measurements by any of the methods. Grubbs' estimator will be described for additive measurement error models; a new version of Grubbs' estimator for multiplicative error models is described in [1]. Note that the variance σ_R^2 of the random error component includes the effects of "item-specific" bias (see Section 2, [7,28]), which could not be estimated if the available data were only from repeated measurements of the same or very similar items. Note also that Grubbs' estimator does not consider the possibility of falsification by the operator, so it is intended to be applied to paired (O, I) data that show no evidence of falsification.
The basis of Grubbs' estimator within one group to estimate σ_RO^2 and σ_RI^2 is that the covariance between operator and inspector measurements equals the variance of the true item masses, σ_μ^2, while the variance of I, conditional on the value of the group systematic error, is given by σ_I^2 = σ_μ^2 + σ_RI^2. Therefore, the sample covariance within a single inspection period between operator and inspector measurements can be subtracted from the sample variance of the inspector measurements to estimate σ_RI^2 (and similarly for estimating σ_RO^2). That is, within one inspection period (group), with lowercase (o, i) denoting the observed values of (O, I), Grubbs' estimator is given by

σ̂_RI^2 = Σ_{j=1}^n (i_j − ī)^2/(n − 1) − Σ_{j=1}^n (o_j − ō)(i_j − ī)/(n − 1).    (10)

The estimates from (10) from each of the g groups are averaged to get the final estimate of the inspector's random error variance, and similarly, the estimate of σ_μ^2 is the average of the sample covariances σ̂_μ^2 = Σ_{j=1}^n (o_j − ō)(i_j − ī)/(n − 1) computed within each group.
To estimate σ_S^2 in (8), a minor extension of standard random-effects ANOVA to account for σ_μ^2 shows that σ_S^2 can be estimated as the sample variance of the g group means minus (σ̂_R^2 + σ̂_μ^2)/n. There is no guarantee that σ̂_R^2 or σ̂_S^2 is nonnegative, but the corresponding true quantities are nonnegative (i.e., σ_R^2 ≥ 0 and σ_S^2 ≥ 0), so constrained versions of Grubbs-based and ANOVA-based estimators can be used (Section 3.2, [7,30]).
Grubbs showed (and simulation in R [25] also verified) that his estimator for σ_RI^2 has variance

σ^2_{σ̂_RI^2} = 2σ_RI^4/(n − 1) + (1/(n − 1))(σ_μ^2 σ_RI^2 + σ_μ^2 σ_RO^2 + σ_RI^2 σ_RO^2),

which is relatively large, particularly if σ_μ^2 is comparable in magnitude to σ_RI^2 [7] and/or n is small, so [13] proposed an option to mitigate the impact of σ_μ^2 on this variance. The method in [13] relied on knowing the value, or approximate value, of the ratio σ_RO^2/σ_RI^2 and studied the sensitivity to misspecifying that ratio. The Bayesian option in Section 3.2 specifies a probability distribution for σ_RO^2/σ_RI^2 prior to observing the (O, I) data, as an example of enforcing nonnegativity constraints and including prior information, such as information from bottom-up UQ regarding σ_μ^2, σ_RI^2, and/or σ_SI^2, and similarly for the operator or for variance ratios such as σ_RO^2/σ_RI^2. A measurement error model that can be used in top-down UQ for the type of data in Figure 2 was given in (8). In the case of inverse calibration as a type of bottom-up UQ, one can modify the top-down error model of (8) to

I = μ + S_1 + S_2 + R,

where R ∼ N(0, σ_R^2) (σ_R^2 is a random error variance that includes the effects of errors in Y_test); S_1 = β̂_0 − β_0 (an additive subcomponent of systematic error; see below); and S_2 = (β̂_1 − β_1)(Y_test − Ȳ) (a multiplicative subcomponent of systematic error).
Assume that inverse regression is performed using mean-centered data (x̃ = x − x̄_train and ỹ = y − ȳ_train), so cov(β̂_0, β̂_1) = 0, which simplifies interpretation. The equation S_1 = β̂_0 − β_0 is then interpreted to mean that S_1 has mean zero and variance σ_ε^2/n. Note that one cannot use paired (O, I) data to estimate σ_{S_1}^2 and σ_{S_2}^2 separately, because the effects of S_1 and S_2 are confounded. However, simulation such as that used to produce Figure 1 provides a way to perform a bottom-up prediction of the top-down SD (or RSD) that can be compared to the SD (or RSD) observed in top-down evaluations. If the bottom-up predicted SD agrees well with that observed in top-down (O, I) data, then bottom-up UQ via simulation (needed because analytical approximations such as those plotted in Figure 1 in the calibration context have been shown to be inadequate) is corroborated.

New Bayesian Approach to Grubbs-Type Estimation.
Recall that the variance of Grubbs' estimator can be large, so [7,31] review alternatives to Grubbs' estimator based on constrained optimization, such as Jaech's [31] constrained maximum likelihood estimator (CMLE), which assumes that the random and systematic measurement errors are normally distributed. Also, although the impact of σ_μ^2 on Grubbs' estimator can be relatively large, versions of Grubbs' estimator that are constrained so that σ̂_RO^2 + σ̂_RI^2 = σ̂_d^2, where σ̂_d^2 is the sample variance of the differences d = O − I, have exhibited lower RMSE than Grubbs' estimator in limited testing to date. Any estimator of σ_RO^2, σ_SO^2, σ_RI^2, and σ_SI^2 must be accompanied by its respective uncertainty. The uncertainty in CMLE or constrained least squares estimators [32] can be approximated using approximate analytical and asymptotic results or by resampling methods such as the bootstrap. The quality of such approximations is not yet known, so a Bayesian alternative is presented here that does not rely on such approximations or on the bootstrap for assessing uncertainty in σ̂_RO^2, σ̂_SO^2, σ̂_RI^2, or σ̂_SI^2.
Another reason to consider the Bayesian alternative is that Grubbs' estimator exhibits dependence on the underlying measurement error distribution. Figure 3 plots the estimated probability density of Grubbs' estimator σ̂_RI when the underlying random error distribution is the normal, the lognormal, or the generalized lambda with thin or thick tails, for the relatively large sample size of 5 groups and 10 measurements per group. The probability density of σ̂_RI was estimated using a kernel-density estimator in R [25]. There is a relatively large uncertainty in σ̂_RI (Section 4 evaluates one impact of uncertainty in estimated RSDs), with an RSD in σ̂_RI of 11%, 8%, 13%, and 37%, respectively, for the four distributions in Figure 3, and an RSD in σ̂_SI of approximately 50% for all four distributions; also, for 2 groups and 5 measurements per group, the RSD in σ̂_RI is approximately 50% for all four distributions. One implication of Figure 3 is that the uncertainties in σ̂_RO^2, σ̂_SO^2, σ̂_RI^2, and σ̂_SI^2 depend on the error distributions. An advantage of the Bayesian option is that the width of the Bayesian posterior adjusts to accommodate nonnormal underlying distributions [7].
One option to improve Grubbs' estimator is to impose constraints, such as σ_O ≤ σ_I, or, more flexibly, to assign a prior probability to the ratio σ_O/σ_I that puts most of the probability on values less than 1. The uncertainty in constrained estimators is simple to estimate in a Bayesian approach and is often difficult to estimate in a non-Bayesian framework. Within an inspection group, for fixed S_O and S_I, the paired (O, I) data from (8) have a bivariate distribution with covariance matrix

Σ = [ σ_μ^2 + σ_O^2    σ_μ^2 ; σ_μ^2    σ_μ^2 + σ_I^2 ],

and [33] provided a Bayesian approach to estimate σ_μ^2, σ_O^2, and σ_I^2, assuming a bivariate normal likelihood, without imposing constraints on any of the variance ratios. In principle, one could extend the Bayesian approach in [33] to allow for a nonnormal likelihood and/or to allow for constraints on any of the variance ratios. In practice, such extensions are technically difficult and rarely attempted.
In any Bayesian approach, prior information regarding the sizes or relative sizes of σ_μ^2, σ_O^2, and σ_I^2 must be provided. If the prior is "conjugate" for the likelihood, then the posterior is in the same distributional family as the prior, in which case analytical methods are available to compute posterior prediction intervals for quantities of interest. Modern Bayesian methods do not rely on conjugate priors but instead use numerical methods to obtain samples of σ_μ^2, σ_O^2, and σ_I^2 from their approximate posterior distributions, so a wide variety of priors and likelihoods can be accommodated [34]. For numerical methods such as Markov chain Monte Carlo, the user must specify a prior distribution for σ_μ^2, σ_O^2, and σ_I^2 and a likelihood (which need not be normal). The Bayesian approach presented next is approximate Bayesian computation (ABC), which does not require a known likelihood for the data and can accommodate constraints on variances and/or ratios of variances through the choice of prior distributions.
The "output" of any Bayesian analysis is the posterior distribution, so the output of ABC is an estimate of the posterior distributions of σ²_μ, σ²_O, and σ²_I. No matter what type of Bayesian approach is used, a well-calibrated Bayesian approach satisfies several requirements. The requirement of interest here is that, in repeated applications of ABC, the middle 95% of the posterior distribution for each of σ²_μ, σ²_O, and σ²_I should contain the respective true value approximately 95% of the time.
In ABC, one assumes that a model M has input parameters θ and outputs data X = M(θ) (M for "model") and that there is corresponding observed real data X_obs. Here, the model is (8), which specifies how to generate synthetic O and I data and therefore implicitly defines a likelihood; however, the true likelihood used to generate the data need not be known to the user. Synthetic data are generated from the model for many trial values of θ, and trial values are accepted as contributing to the estimated posterior distribution for θ | X_obs if the distance d(X_obs, M(θ)) between X_obs and X = M(θ) is reasonably small. In most applications, it is necessary to reduce the dimension of X_obs to a small set of summary statistics S and instead accept trial values of θ if d(S(X_obs), S(M(θ))) < ε, where ε is a user-chosen threshold. Here, X_obs is the paired (O, I) data in each inspection group, and S(X_obs) includes within- and between-group sums of squares and the within-group covariance between O and I. The quantities σ̂²_μ, σ̂²_RO, σ̂²_SO, σ̂²_RI, and σ̂²_SI are therefore good choices for summary statistics for ABC. Because trial values of θ are accepted whenever d(S(X_obs), S(M(θ))) < ε, an approximation error to the posterior distribution arises, which several ABC options attempt to mitigate; such options involve weighting the accepted θ values by the actual distance d(S(X_obs), S(M(θ))) (abctools in R [25]). As an aside, if the error model is multiplicative rather than additive, expressions given in [1] can be used as summary statistics.
To summarize, ABC consists of three steps: (1) sample parameter values θ from their prior distribution p_prior(θ); (2) for each sampled value of θ in step (1), simulate data from (8); (3) accept a fraction of the sampled prior values from step (1) by checking whether the summary statistics computed from the data in step (2) satisfy d(S(X_obs), S(M(θ))) < ε. If desired, to improve the approximation to the posterior, adjust the accepted θ values on the basis of the actual d(S(X_obs), S(M(θ))) value. ABC requires the user to make three choices: the summary statistics, the threshold ε, and the measure of distance d. Reference [35] introduced a method to choose summary statistics that uses the estimated posterior means of the parameters from pilot simulation runs. Reference [36] used an estimate of the change in the posterior p(θ | X) when candidate summary statistics are added to the current set of summary statistics. Reference [37] illustrated a method to evaluate whether a candidate set of summary statistics leads to a well-calibrated posterior in the same sense that is used in this paper; that is, nominal posterior probability intervals should have approximately the same actual coverage probability.
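The three steps above can be sketched for a deliberately simplified one-parameter model (a single random-error standard deviation, a uniform prior, and the sample standard deviation as the lone summary statistic; all numeric choices are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(2)

def summarize(x):
    # Single summary statistic: the sample standard deviation.
    return np.std(x, ddof=1)

# "Observed" data; the true sd (1.5) is known only to this simulator.
x_obs = rng.normal(0.0, 1.5, size=50)
s_obs = summarize(x_obs)

# Step 1: sample trial sd values from a wide uniform prior.
n_trials = 20_000
sigma_trial = rng.uniform(0.1, 5.0, size=n_trials)

# Step 2: simulate a same-sized data set from the model for each trial value.
sims = rng.normal(0.0, 1.0, size=(n_trials, 50)) * sigma_trial[:, None]
s_sim = np.std(sims, axis=1, ddof=1)

# Step 3: accept the 1% of trials whose summaries are closest to s_obs.
dist = np.abs(s_sim - s_obs)
accepted = sigma_trial[dist <= np.quantile(dist, 0.01)]
posterior_mean = accepted.mean()
```

The accepted values approximate the posterior for the error standard deviation; a kernel-density estimate of them plays the role of the posterior densities discussed in the text.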
To illustrate the application of ABC to top-down UQ, simulations using (8) were performed. Recall that an additive model is a reasonable approximation to a multiplicative model if the error variances are small and the data are analyzed on a log scale, or if there are effects in addition to calibration which impact the random and systematic errors, such as unadjusted random changes in background count rates. Simulations from a multiplicative error model were also investigated, with good results similar to those described next for the additive model; the summary statistics that allow ABC to use Grubbs-type estimators for a multiplicative model are given in [1,7].
The simulations were performed in R in three steps. First, ABC requires a training collection of parameter values and corresponding summary statistics from each of many simulations, so in each of 10^5 simulations, the values of the five parameters σ_μ, σ_RO, σ_SO, σ_RI, and σ_SI were sampled from their respective prior probability densities. Second, (8) was used to generate O and I data. Third, the expressions for σ̂²_μ, σ̂²_RO, σ̂²_SO, σ̂²_RI, and σ̂²_SI given above were used as summary statistics, resulting in a parameter matrix and a corresponding summary statistics matrix, each having 10^5 rows and 5 columns. Then, in a separate set of 10^5 test simulations, the same first and second steps were repeated, and for each simulated set of parameters and summary statistics, the training parameter and summary statistics matrices were used in the abc function in the abctools package, with an acceptance fraction of 0.01 (meaning that 1% of the trial values for the true parameters were accepted), to produce an approximate posterior for each of the five parameters. This posterior can be used to assess the actual coverage probability, for example, of an interval that contains 95% of the posterior probability.
It was found [7] that the actual coverages for σ_RO, σ_SO, σ_RI, and σ_SI were essentially the same (to within simulation uncertainty) as the nominal coverages at 90%, 95%, and 99% probability, for a normal distribution and for all of the nonnormal distributions investigated (uniform, gamma, lognormal, beta, t, and generalized lambda with thick or thin tails), each having the same σ_RO, σ_SO, σ_RI, and σ_SI values. In addition, when a normal distribution was used to train ABC and any of the evaluated nonnormal distributions was used to test ABC, very nearly the same actual coverages (to within approximately ±0.01) were obtained. As another check of robustness, one prior distribution was used for training ABC and a prior wider by a factor of 1.3 was used in testing; in that case, the actual coverages dropped from approximately 0.95 to approximately 0.90, so this implementation is less robust to using a wider prior in testing than in training. Also, the RMSE of the ABC estimator was compared to that of each of several non-Bayesian constrained estimators that are currently being evaluated. Not surprisingly, provided that the prior used in training was approximately the same as that used in testing, the ABC estimator had the lowest RMSE. The intent here is not to make general RMSE performance claims; instead, these results provide a second indication that the ABC implementation is well calibrated, in the sense that if the assumed prior equals the true prior, then the RMSE of the ABC estimator is highly competitive with that of the non-Bayesian estimators, and in the sense that the width of the posterior accurately describes uncertainty. Finally, a well-calibrated Bayesian analysis allows one to evaluate the effect of sample size on the width of the posterior. The priors used for Figure 4 are wide, with no relative information regarding variance ratios assumed, and the parameter σ_μ is assigned a prior with a mean value of 0.10 and a relative standard deviation of 10%.
Figure 4 shows that, in this case, having only g = 3 groups and n = 3 measurements per group does not lead to a narrow posterior, but with g = 10 and n = 10 the posterior is fairly narrow.

Application of RSD Estimates: Statistical Testing of Materials Accounting Data

Sequential Statistical Testing of MB Sequences.
Recall that NMA evaluates one or more MBs, where the MB for balance period j is defined as MB_j = (I_{j−1} + T_in,j − T_out,j) − I_j [38,39]. Typically, many measurements are combined to estimate the terms T_in, I_begin, T_out, and I_end in the MB; therefore, the central limit effect and years of experience suggest that MBs in most facilities will be approximately normally distributed, with mean equal to the true NM loss μ and standard deviation σ_MB, expressed as MB ∼ N(μ, σ_MB), where the notation σ_MB is a shortened version of σ_MB,j. A sequence of n MBs will be assumed to have approximately a multivariate normal distribution, MB ∼ N(μ, Σ) [38-43].
The magnitude of σ_MB determines the amount of NM loss μ = Σ_{j=1}^n μ_j that can be detected with high detection probability (DP). For example, suppose the facility tests for NM loss only, not for NM gain, and assume that MB ∼ N(μ, σ_MB) is an adequate model. Then, if a false alarm probability (FAP) of α = 0.05 is desired, the alarm threshold is 1.65σ_MB. In the case of testing for loss only, it follows that the loss detection probability 1 − β = 0.95 for μ = 3.3σ_MB and 1 − β > 0.95 if μ > 3.3σ_MB, where β is the nondetection (false negative) probability. The factor 3.3 arises from the symmetry of the Gaussian distribution, the requirement α = β = 0.05, and the fact that 1.65 = 3.3/2 is the 0.95 quantile of the N(0, 1) distribution. One common goal is for the DP = 1 − β to be at least 0.95 if μ ≥ 1 SQ (significant quantity, which is 8 kg for Pu), which is accomplished if σ_MB ≤ SQ/3.3. If σ_MB > SQ/3.3, this can be mitigated by reducing measurement errors to achieve σ_MB ≤ SQ/3.3 (if feasible) and/or by closing the balances more frequently, so that less nuclear material is transferred per balance period, which reduces σ_MB [38,39]. The DP of other safeguards measures, such as enhanced containment and surveillance with smart cameras and/or remote radiation detection, is difficult to quantify and is outside the scope of this paper. This section concludes with four remarks.

Remark 1. Large-throughput facilities try to make σ_MB as small as reasonably possible and often try to keep σ_MB small as a percentage of throughput but cannot achieve σ_MB ≤ SQ/3.3. For example, suppose that the measurement error relative standard deviation is 0.5% of throughput, that the FAP/DP goals are α = 0.05 and 1 − β = 0.95, and that annual throughput is 100 SQ. Then σ_MB = 0.5 SQ = SQ/2 > SQ/3.3, so protracted diversion of 1 SQ over one year will not have a high DP. Therefore, one reason for frequent MB accounting is that abrupt diversion over hours or days is very likely to be detected [40].
As a complementary approach that is beyond the scope here, process monitoring [39] methods have the potential to detect off-normal facility operation that could misdirect NM to undeclared locations.
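The 3.3σ arithmetic above, and the large-throughput example in Remark 1, can be checked numerically; a minimal sketch (the function names are ours, and the z value 1.645 approximates the 0.95 quantile of N(0, 1)):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def detection_probability(loss, sigma_mb):
    """DP of the one-sided loss test: alarm if MB > 1.645 * sigma_MB
    (FAP alpha = 0.05), with MB ~ N(loss, sigma_MB)."""
    return 1.0 - phi(1.645 - loss / sigma_mb)

# A loss of 3.29 sigma gives a DP of about 0.95: the "3.3 sigma" rule.
dp_goal = detection_probability(loss=3.29, sigma_mb=1.0)

# Remark 1 example: sigma_MB = 0.5 SQ, protracted loss of 1 SQ in a year,
# giving a DP well below the 0.95 goal.
dp_protracted = detection_probability(loss=1.0, sigma_mb=0.5)
```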

Remark 2.
In the 1980s, some believed that a plant reporting a MB every 30 days would have a higher DP than the same plant reporting a MB every year. However, [44] showed that, for protracted diversion that is optimal from the diverter's viewpoint (with the per-period loss proportional to the row sums of the covariance matrix Σ of the MB sequence), annual MB testing has a higher DP than monthly MB testing, so, for such protracted diversion, "less frequent balance closure is better." However, [44] conceded that near-real-time accounting (NRTA) has shorter detection times and a higher DP against abrupt diversion. Publications [45-50] soon followed involving joint sequential tests: one tuned to detect protracted diversion and one tuned for abrupt diversion. Such joint Page's tests can be tuned to have a high DP against abrupt loss while still having a reasonably high DP against protracted loss. Two types of combined Page's tests are included in the simulation study in Section 4.3.
Remark 3. The simulation study in Section 4.3 includes a sensitivity study to assess the impact of the estimation error in Σ̂ on the estimated false alarm probabilities (FAPs) and detection probabilities (DPs).

Remark 4. This section focuses on operator MB sequences.
In international safeguards, the inspector randomly selects items to verify NM declarations made by the operator. The difference statistic D = (N/n) Σ_{j=1}^n (O_j − I_j), defined as the average difference between operator declarations (almost always based on operator measurements) and inspector measurements in the sample of size n, extrapolated to the population of size N, can be used as a test statistic, or the D statistic [2] can be used to compute the inspector MB statistic (or sequence), inspector MB = MB − D, which could be analyzed using sequential statistical methods such as those in Section 4.3. Alternatively, one can individually test each of the paired differences d_j = O_j − I_j along with the overall D statistic, and if none are found to be statistically significant, then the IAEA could rely on operator MB evaluation as in Section 4.3.
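A minimal sketch of the D statistic as defined above (the item values and stratum size are hypothetical):

```python
import numpy as np

def d_statistic(operator, inspector, population_size):
    """D statistic: mean operator-minus-inspector difference over the
    verification sample of size n, extrapolated to the N-item stratum,
    i.e., D = (N / n) * sum(O_j - I_j)."""
    o = np.asarray(operator, float)
    i = np.asarray(inspector, float)
    return population_size * float(np.mean(o - i))

# Hypothetical stratum of N = 100 items, n = 5 verified by the inspector.
d = d_statistic([10.2, 9.9, 10.1, 10.0, 9.9],
                [10.0, 10.0, 10.0, 10.0, 10.0], 100)
```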

Propagation of Variance to Estimate Σ.
Estimating Σ is a key step required in frequent NMA. To illustrate, a simplified model of a generic electrochemical facility with one input stream, one output stream, and one key inventory item is used here [38]. Each measurement method is modeled using a multiplicative measurement error model for the operator (O): O_j = μ_j(1 + S_j + R_j), with R_j ∼ N(0, σ²_R) and S_j ∼ N(0, σ²_S), where O_j is the operator's measured value of item j, μ_j is the true but unknown value of item j, R_j is a random error of item j, and S_j is a short-term systematic error for item j. The diagonal terms of Σ are then calculated as

σ²_j = T²_in,j(σ²_R,in + σ²_S,in) + T²_out,j(σ²_R,out + σ²_S,out) + inventory terms,

and the off-diagonal terms of Σ are calculated as

σ_jk = T_in,j T_in,k σ²_S,in + T_out,j T_out,k σ²_S,out + inventory terms,

where, in the inventory terms, the random error of an inventory measurement contributes to σ_jk only when the two balance periods share that inventory measurement (adjacent periods, |j − k| = 1). Reference [31] showed that the POV results for σ²_j and σ_jk are obtained by appropriate interpretation of GUM's equation (1) in Section 2. For this simplified version of an example from [38], this leads to example 12-by-12 covariance matrices for monthly MBs over a one-year period; three example matrices Σ_1, Σ_2, and Σ_3 are described next.
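Under the simplified model above, the propagation of variance can be sketched as follows for constant per-period transfers and inventory; the function name, the adjacency handling, and the example values are illustrative assumptions (with constant inventory, the short-term systematic inventory errors cancel within each balance and are omitted):

```python
import numpy as np

def mb_covariance(T_in, T_out, inv, s2_r_in, s2_s_in,
                  s2_r_out, s2_s_out, s2_r_inv, n_periods=12):
    """Covariance matrix of the MB sequence for
    MB_j = (I_{j-1} + T_in,j - T_out,j) - I_j, with constant per-period
    transfers and inventory; the s2_* arguments are relative (RSD^2)
    error variances of the multiplicative model."""
    S = np.zeros((n_periods, n_periods))
    for j in range(n_periods):
        # Diagonal: random + systematic transfer errors, plus the random
        # errors of the two inventory measurements bounding period j.
        S[j, j] = (T_in**2 * (s2_r_in + s2_s_in)
                   + T_out**2 * (s2_r_out + s2_s_out)
                   + 2.0 * inv**2 * s2_r_inv)
        for k in range(j + 1, n_periods):
            # Off-diagonal: short-term systematic errors shared by both
            # periods' transfer measurements ...
            S[j, k] = T_in**2 * s2_s_in + T_out**2 * s2_s_out
            # ... and, for adjacent periods only, the shared inventory
            # measurement, which enters the two balances with opposite signs.
            if k == j + 1:
                S[j, k] -= inv**2 * s2_r_inv
            S[k, j] = S[j, k]
    return S

# Illustrative values: 4 kg transferred in and out per period, 40 kg
# inventory, and 1% RSDs (relative variance 1e-4) for every component.
Sigma = mb_covariance(4.0, 4.0, 40.0, 1e-4, 1e-4, 1e-4, 1e-4, 1e-4)
```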
Three example covariance matrices, Σ_1, Σ_2, and Σ_3, for a generic electrochemical facility are given in [38]. Equation (17) is Σ_3 scaled to unit variance; Σ_3 corresponds to the case with 4 kg input per period and 40 kg inventory, with the measurement error RSD increased from 0.01 to 0.02.

Sequential Testing of Material Balances in Nuclear Material Accountancy.
The assumption (MB_1, MB_2, ..., MB_n) ∼ N(μ, Σ) implies that Y = Σ^{−1/2} MB ∼ N(Σ^{−1/2}μ, I), where I is the identity matrix. The transform Y = Σ^{−1/2} MB is known in safeguards as the standardized independently transformed MUF (SITMUF, where MUF is another name for the MB), which is most conveniently computed using the Cholesky decomposition [43]. There are two main advantages of applying statistical tests to Y rather than to MB. First, alarm thresholds for Y depend only on the sequence length and not on the form of the covariance matrix Σ; because it is best to calculate thresholds using simulation, this is a logistic advantage. Second, the conditional variance of the SITMUF sequence decreases over time, so if a diversion occurs late in the analysis period, the DP is larger for the Y sequence than for the MB sequence. Note that one cannot claim a higher DP for the Y sequence than for the MB sequence in general, because the true loss scenario is never known, and the DP can be larger for MB than for Y for some loss scenarios, which is demonstrated in Section 4.
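The Cholesky-based SITMUF transform can be sketched and checked on simulated data (the 3-period covariance matrix is an arbitrary illustrative choice):

```python
import numpy as np

def sitmuf(mb, Sigma):
    """SITMUF: Y = L^{-1} MB with Sigma = L L' (Cholesky factorization),
    so that under MB ~ N(0, Sigma) the transformed sequence Y has
    identity covariance (independent, unit-variance residuals)."""
    L = np.linalg.cholesky(Sigma)
    return np.linalg.solve(L, np.asarray(mb, float))

# Assumed 3-period covariance matrix for illustration.
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.5, 0.4],
                  [0.1, 0.4, 1.0]])
rng = np.random.default_rng(3)
mb_draws = rng.multivariate_normal(np.zeros(3), Sigma, size=50_000)

# Vectorized transform of all simulated MB sequences at once; the
# empirical covariance of y is approximately the identity matrix.
L = np.linalg.cholesky(Sigma)
y = np.linalg.solve(L, mb_draws.T).T
```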
Several reasonable statistical tests have been evaluated in [38,41,44-51] and are included in the simulation study in Section 4.4, which compares 13 tests, including CUMUF (CUMUF_j is the sum of all MUF values from period 1 to j), Page's test applied to the MUF sequence, and Page's test applied to the SITMUF sequence. Two of the tests deserve explicit description:

(8) GEMUF: it has been shown that if the loss vector μ is known, then the Neyman-Pearson test statistic is μ'Σ^{−1}x, which is known as a matched filter in some literature. The GEMUF statistic substitutes x for μ, so GEMUF = x'Σ^{−1}x. In simulation studies, μ is known, so the Neyman-Pearson test statistic is useful for calculating the largest possible DP. The GE in GEMUF is a German abbreviation of "Geschätzter," which means "estimated," so GEMUF means estimated MUF; GEMUF is the same as the Mahalanobis distance from the 0 vector and as Hotelling's multivariate statistic [51].

(9) A nonsequential version of the Neyman-Pearson test, μ'Σ^{−1}x, is useful to calculate the largest possible DP for given Σ and μ. For completeness, four other combined tests are also considered.

Remark 1 (SITMUF transform). The SITMUF transform is recommended for two reasons. First, simulation is typically used to select alarm thresholds, and it is convenient to always work on the same scale when selecting alarm thresholds, so the fact that Y = Σ^{−1/2} MB ∼ N(Σ^{−1/2}μ, I) is convenient. Note that alarm thresholds could be selected on the basis of exact or approximate analytical results for some, but not all, of the tests; for example, there are approximate expressions for Page's test parameters h and k [53]. Second, the standard deviation σ̃_j of the jth MUF residual is given by σ̃_j = sqrt(σ²_j − σ_j' Σ_{(j−1)}^{−1} σ_j), where σ_j = Σ_{j,1:(j−1)} denotes the first (j − 1) entries of the jth row of Σ and Σ_{(j−1)} is the leading (j − 1)-by-(j − 1) submatrix of Σ, so the standard deviation of the MUF residuals decreases in the later periods. Therefore, the independence transform is analogous to a bias adjustment, leading to smaller prediction variance in later periods, which tends to increase the DP for SITMUF compared to MUF (there are exceptions where the DP for MUF is larger than the DP for SITMUF; see the DP results in Section 4.4).
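As one example of the sequential tests above, a one-sided Page's (CUSUM) test applied to a SITMUF-like standardized sequence can be sketched as follows (the allowance k and threshold h shown are common textbook defaults, not the tuned values used in the simulation study):

```python
def page_test(y, k=0.5, h=4.0):
    """One-sided Page's (CUSUM) test on a standardized sequence y:
    S_j = max(0, S_{j-1} + y_j - k), with an alarm in period j when
    S_j > h. Returns the per-period alarm indicators."""
    s, alarms = 0.0, []
    for yj in y:
        s = max(0.0, s + yj - k)
        alarms.append(s > h)
    return alarms

# Standardized residuals near zero: the statistic stays at zero, no alarm.
quiet = page_test([0.1, -0.3, 0.2, 0.0, -0.1, 0.3])

# Protracted loss of about 1 sigma per period: the CUSUM accumulates
# 0.5 per period and first exceeds h = 4.0 in period 9.
loss = page_test([1.0] * 12)
```

Tuning k trades off sensitivity to abrupt versus protracted loss, which is why the combined tests in the study pair two Page's tests with different settings.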
Remark 2 (choosing thresholds). Thresholds can be chosen in many ways and can be held constant across periods or allowed to vary. Therefore, the simulation DP results in Section 4.4 are not claimed to be optimal but are example DP results.
A sensitivity analysis was also performed by simulating a 30% RSD in σ̂_R and a 50% RSD in σ̂_S (see Section 3.2). With these relatively large RSDs in σ̂_R and σ̂_S, there is large uncertainty in the DP; for example, Table 1 lists 95% intervals for the DPs under the loss vector (0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0). The three least-sensitive DPs are in boldface in Table 1 and are those for Page's test on SITMUF, the combined Page's test on MUF, and the combined Page's test on SITMUF. The most sensitive DP, in boldface and underlined, is that for CUMUF.

Summary
Statistical analyses used to support safeguards conclusions require UQ, usually by estimating the RSDs of the random and systematic errors associated with each measurement method. This paper reviewed why UQ is needed in nuclear safeguards and examined recent efforts to improve both bottom-up and top-down UQ for calibration data. A key issue in bottom-up UQ for calibration with only a few calibration standards is that existing analytical approximations for estimating variance components are not sufficiently accurate, so this paper illustrated that simulation is needed. Once calibration UQ is well quantified, whenever improved bottom-up UQ predicts smaller measurement error RSDs than are observed in top-down UQ [1], this is evidence of significant unknown NDA error sources ("dark uncertainty" [9]), which can then potentially be identified.
The RSD of an assay method is often defined as the reproducibility standard deviation as estimated in an interlaboratory comparison. Recent options for top-down UQ, such as constrained estimators or Bayesian estimators that use prior information, offer possible improvements over existing variance component estimators. Any such improvements in estimated RSDs should be accompanied by uncertainties in the RSDs, which means that uncertainty in the estimated uncertainties matters [2,7]. The ABC approach in Section 3 appears to provide a robust estimate of the posterior distribution of the RSDs.
There are other types of bottom-up UQ used in safeguards that are not considered here. For example, the FRAM (fixed-energy, response-function analysis with multiple efficiencies) gamma-based method [54] does not rely on calibration; FRAM's uncertainties are impacted by physical mismatch between test items and assay assumptions, which leads to item-specific bias, and also by uncertainties in nuclear data such as half-lives. This paper also used simulation to evaluate the impact of uncertainty in measurement error RSDs on estimated nuclear material loss detection probabilities in sequences of measured material balances. Many different sequential statistical tests were evaluated. In a related context, [2] evaluated the impact of uncertainty in measurement error RSDs on estimated DPs in verification data.
I_jk = μ_jk + R_jk + S_k, where I_jk is the inspector's measured value of item j (j from 1 to n) in group k (k from 1 to g), μ_jk is the true but unknown value of item j from group k, R_jk ∼ N(0, σ²_R) is a random error on item j from group k, and S_k ∼ N(0, σ²_S) is a short-term systematic error in group k.

Conflicts of Interest
The authors declare that they have no conflicts of interest.