Assessment of the Adequacy of Gauge Repeatability and Reproducibility Study Using a Monte Carlo Simulation

ANOVA gauge repeatability and reproducibility study is the most popular tool for measurement system analysis. Two experimental designs can be applied depending on the durability of the objects. If repeated measurements are possible or sufficient homogeneous nonrepeatable samples are available, the crossed design is appropriate; otherwise, the nested design should be used. In this paper, we investigated the adequacy of the ANOVA gauge repeatability and reproducibility study from the perspective of practitioners. We proposed a Monte Carlo simulation that is close to the realistic procedure to evaluate the adequacy of both structures. During the evaluation, we considered the average performance metrics, percentage of correct decision, histogram shape, and symmetric mean absolute percentage error for the four popular performance metrics, namely, % Study Variation, % Contribution, % Tolerance, and the number of distinct categories. The experimental results show that the nested design fails to judge the precision of the gauge while the crossed design succeeds.


Introduction
Gauge repeatability and reproducibility (GRR) study is a representative measurement system analysis (MSA) tool [1]. Two factors determine the adequacy of a measurement system: accuracy, which covers bias, linearity, stability, and correlation, and precision, which covers repeatability and reproducibility. The main concern of the GRR is whether a measurement system has sufficient precision to measure the variation of the manufactured products or the manufacturing process under consideration. There are three conventional GRR methods: the range method, the average and range method using a control chart, and the analysis of variance (ANOVA) GRR (AGRR) [1]. After the AGRR was introduced by Montgomery and Runger [2,3], it became the most popular tool for MSA because it considers the interaction effects and provides interval estimates for the variance components and the performance metrics [4]. The ANOVA in AGRR measures the variability of observations and estimates variance components. The performance metrics, which are composed of sums or ratios of the estimated variance components, provide the criteria used to analyze the precision of the measurement system. Crossed designs are the standard experimental layouts for AGRR. Nested designs, or hierarchical designs, are used for nonrepeatable measurements such as a destructive test. Even when the measured object is nonrepeatable, the crossed design is appropriate if sufficient homogeneous samples are available [5].
For the last two decades, numerous studies have been conducted on AGRR. Previous studies mainly concentrated on providing theoretical backgrounds, introductions of the AGRR, efficient approximations for narrower confidence intervals of variance components and performance metrics, and variations of AGRR for special experimental structures. AGRR is also a popular tool in the industrial field. QS-9000, a quality standard of the American automotive industry, even provides a guideline for AGRR [6]. From the practitioners' perspective, however, previous theoretical studies are of limited value. The main reason for this is the wide confidence intervals. Theoretical studies have mainly focused on developing efficient approximations to narrow these intervals. Since the AGRR is based on sampling, it is reasonable that the confidence intervals give clearer evidence than the point estimates for arriving at the correct conclusions. However, as Burdick and Larsen [4] noted, the estimated confidence intervals are often too wide to be used. In many cases, the confidence intervals overlap with the decision criteria of AGRR in Table 3, making them unsuitable for assessing the adequacy of the gauge. Therefore, practitioners rely on the point estimates of the performance metrics.
In this situation, it is imperative to verify the adequacy of AGRR with the point estimates. In particular, as Bergeret et al. [7] mentioned, the adequacy of the nested design is quite doubtful. Though the theoretical basis of AGRR is very firm, there are several possible sources that can harm its adequacy. The fundamental assumption of AGRR is that all effects, including the interaction, follow a normal distribution and are independent. If we filter outliers during inspection or select samples arbitrarily to secure a sufficient range of variability, the normality or the independence assumption breaks. The nested design itself is another source of decreasing adequacy. The nested effects interfere with the clean separation of the variance components, and, consequently, the accuracy of the performance metrics decreases.
Practitioners are not concerned with theoretical derivations or proofs. Their primary concern with the AGRR is that the tool works properly to determine the precision of the gauge and, if possible, to find ways to improve the adequacy of the AGRR within budget constraints. Neither theoretical nor practical studies have dealt with these issues. Existing practical studies only focused on offering user guidelines or providing case studies on various applications. The purpose of this paper is to evaluate the adequacy of the AGRR for both the crossed and nested designs and to investigate the causes of any inadequacies. To accomplish this, we constructed a series of Monte Carlo simulations and verified the adequacy via four popular performance metrics: % Study Variation, % Contribution, % Tolerance, and the number of distinct categories (NDC) [8].
Section 2 introduces the conventional AGRR process for both crossed and nested designs and compares the differences in formulas for the performance metrics. In Section 3, we briefly review existing references on the AGRR. The proposed Monte Carlo simulation method and experimental environments are described in detail in Section 4. In Section 5, we summarize the simulation results from various perspectives, show why the nested design AGRR is unsuitable for MSA, and reveal the cause of the inadequacy. Finally, Section 6 provides the conclusions and further discussion.

ANOVA Gauge Repeatability and Reproducibility Study
A standard AGRR uses the crossed design. The two-way random effects model is as follows:

$$y_{ijk} = \mu + S_i + O_j + (SO)_{ij} + R_{k(ij)}, \quad i = 1, \dots, s;\ j = 1, \dots, o;\ k = 1, \dots, r, \tag{1}$$

where $y_{ijk}$ is an observation; $\mu$ is the unknown overall mean; $S_i$, $O_j$, $(SO)_{ij}$, and $R_{k(ij)}$ are random variables that represent the effects of the sample, the operator, the interaction between the sample and the operator, and the replicate, respectively; $o$, $s$, and $r$ are the number of operators, samples, and replicates, respectively. It is generally assumed that $S_i \sim N(0, \sigma_S^2)$, $O_j \sim N(0, \sigma_O^2)$, $(SO)_{ij} \sim N(0, \sigma_{SO}^2)$, and $R_{k(ij)} \sim N(0, \sigma_R^2)$, and that these are independent of each other. The left side of Figure 1 shows an experimental structure of the crossed design with 2 operators, 4 parts, and 2 replicates. In the figure, the number in each observation indicates the subscript of the observation $y_{ijk}$. In the crossed design, two operators measure four distinct samples twice.
Table 1 shows the resulting ANOVA table. If the interaction effect is negligible, that is, the ratio of the interaction mean square to the replicate mean square, $MS_{SO}/MS_R$, is not significant at the chosen significance level, it is pooled into the error term; that is, $MS_R$ and the table are changed. For a detailed explanation, refer to Montgomery and Runger [2,3].
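The estimation and pooling steps for the crossed design can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `crossed_variance_components`, the array layout (samples, operators, replicates), and the clipping of negative estimates to zero are choices made here, and the estimators are the standard ones for a balanced two-way random effects ANOVA.

```python
import numpy as np
from scipy import stats


def crossed_variance_components(y, alpha=0.05):
    """Estimate variance components for a balanced crossed design.

    y: array of shape (s, o, r) = (samples, operators, replicates).
    alpha: significance level for keeping the interaction term.
    """
    s, o, r = y.shape
    grand = y.mean()
    mean_s = y.mean(axis=(1, 2))   # one mean per sample
    mean_o = y.mean(axis=(0, 2))   # one mean per operator
    mean_so = y.mean(axis=2)       # one mean per sample-operator cell

    # Sums of squares of the balanced two-way layout.
    ss_s = o * r * np.sum((mean_s - grand) ** 2)
    ss_o = s * r * np.sum((mean_o - grand) ** 2)
    ss_so = r * np.sum((mean_so - mean_s[:, None] - mean_o[None, :] + grand) ** 2)
    ss_r = np.sum((y - mean_so[:, :, None]) ** 2)

    df_s, df_o = s - 1, o - 1
    df_so, df_r = (s - 1) * (o - 1), s * o * (r - 1)
    ms_s, ms_o = ss_s / df_s, ss_o / df_o
    ms_so, ms_r = ss_so / df_so, ss_r / df_r

    # F-test of the interaction against the replicate (error) mean square.
    p_interaction = stats.f.sf(ms_so / ms_r, df_so, df_r)
    if p_interaction > alpha:
        # Not significant: pool the interaction into the error term.
        ms_r = (ss_so + ss_r) / (df_so + df_r)
        ms_so = ms_r
        var_so = 0.0
    else:
        var_so = (ms_so - ms_r) / r

    est = {
        "repeatability": ms_r,
        "interaction": var_so,
        "operator": (ms_o - ms_so) / (s * r),
        "sample": (ms_s - ms_so) / (o * r),
    }
    # Assumption: negative estimates are reported as zero.
    return {k: max(float(v), 0.0) for k, v in est.items()}
```

With few operators, the operator mean square has very few degrees of freedom, so its estimate is noisy even when the model assumptions hold.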
We can estimate the variance components as follows:

$$\hat{\sigma}_R^2 = MS_R, \quad \hat{\sigma}_{SO}^2 = \frac{MS_{SO} - MS_R}{r}, \quad \hat{\sigma}_O^2 = \frac{MS_O - MS_{SO}}{sr}, \quad \hat{\sigma}_S^2 = \frac{MS_S - MS_{SO}}{or}, \tag{2}$$

where $MS_S$, $MS_O$, $MS_{SO}$, and $MS_R$ are the mean squares of the sample, the operator, the interaction, and the replicate, respectively, for $o$ operators, $s$ samples, and $r$ replicates.

If the samples are destructive, we must apply the nested design instead of the standard crossed design. The right side of Figure 1 shows a nested experimental structure with two operators, four parts, and four replicates. The experiment is a counterpart of the crossed design on the left side with the same number of observations. In the experiment, two operators measure sixteen distinct samples in four batches (or lots). If the samples in a batch are homogeneous and enough samples are available, the crossed design can still be effective. The two-way random effects model for the nested design is as follows:

$$y_{ijk} = \mu + O_i + S_{j(i)} + R_{k(ij)}, \quad i = 1, \dots, o;\ j = 1, \dots, s;\ k = 1, \dots, r, \tag{3}$$

where $y_{ijk}$ is an observation and $\mu$ is the unknown overall mean. $O_i$, $S_{j(i)}$, and $R_{k(ij)}$ are random variables that represent the effects of the operator, the sample nested within the operator, and the replicate, respectively. $o$, $s$, and $r$ are the number of operators, samples per operator, and replicates, respectively. It is also assumed that $O_i \sim N(0, \sigma_O^2)$, $S_{j(i)} \sim N(0, \sigma_{S(O)}^2)$, and $R_{k(ij)} \sim N(0, \sigma_R^2)$ and that these are independent of each other. The estimates of the variance components are as follows:

$$\hat{\sigma}_R^2 = MS_R, \quad \hat{\sigma}_{S(O)}^2 = \frac{MS_{S(O)} - MS_R}{r}, \quad \hat{\sigma}_O^2 = \frac{MS_O - MS_{S(O)}}{sr}. \tag{4}$$

The goal of AGRR is to determine whether the measurement system can properly distinguish the variation of products or processes. To do that, the AGRR extracts the gauge error (repeatability) and the operator error (reproducibility) from the observed measurements and judges the adequacy via performance metrics. The most popular performance metrics in practice are % Study Variation, % Contribution, % Tolerance, and the number of distinct categories (NDC). Minitab, a popular software package in the quality field, provides the four metrics for AGRR [8]. We summarize the calculation formulas and the relevant decision criteria of the metrics for the crossed design and the nested design in Table 3.
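For the nested design, the corresponding estimation can be sketched as follows. This is an illustrative sketch, assuming a balanced layout stored as an array of shape (operators, samples per operator, replicates); the function name `nested_variance_components` and the clipping of negative estimates to zero are assumptions, and the formulas are the standard balanced nested random effects estimators.

```python
import numpy as np


def nested_variance_components(y):
    """Estimate variance components for a balanced nested design.

    y: array of shape (o, s, r) = (operators, samples per operator, replicates).
    """
    o, s, r = y.shape
    grand = y.mean()
    mean_o = y.mean(axis=(1, 2))   # one mean per operator
    mean_s = y.mean(axis=2)        # one mean per sample within operator

    ss_o = s * r * np.sum((mean_o - grand) ** 2)
    ss_s = r * np.sum((mean_s - mean_o[:, None]) ** 2)
    ss_r = np.sum((y - mean_s[:, :, None]) ** 2)

    ms_o = ss_o / (o - 1)
    ms_s = ss_s / (o * (s - 1))
    ms_r = ss_r / (o * s * (r - 1))

    est = {
        "repeatability": ms_r,
        "sample_within_operator": (ms_s - ms_r) / r,
        # The operator mean square is compared with the nested sample mean square.
        "operator": (ms_o - ms_s) / (s * r),
    }
    # Assumption: negative estimates are reported as zero.
    return {k: max(float(v), 0.0) for k, v in est.items()}
```

Because the operator estimate is a difference of two mean squares divided by `s * r`, any sample-to-sample variability inside `ms_o` directly affects it, which is relevant to the later discussion of reproducibility in the nested design.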
According to the formulas in Table 3, under the assumption of the same estimates of the variance components, we can surmise that the values of all performance metrics of the nested design are superior to those of the crossed design. In the crossed design, the measurement variation $\hat{\sigma}_{R\&R}^2$ includes the interaction effect $\hat{\sigma}_{SO}^2$, which makes the values worse. However, practical results contradict this theoretical analysis. Bergeret et al. [7] claimed that the % Contribution and % Tolerance of the nested design are overestimated when compared with the crossed design. They investigated three case studies and argued that improper estimation of repeatability results in the overestimation of the performance metrics. In this situation, it is valuable to determine whether the AGRR, especially with the nested design, is indeed an appropriate tool to determine the precision of the gauge and to investigate how accurate the AGRR is.
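Since the contents of Table 3 are not reproduced here, the following sketch uses the commonly documented definitions of the four metrics (as found in AIAG-style gauge R&R reports) and should be read as an assumption rather than as the paper's exact table. The function name `grr_metrics` and its arguments are hypothetical. For the crossed design, `var_reprod` would be the operator variance plus the interaction variance; for the nested design, it would be the operator variance alone.

```python
import math


def grr_metrics(var_repeat, var_reprod, var_part, tolerance):
    """Compute four gauge R&R metrics from variance components.

    var_repeat: repeatability variance.
    var_reprod: reproducibility variance (must make var_rr positive).
    var_part:   part-to-part (sample) variance.
    tolerance:  width of the specification interval.
    """
    var_rr = var_repeat + var_reprod      # gauge R&R variance
    var_total = var_rr + var_part         # total variance
    sd_rr = math.sqrt(var_rr)
    return {
        "pct_contribution": 100.0 * var_rr / var_total,
        "pct_study_variation": 100.0 * sd_rr / math.sqrt(var_total),
        "pct_tolerance": 100.0 * 6.0 * sd_rr / tolerance,
        # Assumption: NDC is sqrt(2) * sd_part / sd_rr, truncated to an integer.
        "ndc": math.floor(math.sqrt(2.0) * math.sqrt(var_part) / sd_rr),
    }


metrics = grr_metrics(var_repeat=0.5, var_reprod=0.5, var_part=9.0, tolerance=60.0)
```

The example makes the argument of the paragraph concrete: moving the interaction variance out of `var_reprod` and into `var_part`, as the nested formulas do, lowers the first three metrics and raises NDC when the estimates are otherwise identical.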

Previous Studies
In this section, we briefly review previous studies on AGRR. The mainstream theoretical developments on AGRR include accurate and efficient approximation approaches for narrower confidence intervals of the variance components and performance metrics, methods for improving the accuracy of the AGRR, and methods for nonrepeatable measurements.
Montgomery and Runger [2,3] introduced AGRR as an alternative to the conventional GRRs such as the range method and the average and range method. They also suggested a proper experimental design for AGRR and the Satterthwaite approximation for confidence intervals of the variance components. Borror et al. [9] compared two approximations, the restricted maximum likelihood estimation (REML) using SAS PROC MIXED and a modified large sample (MLS) method, to estimate the confidence interval of $\sigma_{R\&R}^2$ for the two-way random effects model. They claimed that the REML is superior to the MLS due to the narrower confidence interval. Burdick and Larsen [4] compared five approximation approaches, that is, MLS [10], Satterthwaite approximation [3], AIAG [1], REML [11], and the Milliken and Johnson method [12], for five quantities: three individual variance components, $\sigma_{R\&R}^2$, and a variance ratio relative to $\sigma_{R\&R}^2$. The simulation results showed that the MLS is superior to the others since it satisfies the confidence coefficient in spite of a wider confidence interval. Dolezal et al. [13] investigated the confidence interval of $\sigma_{R\&R}^2$ for a two-way mixed effects model with fixed operators. They suggested the mixed effects model for a limited number of operators because the interval length is shorter than that of the conventional random effects model. Hamada and Weerahandi [14] proposed a modified generalized inference approximation for $\sigma_{R\&R}^2$ and argued that it provides a shorter confidence interval than the MLS [4]. Chiang [15] also proposed an approximation for $\sigma_{R\&R}^2$ using surrogate variables. He compared the confidence interval to the MLS and maintained that it is an effective general method for the balanced random effects model. Daniels et al. [16] employed the generalized confidence interval approach using a generalized pivotal quantity. They stated that the approach is superior to the MLS if the variance ratio relative to $\sigma_{R\&R}^2$ is less than or equal to 0.2. Wang and Li [17] proposed a bootstrapping method that can estimate the confidence interval when the control chart GRR is applied.
As for the performance metrics and their confidence intervals, Burdick et al. [18] stated that the confidence interval of the variance ratio relative to $\sigma_{R\&R}^2$ is too wide to be used; hence they recommended the Cochran method based on the Satterthwaite approximation [19]. Chiang [20] argued that the confidence coefficients of the MLS and the Satterthwaite approximation for this ratio become low when the variance component in question is less than 0.5. To overcome this phenomenon, he suggested the F-screened MLS, which applies the MLS only when the variance component is statistically significant. Burdick et al. [21] reviewed previous research on AGRR and stated that the precision-to-tolerance ratio (% Tolerance), signal-to-noise ratio (SNR), and discrimination ratio (DR) are popular performance metrics for AGRR. Adamec and Burdick [22] compared the performance of the MLS and the generalized inference procedure for the DR in a three-way random effects model. Burdick et al. [23] proposed the generalized inference procedure for the misclassification rate. Woodall and Borror [24] reviewed and analyzed the relationships among popular performance metrics: % Study Variation [1,8], NDC [1,8], SNR [1,21], DR [1], and misclassification rates.
There were a few studies on improving the accuracy of AGRR. Pan [6] calculated the optimal set of the numbers of operators, samples, and replicates, $(o, s, r)$, to provide the shortest confidence interval of $\sigma_{R\&R}^2$ for various combinations of the variance components under the same number of observations. Browne et al. [25] proposed a two-stage AGRR to increase the adequacy of AGRR. At a baseline stage, a number of operators measure a sample to obtain an appropriate range of samples. Then, at the second stage, a standard AGRR is conducted with the samples. They argued that the approach provides smaller standard deviations for the estimated ratios involving $\sigma_{R\&R}$ than the normal one-stage AGRR. Pan et al. [26] suggested a revised % Tolerance for the multivariate GRR that provides a smaller mean squared error and mean absolute percentage error than the conventional % Tolerance. They also calculated the optimal $(o, s, r)$ in terms of the new performance metric using principal component analysis.
Research on the nested design is rare. Bergeret et al. [7] applied the nested design to three case studies with destructive samples and argued that the nested design overestimates % Study Variation and % Tolerance. Mast and Trip [27] introduced four assumptions for AGRR: the consistency of bias, homogeneity of measurement errors, temporal stability of objects, and robustness against measurement. They defined a nonrepeatable measurement as a measurement for which the last two assumptions are not satisfied. Furthermore, they proposed several alternative AGRRs that are suitable for various homogeneity assumptions. Van Der Meulen et al. [28] developed a compensation method to reduce the overestimation of the nested design for nonrepeatable measurements.
There are many practical studies on AGRR, but those mainly focused on introducing basic theoretical knowledge, suggesting systematic user guidelines, and providing case studies for various applications. We will skip the review of those references.

In the conventional Monte Carlo verification, with $df_X$ the degrees of freedom and $MS_X$ the mean square of an effect $X$, the population of $MS_X$ can be generated using the chi-squared distribution, and subsequently, the estimates of the variance components and the performance metrics can be calculated. This simulation approach, however, has two weaknesses. First, the normality assumption of the effects must be satisfied. If we limit the random sampling by inspection processes or select samples or operators arbitrarily to obtain better results, we cannot generate the population because the normality condition is broken. Second, the true values of the variance components and performance metrics are still unknown; therefore the adequacy of AGRR cannot be judged. For these reasons, we propose a new Monte Carlo simulation for verifying the adequacy of AGRR. This approach generates populations of all effects instead of populations of the mean squares. An observation then consists of the effects sampled from the populations. Since the true variance components can be calculated from the populations, intensive evaluation is possible. In addition, during the procedure, it is possible to employ various realistic constraints, such as inspection, without loss of generality. The detailed procedure of the proposed Monte Carlo simulation is in Algorithm 1.
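The conventional approach mentioned in this paragraph, generating mean squares directly from a chi-squared distribution, can be sketched as follows. This is a hedged illustration of the general technique, not code from any of the cited studies: the function name `simulate_mean_squares` is hypothetical, and it relies on the standard result that, under normality in a balanced design, a mean square with `df` degrees of freedom is distributed as its expected value times a chi-squared variable divided by `df`.

```python
import numpy as np


def simulate_mean_squares(expected_ms, df, n_draws, rng):
    """Draw simulated mean squares as expected_ms * chi2(df) / df."""
    return expected_ms * rng.chisquare(df, size=n_draws) / df


rng = np.random.default_rng(2)
# Example: a mean square whose expected value is 2.0, with 10 degrees of freedom.
ms_draws = simulate_mean_squares(expected_ms=2.0, df=10, n_draws=200_000, rng=rng)
```

The sketch also shows the limitation noted in the text: the expected mean squares must be supplied as inputs, and the draws are only valid while the normality assumption holds.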

Experimental Setup
At the beginning of each scenario, the levels of the sample, operator, interaction, and replicate variances ($\sigma_S^2$, $\sigma_O^2$, $\sigma_{SO}^2$, and $\sigma_R^2$) are assigned. The total number of scenarios $N_{\text{scenario}}$ is 140 since the number of levels of $\sigma_S^2$ is 14 in Table 4 and there are 10 combinations of $\sigma_O^2$, $\sigma_{SO}^2$, and $\sigma_R^2$ at each level of $\sigma_S^2$ in Table 5. The population of each effect is generated from the truncated normal distribution bounded by $\pm 3\sigma$ of that effect. In practice, an inspection process may filter outliers of the products, so the truncated normal assumption is reasonable for the actual samples. The normal distribution can be used if such filtering is insignificant. The size of each population $N_{\text{pop}}$ is 10,000 except for the interaction, which is 10,000 by 10,000. The true values of the performance metrics are computed by the crossed design formulas in Table 3 with the population parameters, that is, the variances of the sample, operator, interaction, and replicate effects realized in the generated populations. A set of observations is generated by Equations (1) and (3) with the effects sampled from the populations. The structure of the observations follows the experimental design in Table 6; Section 4.3 explains the experimental design in more detail. In the next step, AGRRs using the crossed and nested designs estimate the variance components, where the significance level for deciding whether to pool the interaction is 0.05. For every AGRR, the performance metrics, % Study Variation, % Contribution, % Tolerance, and NDC, are calculated by the formulas in Table 3, where the Tolerance is $6\sigma$. These steps are repeated $N_{\text{repeat}}$ (100) times.
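The population generation and the composition of crossed-design observations can be sketched as follows. This is a minimal sketch under assumptions: the helper names `make_population` and `crossed_observations` are hypothetical, much smaller populations than 10,000 by 10,000 are used in the example to keep it light, and `scipy.stats.truncnorm` stands in for whatever generator the authors used.

```python
import numpy as np
from scipy import stats


def make_population(variance, size, rng):
    """Draw a population from N(0, variance) truncated at +/- 3 standard deviations."""
    sd = np.sqrt(variance)
    return stats.truncnorm.rvs(-3.0, 3.0, loc=0.0, scale=sd,
                               size=size, random_state=rng)


def crossed_observations(pop_s, pop_o, pop_so, pop_r, s, o, r, rng, mu=0.0):
    """Compose y[i, j, k] = mu + S_i + O_j + (SO)_ij + R_k(ij).

    pop_so must have shape (len(pop_s), len(pop_o)).
    """
    i = rng.choice(len(pop_s), size=s, replace=False)   # sampled parts
    j = rng.choice(len(pop_o), size=o, replace=False)   # sampled operators
    sample_eff = pop_s[i][:, None, None]
    operator_eff = pop_o[j][None, :, None]
    interaction_eff = pop_so[np.ix_(i, j)][:, :, None]
    replicate_eff = rng.choice(pop_r, size=(s, o, r))
    return mu + sample_eff + operator_eff + interaction_eff + replicate_eff


rng = np.random.default_rng(1)
pop_s = make_population(4.0, 10_000, rng)
pop_o = make_population(1.0, 50, rng)
pop_so = make_population(0.5, (10_000, 50), rng)
pop_r = make_population(0.25, 10_000, rng)
y = crossed_observations(pop_s, pop_o, pop_so, pop_r, s=4, o=2, r=2, rng=rng)
true_var_s = pop_s.var()   # "true" sample variance realized in the population
```

Truncation makes the realized population variance slightly smaller than the assigned level, which is why the true values are computed from the generated populations rather than taken from the assigned levels.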

Simulation Parameters. Among the performance metrics, the sample variance $\sigma_S^2$ affects the % Study Variation, % Contribution, and NDC. [...] is fixed at three, the experimental design seems to be limited. However, except for % Tolerance, the performance metrics depend not on the magnitudes of the variance components but on the ratios among them. From this perspective, our experimental design, which covers a total of 140 scenarios, is sufficient.
Algorithm 1

For 1 to $N_{\text{scenario}}$ Do:
    Step 1. Assign a set of variance components ($\sigma_S^2$, $\sigma_O^2$, $\sigma_{SO}^2$, $\sigma_R^2$) according to the assigned experimental design in Tables 4 and 5.
    Step 2. Generate the populations of effects from the truncated normal distribution.
    Step 3. Calculate the true variance components and the true values of the performance metrics from the populations of effects.
    For 1 to $N_{\text{repeat}}$ Do:
        For 1 to $N_{\text{experimental design}}$ Do:
            Step 4. Generate observations.
            Step 4-1. Select samples from each population of effects by the experimental design in Table 6.
            Step 4-2. Compose the observations from the sampled effects by Equations (1) and (3).
            Step 5-1. Perform the AGRR with the observations.
            Step 5-2. Calculate the estimated variance components ($\hat{\sigma}_S^2$, $\hat{\sigma}_O^2$, $\hat{\sigma}_{SO}^2$, $\hat{\sigma}_R^2$).
            Step 5-3. Calculate the performance metrics for the crossed design and the nested design, respectively.
        End
        Step 6. Calculate the mean of the performance metrics for the crossed design and the nested design.
    End
End

Experimental Design for the Crossed and Nested Designs.
Direct comparison of performance metrics between the crossed design and the nested design is meaningless because they are used in different environments. However, it is valuable to evaluate and compare their adequacy levels. To do that, we designed the two structures so that they use the same observations in the experiment. For example, in Figure 1, we use the same observations in both designs. The experimental design that maintains the same number of observations for the crossed and the nested design is as follows.
The number of operators, samples, and replicates also affects the performance metrics. To investigate their effects, we set up a $2^3$ factorial design for the number of operators in the crossed design ($o_c$), the number of samples per operator in the nested design ($s_n$), and the number of replicates in the crossed design ($r_c$), as shown in Table 6. In order to match the number of observations, the number of operators in the nested design ($o_n$), the number of samples in the crossed design ($s_c$), and the number of replicates in the nested design ($r_n$) are assigned $o_c$, $o_n s_n$, and $o_c r_c$, respectively.
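The matching rule can be checked with a few lines of Python. The function name `matched_designs` and the symbols `o_c`, `s_n`, `r_c` are labels chosen here for the three factorial factors; the rule itself is inferred from the requirement that both designs use the same number of observations and from the Figure 1 example (2 operators, 4 samples, 2 replicates crossed versus 2 operators, 2 batches per operator, 4 replicates nested).

```python
def matched_designs(o_c, s_n, r_c):
    """Return (crossed, nested) designs with equal numbers of observations.

    o_c: operators in the crossed design.
    s_n: samples (batches) per operator in the nested design.
    r_c: replicates in the crossed design.
    """
    o_n = o_c            # same operators in both designs
    s_c = o_n * s_n      # all batches are treated as crossed samples
    r_n = o_c * r_c      # each batch absorbs the measurements of every operator
    return (o_c, s_c, r_c), (o_n, s_n, r_n)


crossed, nested = matched_designs(o_c=2, s_n=2, r_c=2)
# crossed == (2, 4, 2) and nested == (2, 2, 4): 16 observations each
```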

Performance Metrics.
Figure 2 shows the trajectories of the averages of the four performance metrics for the crossed design, the nested design, and the population over the sample variance $\sigma_S^2$. The two dashed horizontal lines indicate the rule-of-thumb decision criteria of each performance metric. Regions I, II, and III represent the unacceptable, the pending, and the acceptable regions, respectively, based on the performance metrics of the population. Since $\sigma_O^2 + \sigma_{SO}^2 + \sigma_R^2$ and the tolerance are constants, the performance metrics of the crossed design are functions of $\sigma_S^2$. As for the nested design, the performance metrics are also functions of $\sigma_S^2$ because averaging and the orthogonal designs compensate for the individual effects of $\sigma_O^2$, $\sigma_{SO}^2$, and $\sigma_R^2$. In Figure 2, all average performance metrics of the crossed design are very close to the values of the population, which are the true values. This implies that the adequacy level of the crossed design is very high. However, the metrics of the nested design differ from the values of the population. In particular, the trajectory of the nested design for the % Tolerance differs from the trajectory of the population. This is caused by the overestimation of the operator variance $\hat{\sigma}_O^2$ (Section 5.4 elaborates on this). Moreover, the population shares the formulas for the performance metrics with the crossed design. The interaction variance $\sigma_{SO}^2$ is absorbed into the sample variance in the nested design, but it is added to $\sigma_{R\&R}^2$ in the crossed design and the population. In theory, the gap between the nested design and the population should decrease in region III since $\sigma_S^2$ is much larger than $\sigma_{SO}^2$, yet the gap remains. Region III is important because the metrics of a good measurement system are positioned in this region. Therefore, we can conclude that the nested design does not provide the correct result on gauge precision.

Percentage of Correct Decision.
To investigate the nested design further, we employed a new metric, the percentage of correct decision (PCD). The PCD is the percentage of AGRR runs in which the crossed or nested design reaches the same decision as the population. Figure 3 shows the PCD over $\sigma_S^2$ for each performance metric; the vertical dashed lines and the regions are the same as those in Figure 2. The PCD of the crossed design decreases to about 50% around the borderlines of the decision criteria and is close to 100% over the rest of the range of $\sigma_S^2$. This is reasonable because even a small variation of the performance metrics at the borderlines results in a different decision. If the AGRR works correctly, the PCD must be close to 100% except at the borderlines. However, the trajectories of the PCD of the nested design decrease rapidly to almost zero and do not recover to 100% over the remaining range of $\sigma_S^2$. In region III, they are about 60% (70% in the case of NDC). This implies that the decision of the nested design is almost random.
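The PCD can be computed as in the following sketch. The thresholds of 10 and 30 are the commonly cited rule of thumb for % Study Variation and are an assumption here, since the criteria of Table 3 are not reproduced; the function names `decide` and `pcd` are hypothetical. A metric such as NDC, where larger values are better, would need the comparison reversed.

```python
def decide(value, lower=10.0, upper=30.0):
    """Classify a 'smaller is better' metric against two thresholds."""
    if value < lower:
        return "acceptable"
    if value <= upper:
        return "pending"
    return "unacceptable"


def pcd(estimates, true_value, lower=10.0, upper=30.0):
    """Percentage of runs whose decision matches the population's decision."""
    truth = decide(true_value, lower, upper)
    hits = [decide(v, lower, upper) == truth for v in estimates]
    return 100.0 * sum(hits) / len(hits)


# Four simulated % Study Variation estimates against a true value of 9.0:
# two fall in the same class as the truth, two do not.
result = pcd([8.0, 9.5, 12.0, 31.0], true_value=9.0)
```

A true value close to a threshold, as in this example, naturally yields a PCD near 50% even for an unbiased estimator, which matches the behavior described for the crossed design at the borderlines.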

Effects of the Allocation of the Variance Components and the Experimental Design.
In this subsection, we investigate the causes of the poor quality of the nested design from the perspective of the allocation of variance components and the experimental design. Figure 4 shows the % Study Variation for the allocations of the variance components ($\sigma_O^2$, $\sigma_{SO}^2$, $\sigma_R^2$) in Table 5, and Figure 5 shows the % Study Variation for the experimental designs in Table 6, that is, the numbers of operators, samples, and replicates of the crossed design $(o_c, s_c, r_c)$ and of the nested design $(o_n, s_n, r_n)$. As mentioned in Section 5.1, the metric should approach that of the population as $\sigma_S^2$ increases. However, for the nested design the gap is still large in region III irrespective of the allocation of $\sigma_O^2$, $\sigma_{SO}^2$, and $\sigma_R^2$, while the gap of the crossed design is close to zero. The same phenomenon appears for the experimental design. In general, increasing the degrees of freedom improves the estimation quality of the variance components in AGRR and thus decreases the gap to the true value. However, in Figure 5, changing the experimental design does not significantly reduce the gap of the nested design. The adequacy of the nested design is still very low in all regions. From the above results, we can conclude that neither the allocation of the variance components nor the experimental design is the critical cause of the poor adequacy of the nested design.
It coincides with the theoretical result. However, this is not so with reproducibility. In the case of the crossed design, the shape of the histogram still resembles a chi-squared distribution. However, as shown in Figure 7, the histograms of the nested design differ: they have many zeroes and spread widely over the whole range. The zeroes of reproducibility make the effect of the operator statistically insignificant; hence, the results of AGRR are unstable. This implies that the estimation of reproducibility in the nested design is inadequate, and it could be the main reason for the inadequacy of the design. Table 7 shows the estimates of the variance components at the fourteen levels of $\sigma_S^2$ for the population, the crossed design, and the nested design. All estimates of the crossed design are very close to the values of the population. On the other hand, the AGRR of the nested design overestimates the reproducibility, $\hat{\sigma}_O^2$, while it properly estimates the repeatability and $\hat{\sigma}_S^2 + \hat{\sigma}_{SO}^2$. The overestimation of $\hat{\sigma}_O^2$ increases with $\sigma_S^2$. Since several estimates of $\hat{\sigma}_O^2$ are zero in Figure 7, its overestimation on average implies that the nonzero estimates are very large. In the nested design, an operator does not share samples with other operators; therefore, it is hard to separate the variability of the operator from that of the sample effectively in ANOVA. That is, the sample variability $\sigma_S^2$ leaks into the estimate $\hat{\sigma}_O^2$, which inflates $\hat{\sigma}_O^2$ and lowers the adequacy.

Evaluation of Robustness.
The AGRR is based on sampling statistics. Even when the population is the same, the result of each run of the AGRR can differ. If an AGRR method is reliable and robust, the variance of the results should be small. To investigate the robustness of AGRR, we employed the symmetric mean absolute percentage error (sMAPE) as follows [29]:

$$\mathrm{sMAPE}_j = \frac{100\%}{n} \sum_{i=1}^{n} \frac{\left| \hat{\sigma}_{i,j}^2 - \sigma_j^2 \right|}{\left( \hat{\sigma}_{i,j}^2 + \sigma_j^2 \right)/2}, \tag{5}$$

where $\hat{\sigma}_{i,j}^2$ denotes the estimate of a variance component with the $i$th set of observations, $\sigma_j^2$ is the true value of that variance component from the population, $n$ is the number of repetitions, and $j$ is the index of the scenario. sMAPE is a scale-independent variability measure whose range is from −200% to 200%. However, since the values of $\hat{\sigma}_{i,j}^2$ and $\sigma_j^2$ are all nonnegative in our case, $\mathrm{sMAPE}_j$ is nonnegative. The smaller the $\mathrm{sMAPE}_j$, the lower the variability. Figure 8 shows the averaged $\mathrm{sMAPE}_j$ of repeatability, reproducibility, R&R, and $\hat{\sigma}_S^2$ for the crossed design or $\hat{\sigma}_S^2 + \hat{\sigma}_{SO}^2$ for the nested design. For the crossed design, all averaged sMAPE values are very stable. On the other hand, those of the nested design are unstable except for repeatability. The errors of the reproducibility and, consequently, of the R&R are critical in regions II and III, where the value is over 150%. This result implies that the reproducibility estimate of the nested design is not robust.
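A direct computation of sMAPE for one scenario might look like the following. This is a sketch that assumes the usual definition of sMAPE (absolute error divided by the mean of the estimate and the true value); the function name `smape` is hypothetical, and the inputs must not make the denominator zero.

```python
import numpy as np


def smape(estimates, true_value):
    """Symmetric mean absolute percentage error, in percent."""
    est = np.asarray(estimates, dtype=float)
    return 100.0 * np.mean(np.abs(est - true_value) / ((est + true_value) / 2.0))


# One exact estimate (0% error) and one estimate of three times the truth (100%).
example = smape([1.0, 3.0], true_value=1.0)
# An estimate of exactly zero contributes the maximum error of 200%.
worst = smape([0.0], true_value=1.0)
```

The second call illustrates why a set of estimates containing many zeroes and a few very large values, as reported for the nested-design reproducibility, produces sMAPE values close to the upper end of the range.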

Conclusions
In this paper, we evaluated the adequacy of the AGRR from the perspective of practitioners. To this end, we designed a Monte Carlo simulation, which is different from conventional approaches but close to the actual AGRR process. We considered and compared two main experimental structures, the crossed and nested designs, regarding four popular performance metrics for various combinations of the sample, operator, interaction, and replicate variances ($\sigma_S^2$, $\sigma_O^2$, $\sigma_{SO}^2$, and $\sigma_R^2$). The experimental results show that the crossed design is adequate from all evaluation perspectives: the average performance metrics, PCD, histogram shape, and sMAPE. However, the adequacy of the nested design is very low in all evaluation terms. We revealed that the inadequacy comes from the overestimation of the operator variance $\hat{\sigma}_O^2$. We tried to solve this problem by increasing the number of operators, but the problem remained unsolved. In conclusion, we strongly recommend not applying the nested design as a tool of AGRR unless a solution to this problem is found. Such a solution could be a compensation coefficient for $\hat{\sigma}_O^2$ or another experimental design. This could be a topic of further research.

4.1. Procedure for Simulation Experiment. Adequacy implies the ability to perform the desired goal. The purpose of AGRR is to determine the sufficiency of precision of a measurement

Figure 2: Average performance metrics over the sample variance.

Figure 3: Percentage of correct decision by performance index for all populations.

Figure 4: Average % Study Variation for the initial variance components.

Figure 5: Average % Study Variation for the experimental designs.

Figure 6: Histogram of sample repeatability from four different populations.

Figure 7: Histogram of sample reproducibility from four different populations.

Table 2 shows the ANOVA table of the nested design.

Table 1: ANOVA table of the crossed design under two-way random effects model.

Table 2: ANOVA table of the nested design under two-way random effects model.

Table 3: Variance components, performance metrics, and their decision criteria of the crossed design and the nested design.

To verify the adequacy of AGRR, we require information on the population; in other words, the true values must be known. However, it is impossible to obtain complete data for a population. One alternative for overcoming this problem is to use simulation. Most existing studies applied a Monte Carlo simulation for verification. If an effect $X \sim N(0, \sigma_X^2)$, then $df_X \, MS_X / E(MS_X) \sim \chi^2(df_X)$, or $MS_X \sim E(MS_X) \, \chi^2(df_X) / df_X$, where $df_X$ is the degrees of freedom of $X$, $MS_X$ is the mean square of $X$, and $E(MS_X)$ is its expected value.

Table 6: Levels of factors of experimental design.

Table 7: Average estimates of variance components of the population, the crossed design, and the nested design.