Mathematical Problems in Engineering, vol. 2017, Article ID 7237486. DOI: 10.1155/2017/7237486

Research Article: Assessment of the Adequacy of Gauge Repeatability and Reproducibility Study Using a Monte Carlo Simulation

Chunghun Ha (School of Information & Computer Engineering, Hongik University, 94 Wausan-ro, Mapo-gu, Seoul 04066, Republic of Korea), David S. Kim (School of Mechanical, Industrial and Manufacturing Engineering, Oregon State University, Corvallis, OR 97330, USA), and SeJoon Park (Department of Industrial and Management Engineering, Myongji University, 116 Myonggi-Ro, Cheoin-Gu, Yongin-Si, Gyeonggi-Do 449-728, Republic of Korea)

Academic Editor: J.-C. Cortés. Received 7 March 2017; Accepted 20 June 2017; Published 15 August 2017.

Copyright © 2017 Chunghun Ha et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The ANOVA gauge repeatability and reproducibility (GRR) study is the most popular tool for measurement system analysis. Two experimental designs can be applied, depending on the durability of the measured objects: if repeated measurements are possible, or if sufficiently many homogeneous nonrepeatable samples are available, the crossed design is appropriate; otherwise, the nested design should be used. In this paper, we investigate the adequacy of the ANOVA GRR study from the perspective of practitioners. We propose a Monte Carlo simulation that closely follows the realistic procedure to evaluate the adequacy of both structures. In the evaluation, we consider the average performance metrics, the percentage of correct decisions, histogram shapes, and the symmetric mean absolute percentage error for the four popular performance metrics: % Study Variation, % Contribution, % Tolerance, and the number of distinct categories. The experimental results show that the nested design fails to judge the precision of the gauge correctly, while the crossed design succeeds.

Funding: National Research Foundation of Korea (NRF-2013R1A1A2006947).
1. Introduction

The gauge repeatability and reproducibility (GRR) study is a representative measurement system analysis (MSA) tool. Two factors determine the adequacy of a measurement system: accuracy (e.g., bias, linearity, stability, and correlation) and precision (repeatability and reproducibility). The main concern of a GRR study is whether a measurement system has sufficient precision to measure the variation of the manufactured products or the manufacturing process under consideration. There are three conventional GRR methods: the range method, the average and range method using a control chart, and the analysis of variance (ANOVA) GRR (AGRR). After the AGRR was introduced by Montgomery and Runger [2, 3], it became the most popular tool for MSA because it considers interaction effects and provides interval estimates for the variance components and the performance metrics. The ANOVA in AGRR measures the variability of the observations and estimates the variance components. The performance metrics, which are composed of sums or ratios of the estimated variance components, provide the criteria used to assess the precision of the measurement system. Crossed designs are the standard experimental layout for AGRR. Nested, or hierarchical, designs are used for nonrepeatable measurements such as destructive tests. Even when the measured object is nonrepeatable, if sufficiently many homogeneous samples can be obtained, the crossed design is still appropriate.

For the last two decades, numerous studies have been conducted on AGRR. Previous studies mainly concentrated on theoretical backgrounds, introductions to the AGRR, efficient approximations for narrower confidence intervals of the variance components and performance metrics, and variations of AGRR for special experimental structures. AGRR is a popular tool in industry; QS-9000, a quality standard of the American automotive industry, even provides a guideline for AGRR. From the practitioners' perspective, however, these theoretical studies are of limited value. The main reason is the width of the confidence intervals. Since the AGRR is based on sampling, it is reasonable that confidence intervals give clearer evidence than point estimates for arriving at correct conclusions, and theoretical studies have therefore focused on efficient approximations that narrow the wide confidence intervals. However, as Burdick and Larsen noted, the estimated confidence intervals are often still too wide to be useful: in many cases they overlap with the decision criteria of AGRR in Table 3, making them unsuitable for assessing the adequacy of the gauge. Therefore, practitioners rely on the point estimates of the performance metrics.

In this situation, it is imperative to verify the adequacy of AGRR based on the point estimates. In particular, as Bergeret et al. noted, the adequacy of the nested design is quite doubtful. Although the theoretical basis of AGRR is firm, several factors can harm its adequacy. The fundamental assumption of AGRR is that all effects, including the interaction, follow normal distributions and are independent. If outliers are filtered out during inspection, or samples are selected arbitrarily to secure a sufficient range of variability, the normality or independence assumption breaks down. The nested design itself is another source of reduced adequacy: the nesting interferes with a clean separation of the variance components, and, consequently, the accuracy of the performance metrics decreases.

Practitioners are not concerned with theoretical derivations or proofs. Their primary concerns are whether the AGRR works properly in determining the precision of the gauge and, if possible, how to improve the adequacy of the AGRR within budget constraints. Neither theoretical nor practical studies have dealt with these issues; existing practical studies have focused only on offering user guidelines or providing case studies for various applications. The purpose of this paper is to evaluate the adequacy of the AGRR for both the crossed and nested designs and to investigate the causes of any inadequacies. To accomplish this, we construct a series of Monte Carlo simulations and verify the adequacy via four popular performance metrics: % Study Variation, % Contribution, % Tolerance, and the number of distinct categories (NDC).

Section 2 introduces the conventional AGRR process for both the crossed and nested designs and compares the differences in the formulas for the performance metrics. In Section 3, we briefly review existing references on the AGRR. The proposed Monte Carlo simulation method and the experimental environments are described in detail in Section 4. In Section 5, we summarize the simulation results from various perspectives, show why the nested design AGRR is unsuitable for MSA, and reveal the cause of its inadequacy. Finally, in Section 6, the conclusions and further discussion are provided.

2. ANOVA Gauge Repeatability and Reproducibility Study

A standard AGRR uses the crossed design. The two-way random effects model is as follows:

(1) y_ijk = μ + O_i + S_j + (SO)_ij + E_ijk, i ∈ {1, 2, …, o}, j ∈ {1, 2, …, s}, k ∈ {1, 2, …, r},

where y_ijk is an observation; μ is the unknown overall mean; S, O, SO, and E are random variables that represent the effects of the sample, the operator, the interaction between the sample and the operator, and the replicate, respectively; and o, s, and r are the numbers of operators, samples, and replicates, respectively. It is generally assumed that O_i ~ N(0, σ_O^2), S_j ~ N(0, σ_S^2), (SO)_ij ~ N(0, σ_SO^2), and E_ijk ~ N(0, σ_E^2) and that these are independent of each other. The left side of Figure 1 shows an experimental structure of the crossed design with 2 operators, 4 parts, and 2 replicates; the numbers attached to the observations indicate ijk. In the crossed design, the two operators each measure the four distinct samples twice.

Examples of experimental designs for a crossed design and a nested design.

Table 1 shows the resulting ANOVA table. If the interaction effect is negligible, that is, if the F statistic V_SO/V_E is not significant at the chosen level, the interaction is pooled into the error term E and the table changes accordingly. For a detailed explanation, refer to Montgomery and Runger [2, 3].

ANOVA table of the crossed design under two-way random effects model.

| Source of variability | Sum of squares (S) | Degrees of freedom (d) | Mean square (V) | Expected mean square | F |
| Operator (O) | S_O | d_O = o − 1 | V_O = S_O/d_O | θ_O = σ_E^2 + r·σ_SO^2 + sr·σ_O^2 | V_O/V_SO |
| Sample (S) | S_S | d_S = s − 1 | V_S = S_S/d_S | θ_S = σ_E^2 + r·σ_SO^2 + or·σ_S^2 | V_S/V_SO |
| Interaction (S × O) | S_SO | d_SO = (s − 1)(o − 1) | V_SO = S_SO/d_SO | θ_SO = σ_E^2 + r·σ_SO^2 | V_SO/V_E |
| Replicate (E) | S_E | d_E = so(r − 1) | V_E = S_E/d_E | θ_E = σ_E^2 | |
| Total | S_T | sor − 1 | | | |

We can estimate the variance components as follows:

(2) σ̂_E^2 = V_E, σ̂_SO^2 = (V_SO − V_E)/r, σ̂_S^2 = (V_S − V_SO)/(or), σ̂_O^2 = (V_O − V_SO)/(sr).
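The computation in Equation (2) can be sketched numerically. The following Python function (an illustrative sketch; the function and key names are ours, not from the paper) computes the mean squares of Table 1 for a balanced crossed data set of shape operators × samples × replicates and then applies the method-of-moments estimates, truncating negative estimates at zero as is commonly done in practice.

```python
import numpy as np

def crossed_anova_components(y):
    """Variance component estimates for a balanced crossed design.

    y has shape (o, s, r): operators x samples x replicates.
    Computes the mean squares of Table 1 and applies Equation (2);
    negative method-of-moments estimates are truncated at zero.
    """
    o, s, r = y.shape
    grand = y.mean()
    op_mean = y.mean(axis=(1, 2))     # per-operator means
    samp_mean = y.mean(axis=(0, 2))   # per-sample means
    cell_mean = y.mean(axis=2)        # operator-by-sample cell means

    # Mean squares (sums of squares divided by degrees of freedom)
    V_O = s * r * np.sum((op_mean - grand) ** 2) / (o - 1)
    V_S = o * r * np.sum((samp_mean - grand) ** 2) / (s - 1)
    V_SO = r * np.sum((cell_mean - op_mean[:, None]
                       - samp_mean[None, :] + grand) ** 2) / ((o - 1) * (s - 1))
    V_E = np.sum((y - cell_mean[..., None]) ** 2) / (o * s * (r - 1))

    # Equation (2)
    return {
        "sigma2_E": V_E,
        "sigma2_SO": max((V_SO - V_E) / r, 0.0),
        "sigma2_S": max((V_S - V_SO) / (o * r), 0.0),
        "sigma2_O": max((V_O - V_SO) / (s * r), 0.0),
    }
```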

If the samples are destroyed by measurement, we must apply the nested design instead of the standard crossed design. The right side of Figure 1 shows a nested experimental structure that is the counterpart of the crossed design on the left side, with the same number of observations: the two operators measure sixteen distinct samples, organized into four batches (or lots) with four samples each. If the samples within a batch are homogeneous and enough samples are available, the crossed design can still be effective. The two-way random effects model for the nested design is as follows:

(3) y_ijk = μ + O_i + S(O)_j(i) + E_ijk, i ∈ {1, 2, …, o}, j ∈ {1, 2, …, s}, k ∈ {1, 2, …, r},

where y_ijk is an observation and μ is the unknown overall mean; O, S(O), and E are random variables that represent the effects of the operator, the sample nested within the operator, and the replicate, respectively; and o, s, and r are the numbers of operators, samples per operator, and replicates, respectively. It is also assumed that O_i ~ N(0, σ_O^2), S(O)_j(i) ~ N(0, σ_S(O)^2), and E_ijk ~ N(0, σ_E^2) and that these are independent of each other. Table 2 shows the ANOVA table of the nested design.

ANOVA table of the nested design under two-way random effects model.

| Source of variability | Sum of squares (S) | Degrees of freedom (d) | Mean square (V) | Expected mean square | F |
| Operator (O) | S_O | d_O = o − 1 | V_O = S_O/d_O | θ_O = σ_E^2 + r·σ_S(O)^2 + sr·σ_O^2 | V_O/V_S(O) |
| Sample(Operator) (S(O)) | S_S(O) | d_S(O) = o(s − 1) | V_S(O) = S_S(O)/d_S(O) | θ_S(O) = σ_E^2 + r·σ_S(O)^2 | V_S(O)/V_E |
| Replicate (E) | S_E | d_E = so(r − 1) | V_E = S_E/d_E | θ_E = σ_E^2 | |
| Total | S_T | sor − 1 | | | |

Variance components, performance metrics, and their decision criteria of the crossed design and the nested design.

| Component | Equation | Crossed design | Nested design | Criteria |
| Repeatability | σ̂_RPT^2 | σ̂_E^2 | σ̂_E^2 | |
| Reproducibility | σ̂_RPD^2 | σ̂_O^2 + σ̂_SO^2 | σ̂_O^2 | |
| R&R | σ̂_R&R^2 = σ̂_RPT^2 + σ̂_RPD^2 | σ̂_E^2 + σ̂_O^2 + σ̂_SO^2 | σ̂_E^2 + σ̂_O^2 | |
| Sample | σ̂_S^2 or σ̂_S(O)^2 | σ̂_S^2 | σ̂_S(O)^2 = σ̂_S^2 + σ̂_SO^2 | |
| Total | σ̂_T^2 | σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2 | σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2 | |
| % Study Variation | (σ̂_R&R/σ̂_T) × 100 | √((σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)/(σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)) × 100 | √((σ̂_E^2 + σ̂_O^2)/(σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)) × 100 | ≤10% acceptable; 10~30% pending; >30% unacceptable |
| % Contribution | (σ̂_R&R^2/σ̂_T^2) × 100 | (σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)/(σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2) × 100 | (σ̂_E^2 + σ̂_O^2)/(σ̂_S^2 + σ̂_E^2 + σ̂_O^2 + σ̂_SO^2) × 100 | ≤1% acceptable; 1~9% pending; >9% unacceptable |
| % Tolerance | (5.15 × σ̂_R&R/Tolerance) × 100 | 5.15 × √(σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)/Tolerance × 100 | 5.15 × √(σ̂_E^2 + σ̂_O^2)/Tolerance × 100 | ≤10% acceptable; 10~30% pending; >30% unacceptable |
| Number of distinct categories (NDC) | 1.41 × σ̂_S/σ̂_R&R | 1.41 × √(σ̂_S^2/(σ̂_E^2 + σ̂_O^2 + σ̂_SO^2)) | 1.41 × √((σ̂_S^2 + σ̂_SO^2)/(σ̂_E^2 + σ̂_O^2)) | ≥5 acceptable; <5 unacceptable |

The estimates of the variance components are as follows:

(4) σ̂_E^2 = V_E, σ̂_S(O)^2 = (V_S(O) − V_E)/r, σ̂_O^2 = (V_O − V_S(O))/(sr).
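The nested counterpart of the crossed estimation can be sketched analogously. The following function (again an illustrative sketch with names of our choosing) computes the mean squares of Table 2 for balanced nested data and applies the estimates of Equation (4):

```python
import numpy as np

def nested_anova_components(y):
    """Variance component estimates for a balanced nested design.

    y has shape (o, s, r): operators x samples-within-operator x replicates.
    Computes the mean squares of Table 2 and applies Equation (4);
    negative method-of-moments estimates are truncated at zero.
    """
    o, s, r = y.shape
    grand = y.mean()
    op_mean = y.mean(axis=(1, 2))     # per-operator means
    cell_mean = y.mean(axis=2)        # sample(operator) means

    V_O = s * r * np.sum((op_mean - grand) ** 2) / (o - 1)
    V_SO = r * np.sum((cell_mean - op_mean[:, None]) ** 2) / (o * (s - 1))
    V_E = np.sum((y - cell_mean[..., None]) ** 2) / (o * s * (r - 1))

    # Equation (4)
    return {
        "sigma2_E": V_E,
        "sigma2_S(O)": max((V_SO - V_E) / r, 0.0),
        "sigma2_O": max((V_O - V_SO) / (s * r), 0.0),
    }
```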

The goal of AGRR is to determine whether the measurement system can properly distinguish the variation of products or processes. To do that, the AGRR extracts the gauge error (repeatability) and the operator error (reproducibility) from the observed measurements and judges the adequacy via performance metrics. The most popular performance metrics in practice are % Study Variation, % Contribution, % Tolerance, and the number of distinct categories (NDC); Minitab, a popular software package in the quality field, provides all four metrics for AGRR. We summarize the calculation formulas and the relevant decision criteria of the metrics for the crossed design and the nested design in Table 3.
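For concreteness, the crossed-design formulas of Table 3 can be coded directly from the estimated variance components. This is a minimal sketch (the function and key names are ours); the tolerance is assumed to be given externally, for example as 6σ_S as in Section 4.

```python
import math

def grr_metrics(s2_E, s2_O, s2_SO, s2_S, tolerance):
    """The four crossed-design performance metrics of Table 3."""
    s2_rr = s2_E + s2_O + s2_SO   # repeatability + reproducibility (R&R)
    s2_T = s2_rr + s2_S           # total variation
    return {
        "%StudyVariation": math.sqrt(s2_rr / s2_T) * 100.0,  # std.-dev. ratio
        "%Contribution": s2_rr / s2_T * 100.0,               # variance ratio
        "%Tolerance": 5.15 * math.sqrt(s2_rr) / tolerance * 100.0,
        "NDC": 1.41 * math.sqrt(s2_S / s2_rr),
    }
```

For example, with σ̂_R&R^2 = 1 and σ̂_S^2 = 99, % Study Variation is exactly 10% and % Contribution is exactly 1%, the boundaries of the acceptable regions.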

According to the formulas in Table 3, given the same estimates of the variance components, we would expect all performance metrics of the nested design to be superior to those of the crossed design: in the crossed design, the measurement variation σ̂_R&R^2 includes the interaction effect σ̂_SO^2, which worsens the values. However, practical results oppose this theoretical analysis. Bergeret et al. claimed that % Contribution and % Tolerance of the nested design are overestimated compared with the crossed design. They investigated three case studies and argued that improper estimation of repeatability results in the overestimation of the performance metrics. In this situation, it is valuable to determine whether the AGRR, especially the nested design, is indeed an appropriate tool for determining the precision of the gauge and to investigate how accurate the AGRR is.

3. Previous Studies

In this section, we briefly review previous studies on AGRR. The mainstream theoretical developments include accurate and efficient approximation approaches for narrower confidence intervals of the variance components and performance metrics, methods for improving the accuracy of the AGRR, and methods for nonrepeatable measurements.

Montgomery and Runger [2, 3] introduced the AGRR as an alternative to conventional GRR methods such as the range method and the average and range method. They also suggested a proper experimental design for AGRR and the Satterthwaite approximation for confidence intervals of the variance components. Borror et al. compared two approximations for estimating the confidence interval of σ_R&R^2 in the two-way random effects model: restricted maximum likelihood estimation (REML) using SAS PROC MIXED and a modified large sample (MLS) method. They claimed that REML is superior to the MLS due to its narrower confidence interval. Burdick and Larsen compared five approximation approaches, that is, the MLS, the Satterthwaite approximation, the AIAG method, REML, and the Milliken and Johnson method, for five quantities, namely, σ_E^2, σ_O^2, σ_RPD^2, σ_R&R^2, and σ_S^2/σ_R&R^2. Their simulation results showed that the MLS is superior to the others since it satisfies the confidence coefficient in spite of its wider confidence interval. Dolezal et al. investigated the confidence interval of σ_R&R^2 for a two-way mixed effects model with fixed operators. They suggested the mixed effects model for a limited number of operators because the interval length is shorter than in the conventional random effects model. Hamada and Weerahandi proposed a modified generalized inference approximation for σ_R&R^2 and argued that it provides a shorter confidence interval than the MLS. Chiang also proposed an approximation for σ_R&R^2 using surrogate variables; he compared its confidence interval to the MLS and argued that it is an effective general method for the balanced random effects model. Daniels et al. employed the generalized confidence interval approach using a generalized pivotal quantity and stated that the approach is superior to the MLS if σ_O^2/σ_R&R^2 is less than or equal to 0.2. Wang and Li proposed a bootstrapping method that can estimate the confidence interval when the control chart GRR is applied.

As for the performance metrics and their confidence intervals, Burdick et al. stated that the confidence interval of σ_S^2/σ_R&R^2 is too wide to be used; hence, they recommended the Cochran method based on the Satterthwaite approximation. Chiang argued that the confidence coefficients of the MLS and Satterthwaite approximations for σ_S^2/σ_R&R^2 become low when σ_S^2 is less than 0.5. To overcome this, he suggested the F-screened MLS, which applies the MLS only when σ_S^2 is statistically significant. Burdick et al. reviewed previous research on AGRR and stated that the precision-to-tolerance ratio (% Tolerance), the signal-to-noise ratio (SNR), and the discrimination ratio (DR) are popular performance metrics for AGRR. Adamec and Burdick compared the performance of the MLS and the generalized inference procedure for the DR in a three-way random effects model. Burdick et al. proposed the generalized inference procedure for the misclassification rate. Woodall and Borror reviewed and analyzed the relationships among popular performance metrics: % Study Variation [1, 8], NDC [1, 8], SNR [1, 21], DR, and misclassification rates.

A few studies have addressed improving the accuracy of AGRR. Pan calculated the optimal set of (o, s, r) that provides the shortest confidence interval of σ_R&R^2 for various combinations of σ_O^2, σ_SO^2, and σ_E^2 under the same number of observations. Browne et al. proposed a two-stage AGRR to increase the adequacy of AGRR: at a baseline stage, a number of operators measure samples to obtain an appropriate range of samples; then, at the second stage, a standard AGRR is conducted with those samples. They argued that this approach yields smaller standard deviations for σ_R&R/σ_T and σ_O/σ_R&R than the normal one-stage AGRR. Pan et al. suggested a revised % Tolerance for the multivariate GRR that provides smaller mean squared error and mean absolute percentage error than the conventional % Tolerance; they also calculated the optimal (o, s, r) in terms of the new performance metric using principal component analysis.

Research on the nested design is rare. Bergeret et al. applied the nested design to three case studies with destructive samples and argued that the nested design overestimates % Study Variation and % Tolerance. Mast and Trip introduced four assumptions for AGRR: consistency of bias, homogeneity of measurement errors, temporal stability of objects, and robustness against measurement. They defined a nonrepeatable measurement as one for which the last two assumptions are not satisfied and proposed several alternative AGRRs suitable for various homogeneity assumptions. Van Der Meulen et al. developed a compensation method to mitigate the overestimation of the nested design for nonrepeatable measurements.

There are many practical studies on AGRR, but they mainly focus on introducing basic theoretical knowledge, suggesting systematic user guidelines, and providing case studies for various applications; we omit a detailed review of these references.

4. Experimental Setup

4.1. Procedure for Simulation Experiment

Adequacy implies the ability to perform the desired goal. The purpose of AGRR is to determine statistically whether the precision of a measurement system is sufficient. To verify the adequacy of AGRR, we require information on the population; in other words, the true values must be known. However, it is impossible to obtain complete data for a population. One alternative for overcoming this problem is simulation, and most existing studies applied a Monte Carlo simulation for verification. If an effect Q ~ N(0, σ_Q^2), then d_Q·V_Q/θ_Q ~ χ^2(d_Q), or V_Q ~ θ_Q·χ^2(d_Q)/d_Q, where d_Q is the degrees of freedom of Q and V_Q is its mean square. From this relationship, a population of V_Q can be generated using the Chi-squared distribution, and subsequently the estimates of the variance components and performance metrics can be calculated. This simulation approach, however, has two weaknesses. First, the normality assumption of the effects must be satisfied: if random sampling is limited by an inspection process, or samples or operators are selected arbitrarily to obtain better results, the population cannot be generated because the normality condition is broken. Second, the true values of the variance components and performance metrics are still unknown, so the adequacy of AGRR cannot be judged. Therefore, we propose a new Monte Carlo simulation for verifying the adequacy of AGRR. This approach generates populations of all effects instead of populations of the mean squares; an observation is then composed of effects sampled from those populations. Since the true variance components can be calculated from the populations, an intensive evaluation is possible. In addition, various realistic constraints, such as inspection, can be incorporated into the procedure without loss of generality. The detailed procedure of the proposed Monte Carlo simulation is given in Algorithm 1.

Algorithm 1

For 1 to N_scenario Do:
    Step 1. Assign a set of variance components (σ̃_O^2, σ̃_S^2, σ̃_SO^2, σ̃_E^2) according to the assigned experimental design in Tables 4 and 5.
    Step 2. Generate populations of effects of size S_pop from TN(0, σ̃_O^2), TN(0, σ̃_S^2), TN(0, σ̃_SO^2), and TN(0, σ̃_E^2), where TN denotes a truncated normal distribution bounded by [−3σ̃_Q, +3σ̃_Q] for Q ∈ {O, S, SO, E}.
    Step 3. Calculate the true variance components (σ_O^2, σ_S^2, σ_SO^2, σ_E^2) and the true values of the performance metrics from the populations of effects.
    For 1 to N_repeat Do:
        For 1 to N_experimental_design Do:
            Step 4. Generate observations.
                Step 4-1. Select samples from each population of effects according to the experimental design in Table 6.
                Step 4-2. Generate observations by the structural models of Equations (1) and (3), respectively.
            Step 5.
                Step 5-1. Perform the AGRR with the observations.
                Step 5-2. Calculate the variance components (σ̂_O^2, σ̂_S^2, σ̂_SO^2, σ̂_E^2).
                Step 5-3. Calculate the performance metrics for the crossed design and the nested design, respectively.
        End
        Step 6. Calculate the mean of the performance metrics for the crossed design and the nested design.
    End
End
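Steps 2 and 4 of Algorithm 1 can be sketched in code. The snippet below is a simplified illustration rather than the authors' exact implementation: it draws effects from truncated normal distributions via rejection sampling and assembles crossed-design observations by Equation (1). For brevity it samples fresh effects directly instead of first building the full 10,000-element populations; the function names and parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def truncated_normal(sigma2, size):
    """Draw from N(0, sigma2) truncated to [-3*sigma, +3*sigma] (Step 2)."""
    sigma = np.sqrt(sigma2)
    out = rng.normal(0.0, sigma, size)
    bad = np.abs(out) > 3.0 * sigma
    while bad.any():                      # rejection-resample the tails
        out[bad] = rng.normal(0.0, sigma, bad.sum())
        bad = np.abs(out) > 3.0 * sigma
    return out

def crossed_observations(o, s, r, var_O, var_S, var_SO, var_E, mu=0.0):
    """Steps 4-1/4-2 for the crossed design: assemble y by Equation (1)."""
    O = truncated_normal(var_O, o)[:, None, None]       # operator effects
    S = truncated_normal(var_S, s)[None, :, None]       # sample effects
    SO = truncated_normal(var_SO, (o, s))[..., None]    # interaction effects
    E = truncated_normal(var_E, (o, s, r))              # replicate errors
    return mu + O + S + SO + E

# One simulated data set: 3 operators, 9 samples, 2 replicates
y = crossed_observations(3, 9, 2, 1.0, 16.0, 1.0, 1.0)
```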

Levels of σ̃_S^2 and σ̃_O^2 + σ̃_SO^2 + σ̃_E^2 for population generation.

| σ̃_S^2 | σ̃_O^2 + σ̃_SO^2 + σ̃_E^2 |
| 8192 (2^13) | 3 |
| 4096 (2^12) | 3 |
| 2048 (2^11) | 3 |
| 1024 (2^10) | 3 |
| 512 (2^9) | 3 |
| 256 (2^8) | 3 |
| 128 (2^7) | 3 |
| 64 (2^6) | 3 |
| 32 (2^5) | 3 |
| 16 (2^4) | 3 |
| 8 (2^3) | 3 |
| 4 (2^2) | 3 |
| 2 (2^1) | 3 |
| 1 (2^0) | 3 |

Levels of σ̃_O^2, σ̃_SO^2, and σ̃_E^2 for population generation.

| σ̃_O^2 | σ̃_SO^2 | σ̃_E^2 |
| 1 | 1 | 1 |
| 0.8 | 1.4 | 0.8 |
| 1.4 | 0.8 | 0.8 |
| 0.8 | 0.8 | 1.4 |
| 1 | 1.4 | 0.6 |
| 0.6 | 1.4 | 1 |
| 1 | 0.6 | 1.4 |
| 1.4 | 0.6 | 1 |
| 1.4 | 1 | 0.6 |
| 0.6 | 1 | 1.4 |

Levels of the factors of the experimental design.

| Factor | Crossed design | Nested design |
| Number of operators | o_C ∈ {3, 5} | o_N = o_C |
| Number of samples (per operator) | s_C = o_C · s_N | s_N ∈ {3, 5} |
| Number of replicates | r_C ∈ {2, 4} | r_N = o_N · r_C |
| Number of observations | o_C · s_C · r_C | o_N · s_N · r_N |

At the beginning of each scenario, the levels of σ̃_S^2, σ̃_O^2, σ̃_SO^2, and σ̃_E^2 are assigned. The total number of scenarios N_scenario is 140, since there are 14 levels of σ̃_S^2 in Table 4 and 10 combinations of (σ̃_O^2, σ̃_SO^2, σ̃_E^2) in Table 5 for each level. The population of each effect is generated from a truncated normal distribution bounded by ±3σ̃_Q. In practice, an inspection process may filter out outlying products, so the truncated normal assumption is reasonable for actual samples; a normal distribution can be used if truncation is insignificant. The size of each population S_pop is 10,000, except for the interaction, which is 10,000 by 10,000. The true values of the performance metrics are computed by the crossed design formulas in Table 3 with the population parameters σ_S^2, σ_O^2, σ_SO^2, and σ_E^2. A set of observations is generated by Equations (1) and (3) with effects sampled from the populations; the structure of the observations follows the experimental design in Table 6, which Section 4.3 explains in more detail. In the next step, AGRRs using the crossed and nested designs estimate the variance components, where the significance level for deciding whether to pool the interaction is 0.05. For every AGRR, the performance metrics % Study Variation, % Contribution, % Tolerance, and NDC are calculated by the formulas in Table 3, where the Tolerance is 6σ̃_S. These steps are repeated N_repeat = 100 times.

4.2. Simulation Parameters

The sample variance σ_S^2 affects % Study Variation, % Contribution, and NDC among the four performance metrics. To investigate its effect, we set σ̃_S^2 from 2^0 (=1) to 2^13 (=8192) in Table 4, while fixing σ̃_O^2 + σ̃_SO^2 + σ̃_E^2 at three. The variances σ_O^2, σ_SO^2, and σ_E^2 affect all the performance metrics. To analyze their effects, ten orthogonal sets of (σ̃_O^2, σ̃_SO^2, σ̃_E^2) are designed as shown in Table 5. Since σ̃_O^2 + σ̃_SO^2 + σ̃_E^2 is fixed at three, the experimental design may seem limited. However, except for % Tolerance, the performance metrics depend not on the magnitudes but on the ratios among the variance components. From this perspective, our experimental design covers a total of 140 scenarios, and this number is sufficient.

4.3. Experimental Design for the Crossed and Nested Designs

Direct comparison of performance metrics between the crossed design and the nested design is meaningless because they are used in different environments. However, it is valuable to evaluate and compare their levels of adequacy. To do that, we designed two structures that use the same observations for the experiment; for example, in Figure 1, the same observations are used in both designs. The experimental design that maintains the same number of observations for the crossed and nested designs is as follows.

The numbers of operators, samples, and replicates also affect the performance metrics. To investigate their effects, we set up a 2^3 factorial design for the number of operators in the crossed design (o_C), the number of samples per operator in the nested design (s_N), and the number of replicates in the crossed design (r_C), as shown in Table 6. To match the number of observations, the number of operators in the nested design (o_N), the number of samples in the crossed design (s_C), and the number of replicates in the nested design (r_N) are set to o_C, o_C·s_N, and o_N·r_C, respectively.

5. Experimental Results

5.1. Performance Metrics

Figure 2 shows the trajectories of the averages of the four performance metrics over σ̃_S^2 for the crossed design, the nested design, and the population. The two dashed horizontal lines indicate the rule-of-thumb decision criteria of each performance metric, and regions I, II, and III represent the acceptable, pending, and unacceptable regions, respectively, based on the performance metrics of the population. Since σ̃_O^2 + σ̃_SO^2 + σ̃_E^2 and the tolerance are constant, the performance metrics of the crossed design are functions of σ̂_S^2. For the nested design, the performance metrics are also functions of σ̂_S^2, because averaging and the orthogonal design compensate for the individual effects of σ̂_O^2, σ̂_SO^2, and σ̂_E^2. In Figure 2, all average performance metrics of the crossed design are very close to the population values, that is, the true values, which implies that the adequacy of the crossed design is very high. However, the metrics of the nested design differ from the population values; in particular, the trajectory of the nested design for % Tolerance deviates from that of the population. This is caused by the overestimation of the reproducibility σ̂_O^2 (Section 5.4 elaborates on this). Moreover, the population shares the performance metric formulas with the crossed design: σ_SO^2 is nested into σ_S^2 in the nested design, whereas it is added to σ_R&R^2 in the crossed design and the population. In theory, the gap between the nested design and the population should decrease in region III, since σ̂_S^2 is much larger than σ̂_SO^2. Region III is important because the metrics of a good measurement system fall in this region. Therefore, we can conclude that the nested design does not provide correct results on gauge precision.

Average performance metrics over the sample variance.

5.2. Percentage of Correct Decision

To investigate the nested design further, we employed a new metric, the percentage of correct decisions (PCD): the percentage of runs in which the crossed or nested design reaches the same decision as the population. Figure 3 shows the PCD over σ̃_S^2 for each performance metric; the vertical dashed lines and the regions are equivalent to those in Figure 2. The PCD of the crossed design decreases to about 50% around the borderlines of the decision criteria and is close to 100% over the rest of the range of σ̃_S^2. This is reasonable, because even a small variation of a performance metric at a borderline results in a different decision. If the AGRR works correctly, the PCD must be close to 100% except at the borderlines. However, the PCD trajectories of the nested design drop rapidly to almost zero and do not recover toward 100% over a wide range of σ̃_S^2; in region III, they reach only about 60% (about 70% for NDC). This implies that the decision of the nested design is almost random.

Percentage of correct decision by performance index for all populations.
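The PCD computation reduces to comparing categorical decisions. A minimal sketch (the function names and the choice of % Study Variation as the decision rule are ours, following the criteria of Table 3):

```python
def decide_study_variation(value):
    """Three-way decision for % Study Variation (criteria of Table 3)."""
    if value <= 10.0:
        return "acceptable"
    if value <= 30.0:
        return "pending"
    return "unacceptable"

def pcd(metric_values, population_value):
    """Percentage of runs whose decision matches the population's decision."""
    truth = decide_study_variation(population_value)
    hits = sum(decide_study_variation(v) == truth for v in metric_values)
    return 100.0 * hits / len(metric_values)
```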

5.3. Effects of the Allocation of the Variance Components and the Experimental Design

In this subsection, we investigate the causes of the poor quality of the nested design from the perspective of the allocation of the variance components and the experimental design. Figure 4 shows % Study Variation for the allocations of the variance components (σ̃_O^2, σ̃_SO^2, σ̃_E^2) in Table 5, and Figure 5 shows % Study Variation for the experimental designs (o_C, s_C, r_C) and (o_N, s_N, r_N) in Table 6. As mentioned in Section 5.1, the metric should approach the population value as σ̃_S^2 increases. However, the gap for the nested design is still large in region III irrespective of the allocation of σ̂_O^2, σ̂_SO^2, and σ̂_E^2, while the gap for the crossed design is close to zero. The situation is very similar for the experimental design. In general, increasing the degrees of freedom improves the quality of the variance component estimates in AGRR and thus decreases the gap to the true value. However, in Figure 5, changing the experimental design does not significantly reduce the gap of the nested design; its adequacy remains very low in all regions. From these results, we conclude that neither the allocation of the variance components nor the experimental design is the critical cause of the poor adequacy of the nested design.

Average % Study Variation for the initial variance components.

Average % Study Variation for the experimental designs.

Histogram of sample repeatability from four different populations.

5.4. Robustness of AGRR

Next, we drew histograms of the repeatability and reproducibility estimates at four distinct values of σ̃_S^2, as shown in Figures 6 and 7, respectively. To eliminate side effects, we fixed the other parameters: (σ̃_O^2, σ̃_SO^2, σ̃_E^2) = (1, 1, 1), (o_C, s_C, r_C) = (3, 9, 2), and (o_N, s_N, r_N) = (3, 3, 6). For repeatability, as shown in Figure 6, both designs have histograms similar to a Chi-squared distribution at every σ̃_S^2, which coincides with the theoretical result. This is not so for reproducibility. For the crossed design, the shape of the histogram still resembles a Chi-squared distribution. However, as shown in Figure 7, the histograms of the nested design differ: they contain many zeros and spread widely over the whole range. The zeros of reproducibility make the effect of the operator statistically insignificant; hence, the results of the AGRR are unstable. This implies that the estimation of reproducibility in the nested design is inadequate, and it could be the main reason for the inadequacy of the design.

Histogram of sample reproducibility from four different populations.

Table 7 shows the estimates of the variance components at the fourteen levels of σ̃_S^2 for the population, the crossed design, and the nested design. All estimates of the crossed design are very close to those of the population. On the other hand, the AGRR of the nested design overestimates the reproducibility σ̂_O^2, while it properly estimates the repeatability and σ̂_S^2 + σ̂_SO^2. The overestimation of σ̂_O^2 grows with σ_S^2. Since several estimates of σ̂_O^2 are zero in Figure 7, this overestimation implies that the nonzero estimates are very large. In the nested design, an operator does not share samples with the other operators; therefore, the ANOVA cannot effectively separate the variability of the operator from that of the sample. That is, the variability of σ_S^2 leaks into the estimate σ̂_O^2, which inflates σ̂_O^2 and lowers the adequacy.

Table 7: Average estimates of variance components of the population, the crossed design, and the nested design. Columns, left to right: population (σE2, σO2, σSO2, σS2); crossed design (repeatability σ^E2, reproducibility σ^O2 + σ^SO2, σ^S2); nested design (repeatability σ^E2, reproducibility σ^O2, σ^S2 + σ^SO2).
0.978323 0.983632 0.973362 0.977343 0.991771 1.948983 0.992134 0.977858 1.066857 1.971704
0.967414 0.966841 0.973394 1.946821 0.981132 1.919912 1.94997 0.968178 1.109688 2.912493
0.975347 0.976984 0.973246 3.874717 0.987337 1.925811 3.895092 0.97622 1.266511 4.876734
0.970432 0.970866 0.973523 7.757004 0.984564 1.944843 7.806784 0.971788 1.63028 8.772708
0.97772 0.971914 0.973365 15.48564 0.995433 1.942836 15.483 0.982204 2.356923 16.44207
0.969531 0.972552 0.973272 30.93152 0.981883 1.940877 31.04326 0.970869 3.81615 31.86321
0.973194 0.97107 0.973456 62.02678 0.986751 1.937034 61.96536 0.975096 6.844492 62.9203
0.969944 0.973756 0.973213 124.708 0.980781 1.950806 124.9287 0.968457 12.68279 126.2736
0.968533 0.972231 0.973198 249.9384 0.982749 1.956499 251.2949 0.969171 25.92697 251.5965
0.969467 0.96702 0.973414 497.5466 0.983512 1.931515 496.9132 0.968052 50.91509 495.8767
0.972785 0.971262 0.973209 994.6194 0.98618 1.940338 1000.073 0.975168 99.09859 997.4504
0.9712 0.970518 0.973416 1999.386 0.984346 1.936489 1987.74 0.973287 189.851 1992.567
0.97488 0.967591 0.97352 3993.367 0.987316 1.933699 4007.995 0.974775 383.4533 4018.364
0.97684 0.975879 0.973405 7910.442 0.991661 1.943502 7960.486 0.979389 781.8932 7959.333
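The growth of the overestimation with σS2 can be reproduced with a short simulation. The sketch below (our own illustration under the same assumptions as the nested design above: true σO2 = 1, σSO2 = 1, σE2 = 1, sizes (3,3,6)) averages the zero-truncated ANOVA estimate of σ^O2 over many runs for several values of σS2; the positive bias of the truncated estimator scales with the sampling spread, which in turn grows with σS2.

```python
import numpy as np

rng = np.random.default_rng(1)
o, s, r = 3, 3, 6

def mean_var_O_hat(var_S, n=1000):
    """Average zero-truncated ANOVA estimate of sigma_O^2 in the
    nested design; the true sigma_O^2 is 1."""
    out = []
    for _ in range(n):
        O = rng.normal(0.0, 1.0, (o, 1, 1))
        S = rng.normal(0.0, np.sqrt(var_S + 1.0), (o, s, 1))  # S^2 + SO^2
        E = rng.normal(0.0, 1.0, (o, s, r))
        y = O + S + E
        cell = y.mean(axis=2)
        op = cell.mean(axis=1)
        ms_o = s * r * ((op - op.mean()) ** 2).sum() / (o - 1)
        ms_so = r * ((cell - op[:, None]) ** 2).sum() / (o * (s - 1))
        out.append(max(0.0, (ms_o - ms_so) / (s * r)))
    return float(np.mean(out))

results = {v: mean_var_O_hat(v) for v in (1, 16, 256)}
for v, m in results.items():
    print(f"sigma_S^2 = {v:3d}: mean estimate of sigma_O^2 = {m:.2f}")
```

The average estimate of σ^O2 stays near 1 for small σS2 but grows by an order of magnitude as σS2 increases, mirroring the pattern of the nested-design reproducibility column in Table 7.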
5.5. Evaluation of Robustness

The AGRR is based on sampling statistics, so even for the same population each run of the AGRR can yield different results. If an AGRR method is reliable and robust, the variance of the results should be small. To investigate the robustness of AGRR, we employed the symmetric mean absolute percentage error (sMAPE):(5)sMAPEk = (1/n) · ∑_{i=1}^{n} |σ^Q,i2 − σQ,p2| / ((σ^Q,i2 + σQ,p2)/2) · 100,where σ^Q,i2 denotes the estimate of Q from the ith set of observations, σQ,p2 is the true value of Q in the population, n is the number of repetitions, and k is the index of the scenario. sMAPE is a scale-independent variability measure, so its range is from −200% to 200%; however, since σ^Q,i2 and σQ,p2 are both nonnegative in our case, sMAPEk is nonnegative. The smaller the sMAPEk, the lower the variability. Figure 8 shows the averaged sMAPEk of repeatability, reproducibility, R&R, and σ^S2 for the crossed design or σ^S2 + σ^SO2 for the nested design. For the crossed design, all averaged sMAPEk are very stable. In contrast, the averaged sMAPEk of the nested design are unstable except for repeatability. The sMAPE of the reproducibility and, consequently, of the R&R is critical in regions II and III, where it exceeds 150%. This result implies that the reproducibility of the nested design is not robust.

Figure 8: Averaged sMAPEk of the crossed design and the nested design.
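The measure in (5) is straightforward to compute; the following sketch is our own minimal implementation (the function name `smape` and its arguments are not from the paper).

```python
import numpy as np

def smape(estimates, true_value):
    """Symmetric mean absolute percentage error of a set of
    variance-component estimates against the population value, as in (5).
    With nonnegative inputs the result lies in [0, 200]."""
    est = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(est - true_value)
                         / ((est + true_value) / 2.0)) * 100.0)
```

For example, a single estimate equal to the true value gives 0%, while an estimate of zero against a positive true value gives the maximum of 200%, which is why unstable nested-design reproducibility (many zero estimates) drives the averaged sMAPEk toward its upper bound.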

6. Conclusions

In this paper, we evaluated the adequacy of the AGRR from the perspective of practitioners. To this end, we designed a Monte Carlo simulation, which differs from conventional approaches but is close to the actual AGRR process. We compared the two main experimental structures, the crossed and nested designs, with respect to four popular performance metrics over various combinations of σS2, σO2, σSO2, and σE2. The experimental results show that the crossed design is adequate from all the evaluation perspectives: the average performance metrics, PCD, histogram shape, and sMAPE. However, the adequacy of the nested design is very low in all evaluation terms. We revealed that the inadequacy comes from the overestimation of σ^O2. We tried to resolve this problem by increasing the number of operators, but the problem remained. In conclusion, we strongly recommend not applying the nested design as a tool of AGRR unless a solution to this problem is found. Such a solution could be a compensation coefficient for σ^O2 or a different experimental design; this could be a topic of further research.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (NRF-2013R1A1A2006947).
