Confidence interval approach for evaluating bias in laboratory methods

A statistically significant difference in mean values between two laboratory quantitation methods is interpreted as a bias. Sometimes such a difference is so minute that it is of no practical concern. An alternative approach is to test statistically whether the two methods are close enough, rather than exactly equal. This amounts to examining the confidence interval of the mean method difference and entails no additional statistical tests.


Introduction
When comparing a new laboratory quantitation method to a standard method, the new method should ideally give the same, if not more accurate, results as the standard method. Any systematic difference between the two methods is called a bias. The conventional statistical evaluation of bias is by testing the null hypothesis that the quantitation result from the new method equals that of the standard one. The methodology for comparing two means is given in most statistical references; for example, see Snedecor and Cochran, chapter 4 [1].
One concludes that the new method is biased if the test statistic is significant at the α-level, where α is the type I error chosen by the investigator, for example α = 0.05. The test statistic is usually of the form:

t = d/SE(d)   (1)

where d is the mean method difference and SE(d) is the standard error of d.
Sometimes a statistically significant difference, d, is so small in magnitude that it does not materially affect the quantitation of samples. Such a difference is not meaningful in laboratory practice.
Indeed, one can argue that because the new and standard methods are not identical, given enough samples, one can always demonstrate that there is a bias with the new method. The new method is penalized because of the high precision in our statistical evaluation process. An alternative approach in statistical evaluation is clearly needed in this instance.
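The conventional significance test of equation (1) can be sketched as follows, using hypothetical paired measurements of the same specimens by the two methods (the data, and the variable names, are illustrative assumptions, not from the original):

```python
import math

# Hypothetical paired results on the same specimens (illustrative only)
new_method = [4.1, 3.9, 5.0, 4.4, 4.7]   # new quantitation method
standard   = [4.0, 3.8, 4.9, 4.2, 4.6]   # standard method

diffs = [x - y for x, y in zip(new_method, standard)]
n = len(diffs)
d_bar = sum(diffs) / n                               # mean method difference d
var = sum((d - d_bar) ** 2 for d in diffs) / (n - 1)
se = math.sqrt(var / n)                              # standard error SE(d)

t_stat = d_bar / se                                  # test statistic of equation (1)
print(f"d = {d_bar:.3f}, SE(d) = {se:.3f}, t = {t_stat:.2f}")
```

The statistic is then compared to the critical t value with n − 1 degrees of freedom at the chosen α; a larger sample shrinks SE(d), which is why any real bias, however small, eventually becomes "significant".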

Method
The idea of testing two mean values for being similar but not necessarily identical is not new. In the pharmaceutical industry, two formulations of a drug are said to be bioequivalent if their mean values with respect to some clinical or pharmacokinetic parameter are close enough [2]. The same concept can be applied to comparing two laboratory quantitation methods: two methods are said to be 'equivalent' if their mean difference is less than a prescribed quantity, say H, which is called the maximum acceptable difference. H could be chosen from experience or be a value deemed practical by the investigator.
Statistically, one accepts the equivalence of two laboratory methods at the α-level if the (1 − α)·100% confidence interval of the mean method difference d, say (C1, C2), is completely contained in the interval (−H, H), i.e.
−H < C1 < d < C2 < H   (2)

If the method difference is assumed to be normally distributed, then the test statistic (equation [1]) follows a t-distribution, and the (1 − α)·100% confidence interval of d is:

(d − t(α/2, n−1)·SE(d), d + t(α/2, n−1)·SE(d))   (3)

where t(α/2, n−1) is the critical value of the t-distribution with n − 1 degrees of freedom. The above test procedure is also applicable when there is only one laboratory method. Sometimes one wants to test whether the laboratory method is accurate enough with respect to a known target value, say T. Let X be the mean value of the laboratory method; then d = X − T and SE(d) = SE(X), and the test procedure is the same as described above. For example, the laboratory method may be an HPLC assay measuring the recovery of a chemical entity. The natural target value (T) will be 100% in this instance.
The two-methods case can sometimes be reduced to a one-method situation. If the ratio, instead of the difference, of the two methods is being investigated and is assumed to be normally distributed, then X = the mean ratio and the target is again T = 100%. This situation is illustrated in the next section.

Example
To illustrate the test of equivalence, an example has been taken from Griffiths et al. [3]: the potassium levels of 21 patient serum specimens were analysed by Beckman Astra-8 and flame photometry methods. The Astra method was arbitrarily defined as the standard method, and the flame photometry as the new method. The ratio of the new method to the standard method (expressed in %) was assumed to be normally distributed and was used to evaluate bias of the new method. The raw data and the ratio are reproduced in table 1. The target value T = 100%. From table 1, d = 101.6% − 100% = 1.6% and SE(d) = 0.29%. In the conventional significance test approach, the test statistic d/SE(d) = 1.6/0.29 = 5.52 with 20 degrees of freedom was highly significant (the critical value at α = 0.05 and 20 degrees of freedom is t(0.025, 20) = 2.086), i.e. the flame photometry was biased in that its measurement was on average 1.6% higher than that of the Astra method.
However, if one feels that only method differences exceeding a certain level constitute a meaningful difference, then the test of equivalence approach is more appropriate. From equation (3), the 95% confidence interval of d is 1.6 ± 2.086 × 0.29, i.e. approximately (1.0, 2.2). Three maximum acceptable differences were considered:

(1) H = 0.5. The 95% confidence interval of d was completely outside the interval (−0.5, 0.5). The flame photometry method gave significantly higher results than the Astra method by 0.5% or more. The conclusion is the same as the conventional significance test in this instance, i.e. not equivalent.
Figure 1. The test of equivalence between the Astra and Flame methods for various maximum acceptable differences, H. { } is the 95% confidence interval of the mean difference d.
(2) H = 1.5. The 95% confidence interval of d overlapped the interval (−1.5, 1.5). There was inconclusive evidence to discern the equivalence of the two methods, one way or the other. It indicates that more samples are required to reach any conclusion at this significance level.
(3) H = 3.0. The 95% confidence interval of d was completely contained in the interval (−3, 3). The flame photometry method was equivalent to the Astra method in that the results of each differed from the other by less than 3%.
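The three comparisons above can be reproduced from the summary figures quoted in the text (d = 1.6%, SE(d) = 0.29%, t(0.025, 20) = 2.086); a short sketch, assuming only those published numbers:

```python
# Summary statistics from the Griffiths et al. example as quoted in the text
d_bar, se, t_crit = 1.6, 0.29, 2.086
c1, c2 = d_bar - t_crit * se, d_bar + t_crit * se   # 95% CI of d, equation (3)
print(f"95% CI of d: ({c1:.2f}, {c2:.2f})")          # approximately (1.00, 2.20)

for H in (0.5, 1.5, 3.0):
    if -H < c1 and c2 < H:
        verdict = "equivalent"       # CI inside (-H, H)
    elif c1 > H or c2 < -H:
        verdict = "not equivalent"   # CI outside (-H, H)
    else:
        verdict = "inconclusive"     # CI straddles a boundary
    print(f"H = {H}: {verdict}")
```

Running this yields "not equivalent" at H = 0.5, "inconclusive" at H = 1.5, and "equivalent" at H = 3.0, matching the three conclusions above.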

Discussion
One can argue that there is always a bias between two laboratory quantitation methods. A statistically significant difference between methods is of academic interest only, unless the magnitude of the difference also has a practical implication. A dogmatic application of the conventional significance test might not give a meaningful interpretation. In contrast, the test of equivalence provides a wider scope, and sometimes a more realistic approach, in comparing two methods. No additional statistical test is needed in this evaluation procedure.
The choice of the maximum acceptable difference, H, is crucial in the successful application of the test of equivalence. This is not a statistical decision, but one which must be determined from experience, or a value which is deemed meaningful from the practitioner's perspective.
When there is inconclusive evidence to detect whether two methods are equivalent, more samples are needed in the test. The methodology for determining the optimal sample size in a comparative experiment using the conventional significance test is well known [1], but the calculation is far more complex in the equivalence test context [4], and is beyond the scope of this note.
So far only the case where data are normally distributed has been illustrated. There is no difficulty in applying the equivalence test idea to data having other types of distribution so long as the confidence interval of the mean method difference can be obtained.