A randomized clinical experiment to compare two types of endotracheal tubes utilized a block design where each of the six participating anesthesiologists performed tube insertions for an equal number of patients for each type of tube. Five anesthesiologists intubated at least three patients with each tube type, but one anesthesiologist intubated only one patient per tube type. Overall, one type of tube outperformed the other on all three effectiveness measures. However, analysis of the data using an interaction model gave conflicting and misleading results, making the tube with the better performance appear to perform worse. This surprising result was caused by the undue influence of the data for the anesthesiologist who intubated only two patients. We therefore urge caution in interpreting results from interaction models with designs containing small blocks.
A clinical research investigation by Radesic et al. [
In the course of conducting this study, it turned out that some of the APs who were enlisted to participate were seldom available, while others were frequently available. In order to complete the investigation within an allotted time frame, the number of patients per AP was altered with more than ten patients for some APs and fewer than ten for others. Still, each AP had an even number of patients with half being randomized to each type of tube. One particular AP had only two patients, one per tube type. In the original analysis presented in Radesic et al. [
In this paper, we first provide additional details of both the design and the original analysis of the anesthesiology tube study by Radesic et al. We then illustrate the specific problem that arises when an interaction term is added to the statistical model. Finally, we discuss how the same problem could arise in many other situations where an interaction term is included in a model.
The purpose of the study by Radesic et al. [
Six APs and 60 patients participated in the study. In the modified design, one AP intubated 22 patients, another intubated 18 patients, three APs performed 6 intubations each, and one AP only performed two intubations. The APs were balanced with respect to the ETT type, in that half of each AP’s intubations were done with the PFT tube and half with the standard Mallinckrodt tube. In the original analysis [
The analysis presented in [
When the results were averaged for the 58 patients (aggregated over the five APs and the covariates), the PFT tube had lower (better) mean responses on each of the dependent variables. Likewise, for all three dependent variables, the adjusted means resulting from the model described above were lower for the PFT.
In this paper, we will do a similar analysis, this time using the data for all 60 patients and all six APs. To make our point in the most straightforward fashion, our analysis will exclude the two covariates. For the same reason, we will keep the dependent variables in their original units, rather than using log transformations. (The presence of covariates in the model or the use of transformed data does not change the essence of the results.)
Table
Intubation outcomes for the Parker FlexTip and standard tubes.
Dependent variables  Parker FlexTip  Standard ETT
Time for ETT insertion (sec)  10.9 (7.5)  12.4 (7.3) 
Number of redirections  0.7 (1.5)  1.3 (2.7) 
Difficulty of insertion rating  14.3 (14.9)  17.4 (19.7) 
Mean intubation outcomes for the Parker FlexTip and standard tubes for each of the six anesthesiology providers.
Provider  Dependent variable  Parker FlexTip  Standard ETT
AP#1  Time for ETT insertion (sec)  9.0  14.0
AP#1  Number of redirections  1.0  2.0
AP#1  Difficulty of insertion rating  16.7  19.0
AP#2  Time for ETT insertion (sec)  6.7  9.9
AP#2  Number of redirections  0.0  0.7
AP#2  Difficulty of insertion rating  3.7  11.5
AP#3  Time for ETT insertion (sec)  14.7  17.1
AP#3  Number of redirections  1.6  2.4
AP#3  Difficulty of insertion rating  21.7  31.8
AP#4  Time for ETT insertion (sec)  15.0  5.0
AP#4  Number of redirections  3.0  0.0
AP#4  Difficulty of insertion rating  60.0  7.0
AP#5  Time for ETT insertion (sec)  8.0  6.0
AP#5  Number of redirections  1.0  0.0
AP#5  Difficulty of insertion rating  13.3  11.7
AP#6  Time for ETT insertion (sec)  18.7  14.7
AP#6  Number of redirections  0.0  0.7
AP#6  Difficulty of insertion rating  14.2  3.0
First, consider the results of an additive model in which the factors are tube type (fixed) and anesthesiology provider (random). Such a model allows us to compare the tube types while adjusting for potential differences among the APs with respect to the dependent variables. For example, some APs could be faster at performing intubations than others. Variation in the dependent variables due to AP differences is then accounted for and removed from the “error term” used for comparing the tube types. Univariate two-way ANOVAs were run for each of the three dependent variables. According to ANOVA
Least squares adjusted means for each type of tube using an additive model or an interaction model.
Dependent variables  Additive model: Parker FlexTip  Additive model: Standard ETT  Interaction model: Parker FlexTip  Interaction model: Standard ETT
Time for ETT insertion (sec)  10.8  12.3  12.0  11.1 
Number of redirections  0.8  1.3  1.1  1.0
Difficulty of insertion rating  16.3  19.4  21.6  14.0 
In order to allow for the possibility that the differences between the tube types may vary among APs, an interaction term was added to the model. For example, some APs may tend to perform better with one tube while other APs do better with the other tube. Again, univariate ANOVAs were run for each of the three dependent variables, this time with the interaction term, tube type
The adjusted means shown in Table
We believe that the results obtained using the interaction model are misleading due to the undue influence of the results for the one AP who intubated only one patient with each type of ETT. Further, we were somewhat surprised by this, because the design was balanced in the sense that each AP used each ETT type the same number of times, meaning that the ETT and AP factors are orthogonal in the design matrix.
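The influence of the smallest block can be checked directly from the per-provider means. The sketch below assumes that the interaction-model adjusted means are the equally weighted averages of the six per-AP means, which is what the reported figures are consistent with, and that AP#4 is the provider with only two patients (an inference from the whole-number, extreme means in that block, not something stated in the tables). The names `TABLE2` and `equal_weight_means` are illustrative only.

```python
# Per-AP mean outcomes copied from the per-provider table:
# each pair is (Parker FlexTip, Standard ETT), listed in order AP#1 .. AP#6.
TABLE2 = {
    "Time for ETT insertion (sec)": [
        (9.0, 14.0), (6.7, 9.9), (14.7, 17.1), (15.0, 5.0), (8.0, 6.0), (18.7, 14.7),
    ],
    "Number of redirections": [
        (1.0, 2.0), (0.0, 0.7), (1.6, 2.4), (3.0, 0.0), (1.0, 0.0), (0.0, 0.7),
    ],
    "Difficulty of insertion rating": [
        (16.7, 19.0), (3.7, 11.5), (21.7, 31.8), (60.0, 7.0), (13.3, 11.7), (14.2, 3.0),
    ],
}

# Assumption: AP#4 (0-based index 3) is the provider with one patient per tube.
SMALL_BLOCK = 3


def equal_weight_means(cell_means, skip=None):
    """Average per-AP means with equal weight per AP.

    If `skip` is given, the AP at that 0-based index is left out.
    Returns (Parker FlexTip mean, Standard ETT mean), rounded to one decimal.
    """
    kept = [pair for i, pair in enumerate(cell_means) if i != skip]
    pft = sum(p for p, _ in kept) / len(kept)
    std = sum(s for _, s in kept) / len(kept)
    return round(pft, 1), round(std, 1)


for name, cells in TABLE2.items():
    with_all = equal_weight_means(cells)
    without_small = equal_weight_means(cells, skip=SMALL_BLOCK)
    print(f"{name}: all six APs {with_all}, without AP#4 {without_small}")
```

With all six APs, the equally weighted averages match the interaction-model columns of the adjusted-means table, where the standard tube looks better on all three measures. Leaving out the single two-patient block reverses the direction for every measure, which illustrates how much weight that block carries when each AP counts equally regardless of its number of patients.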
The misleading results obtained in the ETT study could arise in many similar situations. Here is a simple example to illustrate the problem in the context of a two-factor factorial analysis. Suppose that Factors A and B have
If
Hypothetical data for a two-factor study.
Factor A  Factor B level 1  Factor B level 2
Level 1  10, 11, 12, 11, 10  8, 6, 5, 7, 7
Level 2  9, 11, 12, 10  6, 7, 4, 5
Level 3  4  15
In this case, the raw means for the two levels of Factor B differ by 3.0 with the B1 mean higher than the B2 mean (Table
Raw and adjusted means for the two levels of B for the hypothetical data.
Means  Factor B level 1  Factor B level 2
Raw means  10.00  7.00 
Adjusted means; additive model  10.23  7.23 
Adjusted means; interaction model  8.43  9.03 
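The three rows of means for the hypothetical data can be reproduced with a short script. This is a minimal sketch, not the software used for the paper: it assumes the additive-model adjusted means are least squares means from an indicator-coded regression (A3 and B2 as reference levels), averaged with equal weight over the three levels of A, and that the interaction-model adjusted means are equally weighted averages of the cell means. All function names are illustrative.

```python
# Hypothetical data, keyed by (Factor A level, Factor B level).
DATA = {
    (1, 1): [10, 11, 12, 11, 10], (1, 2): [8, 6, 5, 7, 7],
    (2, 1): [9, 11, 12, 10],      (2, 2): [6, 7, 4, 5],
    (3, 1): [4],                  (3, 2): [15],
}


def mean(xs):
    return sum(xs) / len(xs)


def raw_means():
    """Pool all observations at each B level, so cells are weighted by size."""
    return tuple(
        mean([y for (a, b), ys in DATA.items() if b == lvl for y in ys])
        for lvl in (1, 2)
    )


def interaction_means():
    """Average the three cell means at each B level with equal weight."""
    return tuple(mean([mean(DATA[(a, lvl)]) for a in (1, 2, 3)]) for lvl in (1, 2))


def solve(m, v):
    """Solve m x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def additive_means():
    """Fit y = b0 + a1*[A=1] + a2*[A=2] + b1*[B=1] via the normal equations,
    then average the fitted values over the three A levels for each B level."""
    rows, ys = [], []
    for (a, b), obs in DATA.items():
        for y in obs:
            rows.append([1.0, float(a == 1), float(a == 2), float(b == 1)])
            ys.append(float(y))
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    b0, a1, a2, b1 = solve(xtx, xty)
    base = b0 + (a1 + a2) / 3  # A3 is the reference level, so its effect is 0
    return base + b1, base


def interaction_leverage():
    """In the cell-means (interaction) model each fitted value is its cell
    mean, so the leverage of an observation is 1 / (cell size)."""
    return {cell: 1 / len(obs) for cell, obs in DATA.items()}


print("raw        ", [round(m, 2) for m in raw_means()])
print("additive   ", [round(m, 2) for m in additive_means()])
print("interaction", [round(m, 2) for m in interaction_means()])
print("leverage   ", {c: round(h, 2) for c, h in interaction_leverage().items()})
```

The raw and additive-model means keep B1 above B2 by 3.0, whereas the interaction-model means reverse the order because the two single-observation A3 cells receive the same weight as the much larger A1 and A2 cells. Under the interaction model those two observations each have leverage 1, so the fitted values for their cells reproduce them exactly.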
Minitab’s General Linear Model ANOVA reproduces these adjusted means for both the additive and the interaction model. To its credit, Minitab also issues a warning in its output that the two observations for A3 have high leverage. To investigate this further, we performed regression analyses, which allowed us to assess the leverage and influence of the two data values for A3. To do this, we created indicator variables for A1, A2, and B1 and multiplicative interaction terms
There are many clinical studies, such as the ETT comparison described here, where allocation of patients to treatments may be blocked or stratified (see [
The authors declare that there is no conflict of interests regarding the publication of this paper.