Measurement of Interobserver Disagreement: Correction of Cohen’s Kappa for Negative Values

As measures of interobserver agreement for both nominal and ordinal categories, Cohen’s kappa coefficients appear to be the most widely used with simple and meaningful interpretations. However, for negative coefficient values when (the probability of) observed disagreement exceeds chance-expected disagreement, no fixed lower bounds exist for the kappa coefficients and their interpretations are no longer meaningful and may be entirely misleading. In this paper, alternative measures of disagreement (or negative agreement) are proposed as simple corrections or modifications of Cohen’s kappa coefficients. The new coefficients have a fixed lower bound of −1 that can be attained irrespective of the marginal distributions. A coefficient is formulated for the case when the classification categories are nominal and a weighted coefficient is proposed for ordinal categories. Besides coefficients for the overall disagreement across categories, disagreement coefficients for individual categories are presented. Statistical inference procedures are developed and numerical examples are provided.


Introduction
When two (or more) observers are independently classifying observations or items (objects) into the same set of $I$ mutually exclusive and exhaustive categories, it may be of interest to have a summary description of the extent to which the observers agreed in their classifications. The total probability (proportion) of agreement is one such obvious summary measure. However, since some agreement is to be expected purely by chance, Cohen [1] introduced the kappa coefficient of agreement as one that corrects for the chance-expected agreement. Cohen's kappa has since become widely used in a variety of situations and has been discussed extensively in various textbooks (e.g., [2][3][4][5]) and a wide variety of journal publications (e.g., [6][7][8][9][10]).
In order to define the kappa coefficient in terms of probabilities (proportions), let $p_{ij}$ be the probability that a random observation is assigned to category $i$ by Observer 1 and to category $j$ by Observer 2 for $i = 1, \ldots, I$ and $j = 1, \ldots, I$. Furthermore, let $p_{i+}$ denote the probability that a randomly chosen observation is assigned to category $i$ by Observer 1 and $p_{+j}$ the probability that a randomly chosen observation is assigned to category $j$ by Observer 2 ($i, j = 1, \ldots, I$). If these probabilities are represented in terms of a two-way contingency table with rows $i = 1, \ldots, I$ and columns $j = 1, \ldots, I$, then $p_{ij}$ becomes the probability in cell $(i, j)$, $\{p_{i+}\}$ becomes the marginal row distribution, and $\{p_{+j}\}$ becomes the marginal column distribution. With the row categories and the column categories being the same, $\sum_{i=1}^{I} p_{ii}$ is the total probability of agreement between the two observers. Cohen [1] used the overall statistical independence as the condition for chance agreement and defined $\kappa$ as

$$\kappa = \frac{P_{AO} - P_{AC}}{1 - P_{AC}}, \qquad P_{AO} = \sum_{i=1}^{I} p_{ii}, \qquad P_{AC} = \sum_{i=1}^{I} p_{i+} p_{+i} \tag{1}$$

with $P_{AO}$ and $P_{AC}$ being the observed agreement probability and the chance-expected agreement probability, respectively. In terms of the observed and chance-expected disagreement probabilities $P_{DO} = 1 - P_{AO}$ and $P_{DC} = 1 - P_{AC}$,

$$\kappa = 1 - \frac{P_{DO}}{P_{DC}}. \tag{2}$$

It is clear from (1)-(2) that $\kappa = 1$ if the interobserver agreement is perfect, that is, if $P_{AO} = 1$ ($P_{DO} = 0$), $\kappa = 0$ if $P_{AO} = P_{AC}$ ($P_{DO} = P_{DC}$), and $\kappa < 0$ if $P_{AO} < P_{AC}$ ($P_{DO} > P_{DC}$). The case of negative $\kappa$-values will be discussed further in the next section.
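As a minimal sketch of the definition above, the following Python function computes kappa from a square table of joint probabilities, using the agreement form (observed minus chance-expected agreement, divided by one minus chance-expected agreement). The function name `cohen_kappa` and the 2 × 2 table are illustrative assumptions and are not taken from the paper.

```python
def cohen_kappa(p):
    """Cohen's kappa from a square table of joint probabilities p[i][j]."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]                        # p_{i+}
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]   # p_{+j}
    p_ao = sum(p[i][i] for i in range(k))                      # observed agreement
    p_ac = sum(row[i] * col[i] for i in range(k))              # chance-expected agreement
    return (p_ao - p_ac) / (1.0 - p_ac)


# Hypothetical 2 x 2 table of probabilities (rows: Observer 1, columns: Observer 2)
table = [[0.40, 0.10],
         [0.20, 0.30]]
kappa = cohen_kappa(table)
print(f"kappa = {kappa:.4f}")
```

The disagreement form, one minus the ratio of observed to chance-expected disagreement, gives the same value, since both disagreement probabilities are simply the complements of the agreement probabilities.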
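To account for the potential fact that some disagreements may be more serious than others, as when the $I$ categories have a natural order, Cohen [13] and Cicchetti and Allison [14] independently introduced the weighted kappa $\kappa_w$, which can be expressed as

$$\kappa_w = \frac{\sum_{i=1}^{I}\sum_{j=1}^{I} v_{ij} p_{ij} - \sum_{i=1}^{I}\sum_{j=1}^{I} v_{ij} p_{i+} p_{+j}}{1 - \sum_{i=1}^{I}\sum_{j=1}^{I} v_{ij} p_{i+} p_{+j}} \tag{5}$$

$$\kappa_w = 1 - \frac{\sum_{i=1}^{I}\sum_{j=1}^{I} w_{ij} p_{ij}}{\sum_{i=1}^{I}\sum_{j=1}^{I} w_{ij} p_{i+} p_{+j}} \tag{6}$$

where each weight $w_{ij} \in [0, 1]$, with $w_{ii} = 0$ and $v_{ij} = 1 - w_{ij}$ for all $i$ and $j$ and with the following logical weight choices (e.g., [2, page 609]):

$$w_{ij} = \frac{|i - j|}{I - 1} \quad \text{or} \quad w_{ij} = \frac{(i - j)^2}{(I - 1)^2}, \qquad i, j = 1, \ldots, I. \tag{7}$$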
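The weighted coefficient can be sketched in Python as follows, using the disagreement form (one minus the ratio of the observed to the chance-expected weighted disagreement). The helper names, the `power` argument (1 for absolute-difference weights, 2 for squared-difference weights), and the 3 × 3 table are assumptions made for illustration only.

```python
def disagreement_weights(k, power=1):
    """Weights w_ij = (|i - j| / (k - 1)) ** power, with w_ii = 0 and w_ij in [0, 1]."""
    return [[(abs(i - j) / (k - 1)) ** power for j in range(k)] for i in range(k)]


def weighted_kappa(p, w):
    """Weighted kappa: 1 - (observed weighted disagreement) / (chance weighted disagreement)."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    observed = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    chance = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - observed / chance


# Hypothetical 3 x 3 table with ordered categories (not from the paper)
table = [[0.20, 0.10, 0.00],
         [0.10, 0.20, 0.10],
         [0.00, 0.10, 0.20]]
kw_linear = weighted_kappa(table, disagreement_weights(3, power=1))
kw_quadratic = weighted_kappa(table, disagreement_weights(3, power=2))
print(f"linear: {kw_linear:.4f}, quadratic: {kw_quadratic:.4f}")
```

With two categories the only off-diagonal weight equals 1, so the weighted and unweighted coefficients coincide.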
For a specific category $i$, Kvålseth [15] proposed the weighted measure $\kappa_{wi}$ in (8) as an extension of (4), with $D_i$ denoting the set of all disagreement cells for category $i$. The values of these weighted measures equal 1 for perfect agreement and 0 if observed agreement equals chance agreement, with negative values if observed agreement is less than chance agreement. The kappa coefficients in (1)-(8) may be appropriate measures of agreement when their values are nonnegative, but not when their values are negative, as discussed in the next section. From a theoretical point of view at least, it is certainly troublesome that their negative values lack appropriate meaning and validity. This paper presents simple corrections or modifications of the kappa coefficients in (1)-(8) such that the negative values of the corrected coefficients provide an appropriate representation of the extent to which the observers disagree. Statistical inference procedures for the new coefficients or measures are developed. Numerical examples are also given.

Comments on Kappa
One of the most appealing properties of kappa, and undoubtedly a reason for its popularity, is its simplicity and transparency. All the kappa coefficients in (1)-(8) have intuitively appealing and meaningful interpretations. In the case of $\kappa$ in (1)-(2), for example, it seems most meaningful to interpret any $\kappa$-value in terms of (2) as the proportional difference between $P_{DC}$ and $P_{DO}$, that is, the relative extent to which the observed disagreement probability $P_{DO}$ is less than the disagreement probability $P_{DC}$ attributable to chance. By comparison, the norming used in (1) is not unique: any number of different denominators could be used in place of $1 - P_{AC}$ as long as the ratio with numerator $P_{AO} - P_{AC}$ does not exceed 1 [16].
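Note that the two expressions for $\kappa$ in (1)-(2) are weighted arithmetic means of the expressions for $\kappa_i$ in (3)-(4). Thus, from (1) and (3), for instance, it is seen that

$$\kappa = \frac{\sum_{i=1}^{I} (\bar{p}_i - p_{i+} p_{+i})\, \kappa_i}{\sum_{i=1}^{I} (\bar{p}_i - p_{i+} p_{+i})}, \qquad \bar{p}_i = \frac{p_{i+} + p_{+i}}{2}.$$

Similarly, for the weighted measures in (6) and (8), $\kappa_w$ is a weighted arithmetic mean of the $\kappa_{wi}$. In order to show that the interobserver agreement for a specific category $i$ can be determined directly from (3)-(4), without the need to collapse the original $I \times I$ table as suggested by Spitzer et al. [11], consider that the original $I \times I$ table with probability components $p_{ii}$, $p_{i+}$, and $p_{+i}$ for category $i$ is collapsed into the following 2 × 2 table:

$$\begin{array}{c|cc|c} & i & \text{not } i & \text{Total} \\ \hline i & p_{ii} & p_{i+} - p_{ii} & p_{i+} \\ \text{not } i & p_{+i} - p_{ii} & 1 - p_{i+} - p_{+i} + p_{ii} & 1 - p_{i+} \\ \hline \text{Total} & p_{+i} & 1 - p_{+i} & 1 \end{array} \tag{12}$$

When (12) is substituted into (1), $\kappa_i$ in (3) results immediately. However, no such corresponding procedure applies to $\kappa_w$ in (6) and $\kappa_{wi}$ in (8). Note that, for $I = 2$, $\kappa_1 = \kappa_2 = \kappa$ and $\kappa_{w1} = \kappa_{w2} = \kappa_w$.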
In spite of its wide appeal, kappa is not without some criticism or controversy, especially related to its dependence on the marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$ (see, e.g., [4, pages 168-173]). The chance agreement (disagreement) for all the kappa coefficients in (1)-(8) is based on the marginal distributions. If those distributions are highly uneven (nonuniform) and nearly symmetric, the values of the kappa coefficients may become unreasonably small due to the relatively large chance agreements.
A clear limitation of the kappa coefficients relates to situations when the values of those coefficients become negative and lack meaningful interpretations. This limitation has generally been ignored in published studies, partly perhaps because such studies using kappa have typically involved positive kappa values. Negative kappa values could, however, lead to incorrect interpretations, results, and conclusions. Also, if, for instance, $\kappa > 0$ in (1)-(2), it is possible that some $\kappa_i < 0$ in (3)-(4).
For the overall kappa in (1)-(2), when $P_{AO} < P_{AC}$ so that $\kappa < 0$, $\kappa$ has no reasonable meaning in terms of (1), but $-\kappa$ does in terms of (2); that is, $-\kappa$ is the relative extent to which $P_{DO}$ exceeds $P_{DC}$. The same argument applies to $\kappa_i$ in (3)-(4). However, two serious limitations of all the kappa coefficients are that, for negative values, (a) the coefficients have no fixed lower bounds, making it impossible to appropriately assess the size or magnitude of coefficient values, and (b) the coefficients take on negative values that do not appear reasonable, as discussed below.
The minimum values $-P_{AC}/(1 - P_{AC})$ of $\kappa$ in (1) and $-p_{i+} p_{+i}/(\bar{p}_i - p_{i+} p_{+i})$ of $\kappa_i$ in (3), where $\bar{p}_i = (p_{i+} + p_{+i})/2$, depend exclusively on the marginal distributions $\{p_{i+}\}$ and $\{p_{+j}\}$. Values such as $\kappa = -0.4$ or $\kappa_i = -0.2$ are uninformative since they cannot be related to any fixed lower bounds on $\kappa$ or $\kappa_i$ such as −1, irrespective of the marginal distributions. There is no basis for making any interpretation or statement such as $\kappa = -0.5$ indicating a "moderate," "low," or "high" level of disagreement between the two observers.
There is also some confusion in the literature about the minimum value of $\kappa$, with some stating that the minimum value is $-\infty$ or $-P_{AO}/(1 - P_{AO})$ [5, page 4] and others stating that it is −1 when $p_{i+} = p_{+j} = 1/I$ for all $i$ and $j$ [17, page 113]. Such statements are clearly incorrect. In fact, the minimum value $\kappa = -P_{AC}/(1 - P_{AC})$ equals −1 if, and only if, $P_{AC} = 0.5$. Similarly, the minimum value of $\kappa_i$ in (3) equals −1 only when the harmonic mean $2 p_{i+} p_{+i}/(p_{i+} + p_{+i})$ of $p_{i+}$ and $p_{+i}$ equals 0.5.
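The dependence of the lower bound on the marginal distributions can be illustrated with a short Python sketch. It evaluates the minimum value, minus the chance-expected agreement divided by one minus the chance-expected agreement, for two hypothetical pairs of marginal distributions. The function name and the marginals are assumptions for illustration, and the bound is taken to be attainable, that is, the marginals are assumed to allow a table with zero observed agreement.

```python
def kappa_lower_bound(row, col):
    """Minimum of Cohen's kappa for given marginals: -P_AC / (1 - P_AC)."""
    p_ac = sum(r * c for r, c in zip(row, col))   # chance-expected agreement
    return -p_ac / (1.0 - p_ac)


# Even marginals: chance-expected agreement is 0.5, so the bound is -1
even = kappa_lower_bound([0.5, 0.5], [0.5, 0.5])

# Uneven marginals: chance-expected agreement is 0.18, so the bound is far above -1
uneven = kappa_lower_bound([0.9, 0.1], [0.1, 0.9])

print(f"even marginals: {even:.4f}, uneven marginals: {uneven:.4f}")
```

In both cases the observers could disagree on every single item, yet the smallest attainable kappa differs markedly between the two sets of marginals.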
What is needed are chance-corrected measures of disagreement, both weighted and unweighted, which have fixed lower bounds of −1 and which are attainable irrespective of the marginal distributions. This requirement has also been clearly emphasized by others [18]. Such measures will be introduced in the next section as simple corrections or modifications of the existing kappa coefficients.

Proposed Kappa Coefficients of Disagreement
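3.1. Overall Coefficients. When $P_{AO} < P_{AC}$ and hence $P_{DO} > P_{DC}$, it seems most logical and intuitive to define negative overall kappa as

$$\kappa^- = -\frac{P_{DO} - P_{DC}}{1 - P_{DC}} = -\left(1 - \frac{P_{AO}}{P_{AC}}\right) \tag{13}$$

where $P_{AO}$, $P_{AC}$, $P_{DO}$, and $P_{DC}$ are the probabilities defined in (1)-(2). Consequently, the overall coefficient in (14) equals $\kappa$ when $P_{AO} \geq P_{AC}$ and $\kappa^-$ when $P_{AO} \leq P_{AC}$, where, of course, $\kappa = \kappa^- = 0$ for $P_{AO} = P_{AC}$. Except for the minus sign, $\kappa^-$ in (13) follows from $\kappa$ in (1)-(2) by simply substituting disagreement probabilities for the corresponding agreement probabilities.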
The properties of $\kappa^-$ can be summarized as follows:
(P1) $\kappa^-$ is well defined if at least two cells of the contingency table contain nonzero probabilities.
(P4) $|\kappa^-|$ has a meaningful interpretation as the relative extent to which the observed agreement probability is less than that expected by chance alone.
(P5) $\kappa^-$ takes on values that appear reasonable throughout its 0 to −1 range.
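To make the fixed lower bound concrete, the following Python sketch evaluates the disagreement coefficient as the ratio of observed to chance-expected agreement minus one, alongside Cohen's kappa, for a hypothetical table in which the observers never agree. The function names and the table are illustrative assumptions.

```python
def agreement_probs(p):
    """Return (observed agreement, chance-expected agreement) for a square table."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    p_ao = sum(p[i][i] for i in range(k))
    p_ac = sum(row[i] * col[i] for i in range(k))
    return p_ao, p_ac


def kappa_negative(p):
    """Disagreement coefficient P_AO / P_AC - 1, intended for P_AO < P_AC."""
    p_ao, p_ac = agreement_probs(p)
    return p_ao / p_ac - 1.0


def cohen_kappa(p):
    p_ao, p_ac = agreement_probs(p)
    return (p_ao - p_ac) / (1.0 - p_ac)


# Hypothetical table with zero observed agreement and uneven marginals
total_disagreement = [[0.0, 0.9],
                      [0.1, 0.0]]
k_neg = kappa_negative(total_disagreement)   # reaches the fixed bound of -1
k_cohen = cohen_kappa(total_disagreement)    # bound depends on the marginals
print(f"kappa_minus = {k_neg:.4f}, Cohen's kappa = {k_cohen:.4f}")
```

The same complete disagreement yields the bound of −1 for the corrected coefficient, whereas Cohen's kappa stays well above −1 because of the uneven marginals.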
While Properties (P1)-(P4) are immediately apparent from the definition in (13), Property (P5) needs an explanation. This can most simply be done for the $I = 2$ category case and without undue loss of generality since, for any data set with $I > 2$, there exists an equivalent 2 × 2 table with the same $\kappa^-$-value. Therefore, one may consider a 2 × 2 table such as the one in Table 1 with the marginal probabilities $r$ and $1 - r$ ($0 \leq r \leq 1$). The first two entries in each cell correspond to the cases when $\kappa^- = -1$ and 0, respectively, while the third entry equals the weighted arithmetic mean of the other two entries with weights $\alpha$ and $1 - \alpha$ ($0 \leq \alpha \leq 1$).
In order for the values of $\kappa^-$ to be considered reasonable throughout the $[-1, 0]$-interval, the only logical condition would clearly seem to be that the value of $\kappa^-$ for the weighted mean cell probabilities should equal the weighted mean value of $\kappa^-$ for the other cell probabilities with the same weights $\alpha$ and $1 - \alpha$; that is,

$$\kappa^- = \alpha(-1) + (1 - \alpha)(0) = -\alpha. \tag{15}$$

By substituting the expressions for the mean cell probabilities from Table 1 into $\kappa^-$ in (13), it is seen that $\kappa^-$ does meet the condition in (15), irrespective of the marginal probabilities $r$ and $1 - r$. This assumes, of course, as with Cohen's $\kappa$, that chance agreement (disagreement) based on the marginal probabilities is reasonable.
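The condition can be checked numerically with the Python sketch below. It assumes the 2 × 2 layout implied by the text: row marginals $r$ and $1 - r$, column marginals $1 - r$ and $r$, a total-disagreement table with all probability off the diagonal, and a chance table equal to the product of the marginals. The mixing weight is called `alpha` here purely as a naming assumption, and the function names are hypothetical.

```python
def kappa_pair(p):
    """Return (Cohen's kappa, kappa_minus) for a square probability table."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    p_ao = sum(p[i][i] for i in range(k))
    p_ac = sum(row[i] * col[i] for i in range(k))
    return (p_ao - p_ac) / (1.0 - p_ac), p_ao / p_ac - 1.0


def mixed_table(r, alpha):
    """alpha * (total-disagreement table) + (1 - alpha) * (chance table)."""
    extreme = [[0.0, r], [1.0 - r, 0.0]]
    chance = [[r * (1.0 - r), r * r], [(1.0 - r) ** 2, (1.0 - r) * r]]
    return [[alpha * extreme[i][j] + (1.0 - alpha) * chance[i][j]
             for j in range(2)] for i in range(2)]


for r in (0.5, 0.8, 0.95):
    kappa, kappa_minus = kappa_pair(mixed_table(r, alpha=0.3))
    print(f"r = {r}: kappa = {kappa:.4f}, kappa_minus = {kappa_minus:.4f}")
```

Under these assumptions the corrected coefficient equals minus the mixing weight for every `r`, while Cohen's kappa shrinks toward 0 as the marginals become more uneven.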
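By contrast, substituting the mean probabilities from Table 1 into $\kappa$ in (1)-(2) gives

$$\kappa = -\alpha \left( \frac{2r(1 - r)}{1 - 2r(1 - r)} \right) \tag{16}$$

showing the strong dependence of $\kappa$ on the marginal probabilities. The parenthetical term in (16) equals 1 if $r = 0.5$ and approaches 0 as the marginal distributions become highly uneven or nonuniform (i.e., as $r$ approaches 0 or 1).

When $\kappa_w < 0$ in (5)-(6) and hence $\sum_i \sum_j w_{ij} p_{ij} > \sum_i \sum_j w_{ij} p_{i+} p_{+j}$, and with the sets of weights $\{v_{ij}\}$ and $\{w_{ij}\}$ as defined in (7), the following weighted negative kappa is proposed:

$$\kappa_w^- = -\frac{\sum_i \sum_j w_{ij} p_{ij} - \sum_i \sum_j w_{ij} p_{i+} p_{+j}}{1 - \sum_i \sum_j w_{ij} p_{i+} p_{+j}} = -\left(1 - \frac{\sum_i \sum_j v_{ij} p_{ij}}{\sum_i \sum_j v_{ij} p_{i+} p_{+j}}\right) \tag{17}$$

and hence the overall weighted coefficient in (18) equals $\kappa_w$ when $\kappa_w \geq 0$ and $\kappa_w^-$ otherwise. Except for the minus sign, $\kappa_w^-$ in (17) follows from (5)-(6) by simply substituting $\{w_{ij}\}$ for $\{v_{ij}\}$ in (5) and $\{v_{ij}\}$ for $\{w_{ij}\}$ in (6).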
$\kappa_w^-$ is well defined if at least two cells of the $I \times I$ table contain nonzero probabilities. It is also apparent from (17) that $\kappa_w^-$ takes on values between 0 and −1, inclusive, with $\kappa_w^- = 0$ if $p_{ij} = p_{i+} p_{+j}$ for all $i$ and $j$ (as a sufficient but not necessary condition). Also $\kappa_w^- = -1$ if, and only if, $p_{ij} = 0$ for all $i$ and $j$ except for $i = 1$ and $j = I$ and for $i = I$ and $j = 1$, that is, when the only nonzero probabilities occur in the corner cells $(1, I)$ and $(I, 1)$ and the weights are of the form in (7). These properties of $\kappa_w^-$ all appear to be reasonable. By contrast, if $p_{1I} \neq 0$, $p_{I1} \neq 0$, and all other $p_{ij} = 0$, $\kappa_w$ in (5)-(6) becomes $\kappa_w = -2 p_{1I} p_{I1}/(1 - 2 p_{1I} p_{I1})$, which equals −1 only if $p_{1I} = p_{I1} = 0.5$. Otherwise, the value of $\kappa_w$ increases as $p_{1I}$ and $p_{I1}$ become increasingly different, approaching 0 as $|p_{1I} - p_{I1}|$ approaches 1. Such behavior of $\kappa_w < 0$ makes any reasonable interpretation of negative $\kappa_w$ values impossible and meaningless.
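The corner-cell behavior described above can be reproduced with a short Python sketch. It assumes absolute-difference weights and computes the weighted disagreement coefficient as the ratio of the observed to the chance-expected weighted agreement (using the complementary weights) minus one. The function names and the 3 × 3 table are illustrative assumptions.

```python
def linear_weights(k):
    """Disagreement weights w_ij = |i - j| / (k - 1)."""
    return [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def weighted_sums(p, w):
    """Observed and chance-expected weighted sums for a given weight matrix."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    observed = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    chance = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return observed, chance


def weighted_kappa(p, w):
    observed, chance = weighted_sums(p, w)
    return 1.0 - observed / chance


def weighted_kappa_negative(p, w):
    v = [[1.0 - x for x in r] for r in w]        # agreement weights v_ij = 1 - w_ij
    observed, chance = weighted_sums(p, v)
    return observed / chance - 1.0


# Hypothetical table: all probability in the two corner cells, unevenly split
corners = [[0.0, 0.0, 0.8],
           [0.0, 0.0, 0.0],
           [0.2, 0.0, 0.0]]
w = linear_weights(3)
kw = weighted_kappa(corners, w)
kw_neg = weighted_kappa_negative(corners, w)
print(f"kappa_w = {kw:.4f}, kappa_w_minus = {kw_neg:.4f}")
```

Here the corrected coefficient reaches −1, while Cohen's weighted kappa does not, because the two corner probabilities differ.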
In terms of weights $v_{ij} = 1 - w_{ij}$, with the types of $w_{ij}$ as in (7), the proposed specific-category weighted kappa coefficient $\kappa_{wi}^-$ may be defined as in (21), where, as always, the first subscript refers to the table row and the second subscript to the table column. Note that the component for cell $(i, i)$ appears twice in $\kappa_{wi}^-$. Note also that, analogous to (20), $\kappa_w^-$ in (17) is the weighted arithmetic mean of $\kappa_{w1}^-, \ldots, \kappa_{wI}^-$ in (21) with weights based on the denominator in (21) for $i = 1, \ldots, I$. It is apparent from (21) that, for the weights in (7), the lower bound of −1 is attained when $p_{1I}$ and $p_{I1}$ are the only nonzero probabilities in the table.

Statistical Inferences
Consider now that the coefficients (measures) discussed above are all sample estimates (and estimators) based on the sample probabilities $p_{ij} = n_{ij}/N$ ($i, j = 1, \ldots, I$) with frequencies (counts) $n_{ij}$ and sample size $N = \sum_{i=1}^{I} \sum_{j=1}^{I} n_{ij}$. The $p_{ij}$'s are maximum likelihood estimates (and estimators) of the unknown population probabilities $\pi_{ij}$ ($i, j = 1, \ldots, I$) on which the corresponding population coefficients are based, such as the population counterpart of the coefficient in (6). It may then be of interest to make statistical inferences about the population coefficients corresponding to the sample coefficients discussed above.
Such statistical inferences would probably be most meaningful in terms of the construction of confidence intervals for the overall kappa coefficients in (14) and (18). The inference procedure necessarily needs to be approximate, for reasonably large sample size $N$, and be based on the delta method (e.g., [19, Chapter 14]) or resampling methods such as the bootstrap and the jackknife (e.g., [20, 21]). The delta method is chosen in this paper. By developing the procedure based on the $\kappa_w$ expression in (6), the procedures for $\kappa_w^-$ in (17), $\kappa^-$ in (13), and $\kappa$ in (1) follow as special cases by the appropriate selection of the set of weights $\{w_{ij}\}$. Fleiss et al. [2] gave the estimated large sample variance of $\kappa_w$ based on the expression in (5) without presenting any intermediate steps. Instead, the expression in (6) will be used here as being more convenient and some of the important intermediate steps will be presented.
Then, letting $\kappa_w$ in (6) denote both the sample estimate and estimator of the corresponding population coefficient $\kappa_w(\{\pi_{ij}\})$ (based on population probabilities $\pi_{ij}$, $\pi_{i+}$, and $\pi_{+j}$ for $i = 1, \ldots, I$ and $j = 1, \ldots, I$), it follows from the delta method that, under multinomial sampling (when the $I$ categories and the sample size $N$ are a priori fixed), the estimator $\kappa_w$ is approximately normally distributed with mean $\kappa_w(\{\pi_{ij}\})$ and estimated variance $\mathrm{Var}(\kappa_w)$ if $N$ is reasonably large.
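A generic numerical version of the delta method under multinomial sampling can be sketched in Python as follows. This is not the closed-form variance derived in the paper: it approximates the partial derivatives by central finite differences and then applies the standard multinomial delta-method formula, the weighted second moment of the derivatives minus the square of their weighted first moment, divided by the sample size. The function names, the step size `h`, the 3 × 3 table, and the sample size of 100 are all assumptions for illustration.

```python
def weighted_kappa(p, w):
    """Weighted kappa in its disagreement form: 1 - observed / chance."""
    k = len(p)
    row = [sum(p[i]) for i in range(k)]
    col = [sum(p[i][j] for i in range(k)) for j in range(k)]
    a = sum(w[i][j] * p[i][j] for i in range(k) for j in range(k))
    b = sum(w[i][j] * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - a / b


def delta_variance(f, p, n, h=1e-6):
    """Approximate large-sample variance of f(p) for a multinomial sample of size n."""
    k = len(p)
    first = 0.0    # sum of p_ij * derivative
    second = 0.0   # sum of p_ij * derivative squared
    for i in range(k):
        for j in range(k):
            up = [r[:] for r in p]
            dn = [r[:] for r in p]
            up[i][j] += h
            dn[i][j] -= h
            g = (f(up) - f(dn)) / (2.0 * h)   # central-difference partial derivative
            first += p[i][j] * g
            second += p[i][j] * g * g
    return (second - first ** 2) / n


# Hypothetical 3 x 3 table of sample probabilities and absolute-difference weights
table = [[0.20, 0.10, 0.00],
         [0.10, 0.20, 0.10],
         [0.00, 0.10, 0.20]]
w = [[abs(i - j) / 2 for j in range(3)] for i in range(3)]
var_kw = delta_variance(lambda q: weighted_kappa(q, w), table, n=100)
print(f"approximate variance of weighted kappa: {var_kw:.6f}")
```

A simple sanity check is that, for the function returning a single cell probability, the sketch reproduces the familiar binomial variance of a sample proportion.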
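In order to derive the estimated variance of $\kappa_w$, express the population coefficient $\kappa_w(\{\pi_{ij}\})$ in (6) as one minus the ratio of the observed to the chance-expected weighted disagreement probability, and take the partial derivatives of the numerator, the denominator, and the ratio with respect to $\pi_{ij}$, with $\pi_{ij}$ then being replaced with the estimated probabilities $p_{ij}$ for all $i$ and $j$. The delta method applied to these derivatives, via (23)-(24), then gives the estimated variance of $\kappa_w$.

Suppose now that the categories in Table 2 are ordinal so that the weighted kappa coefficients would be appropriate. Then, with the weights $w_{ij} = |i - j|/(I - 1)$ in (7) and $v_{ij} = 1 - w_{ij}$ for all $i$ and $j$, it is found from Table 2 that $\sum_{i=1}^{3} \sum_{j=1}^{3} v_{ij} p_{ij} = 0.4150$ and $\sum_{i=1}^{3} \sum_{j=1}^{3} v_{ij} p_{i+} p_{+j} = 0.5610$ so that, from (17)-(18), $\kappa_w^- = -0.2602 \approx -0.26$. This weighted disagreement value differs considerably from the above $\kappa^- = -0.65$ value when the three categories are considered to be nominal.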
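The arithmetic of this example can be retraced in a few lines of Python, starting from the two weighted sums quoted above (the cell entries of Table 2 are not needed). The variable names are illustrative, and the comparison with Cohen's weighted kappa relies only on the complementary weights and on the fact that the products of the marginal probabilities sum to 1.

```python
sum_v_obs = 0.4150      # reported sum of v_ij * p_ij
sum_v_chance = 0.5610   # reported sum of v_ij * p_i+ * p_+j

# Weighted disagreement coefficient: ratio of the two sums minus one
kw_neg = sum_v_obs / sum_v_chance - 1.0

# Equivalent form in terms of the disagreement weights w_ij = 1 - v_ij
sum_w_obs = 1.0 - sum_v_obs
sum_w_chance = 1.0 - sum_v_chance
kw_neg_alt = -(sum_w_obs - sum_w_chance) / (1.0 - sum_w_chance)

# Cohen's weighted kappa for the same sums, for comparison
kw_cohen = 1.0 - sum_w_obs / sum_w_chance

print(round(kw_neg, 4), round(kw_neg_alt, 4), round(kw_cohen, 4))
```

Both forms of the corrected coefficient agree, whereas Cohen's weighted kappa for the same sums takes a different negative value.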

Logistic Transformation.
Instead of making statistical inferences about the kappa coefficients directly, as done above, it is likely advantageous to do so indirectly via the logistic transformation. Therefore, in the case of $\kappa^-$ in (13), consider the following logistic transformation of $1 + \kappa^-$ and its inverse:

$$L = \log\left[\frac{1 + \kappa^-}{-\kappa^-}\right], \qquad \kappa^- = -\frac{1}{1 + e^{L}}. \tag{32}$$

Since the derivative $dL/d\kappa^- = -1/[\kappa^-(1 + \kappa^-)]$, the estimated variance of $L$ becomes

$$\mathrm{Var}(L) = \frac{\mathrm{Var}(\kappa^-)}{[\kappa^-(1 + \kappa^-)]^2} \tag{33}$$

where $\mathrm{Var}(\kappa^-)$ is given in (31). An approximate confidence interval for the population equivalent of $L$ can then be constructed based on (33), with the corresponding confidence interval for $\kappa^-(\{\pi_{ij}\})$ resulting from the inverse transform in (32).
In the case of $\kappa_w^-$ in (17), $\kappa^-$ in (32)-(33) is simply replaced with $\kappa_w^-$. For $\kappa$ and $\kappa_w$ in (1) and (5)-(6), the transformation becomes $\log[\kappa/(1 - \kappa)]$ and $\log[\kappa_w/(1 - \kappa_w)]$. With such transformations, the lower end of a confidence interval for $\kappa^-$ or $\kappa_w^-$ cannot be less than −1 and the upper end of a confidence interval for $\kappa$ or $\kappa_w$ cannot exceed 1. Most importantly, the normal distribution approximation is likely to be improved with the above logistic transforms. Unless the sample size $N$ is very large, the distributions of the kappa coefficients are likely to be skewed, especially when a coefficient is near −1 or 1. For instance, when, say, the population coefficient $\kappa^-(\{\pi_{ij}\}) = -0.9$, the estimator $\kappa^-$ cannot be much smaller than $\kappa^-(\{\pi_{ij}\})$, but it could be much larger with nonnegligible probability. The logistic transformation to the $(-\infty, \infty)$-interval tends to correct for such skewness and provide for a more rapid convergence to normality.
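The interval construction via the logistic transformation can be sketched in Python as below. The sketch takes an estimate of the disagreement coefficient and an estimate of its variance as inputs, forms a normal-theory interval on the transformed scale, and maps the end points back. The function name, the 95% level, and the variance value of 0.01 are assumptions for illustration; the variance would in practice come from the delta method or from resampling.

```python
import math
from statistics import NormalDist


def logit_interval(kappa_neg, variance, level=0.95):
    """Confidence interval for a coefficient in (-1, 0) via the logit of 1 + coefficient."""
    if not -1.0 < kappa_neg < 0.0:
        raise ValueError("kappa_neg must lie strictly between -1 and 0")
    z = NormalDist().inv_cdf(0.5 + level / 2.0)           # standard normal quantile
    logit = math.log((1.0 + kappa_neg) / (-kappa_neg))    # transformed estimate
    se_logit = math.sqrt(variance) / abs(kappa_neg * (1.0 + kappa_neg))

    def back(x):
        # Inverse transform: maps the real line back into (-1, 0)
        return -1.0 / (1.0 + math.exp(x))

    return back(logit - z * se_logit), back(logit + z * se_logit)


# Estimate of -0.65 with a hypothetical variance estimate of 0.01
lower, upper = logit_interval(-0.65, variance=0.01)
print(f"approximate 95% interval: ({lower:.4f}, {upper:.4f})")
```

Because the inverse transform maps the whole real line into the interval from −1 to 0, the end points can never fall outside the admissible range, and the resulting interval is asymmetric around the estimate.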

Conclusion
If Cohen's kappa is accepted as an appropriate measure of interobserver agreement, as many do judging by its widespread use, then the corrections proposed here for negative kappa values should be equally acceptable. Of course, since the chance-expected disagreement (or agreement) terms in the new coefficients also depend exclusively on the marginal distributions, the criticism by some that Cohen's coefficients depend too much on the marginal distributions would similarly apply to the new coefficients. Such concern is particularly important in cases of highly uneven (nonuniform or "skewed") marginal distributions. If, however, those distributions are fairly even (uniform), Cohen's kappa and hence the measures proposed in this paper for interobserver disagreement (negative agreement) would seem to be reasonably acceptable agreement-disagreement measures.

Table 1: A 2 × 2 contingency table with marginal probabilities $r$ and $1 - r$ and with the entries in each cell as follows: first entry corresponds to $P_{AO} = 0$, second entry corresponds to $P_{AO} = P_{AC}$, and third entry is the weighted mean of the other two entries with weights $\alpha$ and $1 - \alpha$, where $0 \leq \alpha \leq 1$.

Table 2: Results from $N = 100$ couples answering a multiple-choice question with three choice categories (fictitious data).