Journal of Mathematics (JMATH), Hindawi Publishing Corporation, ISSN 2314-4629 (print), 2314-4785 (online), Volume 2014, Article ID 203907, doi:10.1155/2014/203907. Research Article: New Interpretations of Cohen’s Kappa. Matthijs J. Warrens, Institute of Psychology, Unit Methodology and Statistics, Leiden University, P.O. Box 9555, 2300 RB Leiden, The Netherlands. Academic Editor: Yuehua Wu. Received 31 May 2014; Accepted 19 August 2014; Published 3 September 2014. Copyright © 2014 Matthijs J. Warrens. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Cohen’s kappa is a widely used association coefficient for summarizing interrater agreement on a nominal scale. Kappa reduces the ratings of the two observers to a single number. With three or more categories it is more informative to summarize the ratings by category coefficients that describe the information for each category separately. Examples of category coefficients are the sensitivity or specificity of a category or the Bloch-Kraemer weighted kappa. However, in many research studies one is often only interested in a single overall number that roughly summarizes the agreement. It is shown that both the overall observed agreement and Cohen’s kappa are weighted averages of various category coefficients and thus can be used to summarize these category coefficients.

1. Introduction

In various fields of science it is frequently required that an observer classifies a set of subjects into three or more nominal categories that are defined in advance. The observer may be a clinician who classifies children on the severity of a disease, a pathologist who rates the severity of lesions from scans, or a coder who transcribes interviews. If the observer did not fully understand what he or she was asked to interpret, or if the definition of the categories is ambiguous, the reliability of the rating system is at stake. To assess the reliability of the system researchers typically ask two or more observers to rate the same set of subjects independently. An analysis of the agreement between the observers can then be used as an indicator of the quality of the category definitions and the raters’ ability to apply them. High agreement between the ratings would indicate consensus in the diagnosis and interchangeability of the ratings.

There are several association coefficients that can be used for summarizing agreement between two observers. In biomedical and behavioral science research the most widely used coefficient for summarizing agreement on a scale with two or more nominal categories is Cohen’s kappa. The coefficient has been applied in thousands of research studies and is also frequently used for summarizing agreement if we have $n$ observers of one type paired with $n$ observers of a second type, and each of the $2n$ observers assigns a subject to one of $m$ categories. A closely related coefficient is Scott’s pi. The latter coefficient is commonly used in the field of content analysis [2, 10]. The two coefficients have similar formulas and differ in how agreement under chance is defined [3, 11].

Cohen’s kappa reduces the ratings of the two observers to a single real number. To provide a proper interpretation of the coefficient one must first understand its meaning. There are two descriptions of kappa in the literature. The observed or raw agreement is the proportion of subjects that is classified into the same nominal categories by both observers. Several authors have argued that the overall observed agreement is artificially high and should be corrected for agreement due to chance [4, 6, 12]. Kappa can be described as a chance-corrected version of the observed agreement. The second interpretation of kappa involves the $2 \times 2$ tables that are obtained by combining all the categories of the agreement table other than the one of current interest into a single category. If we have $m$ categories, there are $m$ associated $2 \times 2$ tables, one for each category. For each $2 \times 2$ table we may calculate the kappa value. The value of a category kappa is a measure of the agreement between the observers on the particular category [13, 14]. The overall kappa is a weighted average of the $m$ category kappas.

The interpretation of the overall kappa as an average of the category kappas has two consequences. On the one hand, if the category kappas are quite different, for example, high agreement on one category but low agreement on another category, the overall kappa cannot fully reflect the complexity of the agreement between the observers. If a researcher is interested in understanding the patterns of agreement and disagreement, it would be good practice to report (various) category coefficients for the individual categories, since this provides substantially more information than reporting only a single number. Alternatively, one can use log-linear or latent class models for modeling agreement. On the other hand, since the overall kappa is a weighted average, its value lies somewhere between the minimum and maximum of the category kappas. The overall kappa thus in a sense summarizes the agreement on the categories. If one is interested in a single number that roughly summarizes the agreement between the observers, which appears to be the case in many applications of Cohen’s kappa, then kappa can be used.

In this paper we present several new interpretations of the overall observed agreement, Cohen’s kappa, and Scott’s pi. The results presented here can be seen as support for the use of these coefficients as summary coefficients of the information on the categories. The paper is organized as follows. In Section 2 we present definitions of various category coefficients and three overall coefficients. The new interpretations are based on the correction for chance function and weighted averaging function of category coefficients. The domains and codomains of these functions are coefficient spaces. These spaces are also defined in Section 2. In Section 3 we define the correction for chance function, study some of its properties, and present an application. In Section 4 we define the weighted averaging function and study some of its properties. As an application of this function it is shown that Cohen’s kappa is an average of Bloch-Kraemer weighted kappas; a numerical illustration of this result is presented in Section 6. In Section 5 the composition of the correction for chance function and the averaging function is studied. It is shown that the functions commute under composition. It then follows that Cohen’s kappa and Scott’s pi are both averages of chance-corrected category coefficients, as well as chance-corrected versions of a weighted average of the category coefficients. The category coefficients include the sensitivity, specificity, and the positive and negative predictive values of the categories. Section 7 contains a conclusion.

2. Association Coefficients

2.1. Coefficient Spaces

For a population of $n$ subjects, let $p_{ij}$ denote the proportion classified into category $i$ by the first observer and into category $j$ by the second observer, where $1 \leq i, j \leq m$. The $m$ categories are nominal. Define (1) $p_{i+} = \sum_j p_{ij}$, $p_{+i} = \sum_j p_{ji}$. The quantities $p_{i+}$ and $p_{+i}$ are the marginal totals of the table $\{p_{ij}\}$. They satisfy (2) $\sum_i p_{i+} = \sum_i p_{+i} = 1$. For a fixed number of categories $m \geq 2$, association coefficients are here defined as functions from the set of all $m \times m$ tables with proportions into the real numbers. The domain of the functions is defined as (3) $M = \{\{p_{ij}\} \mid 0 \leq p_{ij} \leq 1, \sum_{i,j} p_{ij} = 1\}$. An association coefficient $A$ is then a function $A \colon M \to \mathbb{R}$ that assigns a real number to a contingency table. For many association coefficients the codomain is either the closed interval $[0,1]$ or the interval $[-1,1]$. For notational convenience we will assume in this paper that all association coefficients have maximum value unity ($A \leq 1$).

The set of all association coefficients is given by $\{A \colon M \to \mathbb{R}\}$. For most theoretical studies this set is too big. It turns out that the association coefficients that are used in data-analytic applications in real life belong to specific subsets of $\{A \colon M \to \mathbb{R}\}$. For example, some association coefficients only describe the information for a particular category $i$. For category $i$ all information is summarized in the element $p_{ii}$ and the totals $p_{i+}$ and $p_{+i}$. The diagonal element $p_{ii}$ denotes the proportion of subjects classified into category $i$ by both raters. It indicates how often the raters agreed on category $i$. The marginal totals $p_{i+}$ and $p_{+i}$ indicate how often category $i$ was used by the raters. Let $\lambda_i = \lambda_i(p_{i+}, p_{+i})$ and $\mu_i = \mu_i(p_{i+}, p_{+i})$ be functions of the marginal totals $p_{i+}$ and $p_{+i}$. For category $i \in \{1, 2, \ldots, m\}$ we define the set (4) $L_i = \{A \colon M \to \mathbb{R} \mid A = (p_{ii} + \lambda_i)/\mu_i, A \leq 1\}$. Given fixed marginal totals $p_{i+}$ and $p_{+i}$, the coefficient space $L_i$ consists of all linear transformations of $p_{ii}$. In the context of a validity study, examples of coefficients in $L_i$ are the sensitivity $p_{ii}/p_{i+}$, the positive predictive value $p_{ii}/p_{+i}$, and the specificity and the negative predictive value of category $i$. Additional examples of elements in $L_i$ are presented in the next section.
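The marginal totals in (1) and identity (2) are straightforward to compute. The following sketch illustrates them; the $3 \times 3$ proportion table is a hypothetical example, not data from the paper.

```python
# Marginal totals p_i+ (rows) and p_+i (columns) of a proportion
# table {p_ij}, as in Eq. (1); the example table is hypothetical.
def marginals(p):
    m = len(p)
    row = [sum(p[i]) for i in range(m)]                       # p_i+
    col = [sum(p[i][j] for i in range(m)) for j in range(m)]  # p_+i
    return row, col

p = [[0.20, 0.05, 0.05],
     [0.05, 0.25, 0.00],
     [0.05, 0.10, 0.25]]
row, col = marginals(p)
# Identity (2): both sets of marginals sum to one.
assert abs(sum(row) - 1.0) < 1e-9 and abs(sum(col) - 1.0) < 1e-9
```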

2.2. Examples of Category Coefficients

Since we are only interested in the quantities $p_{ii}$, $p_{i+}$, and $p_{+i}$ associated with category $i$, we can collapse the $m \times m$ contingency table $\{p_{ij}\}$ into a $2 \times 2$ table by combining all categories except category $i$. Table 1 presents the collapsed $2 \times 2$ table for category $i$. A $2 \times 2$ table can be the result of a reliability study involving two observers but also of a validity study. In the latter case a new test is usually compared to a “more-or-less gold standard.” For example, in a medical test evaluation one has a gold standard evaluation of the presence/absence or type of a disease against which a new test can be assessed. In this paper the rows of the contingency tables are associated with the gold standard, while the columns are associated with the new test.

Collapsed $2 \times 2$ table for category $i$.

Observer 1     Observer 2
               $i$                   All others                       Total
$i$            $p_{ii}$              $p_{i+} - p_{ii}$                $p_{i+}$
All others     $p_{+i} - p_{ii}$     $1 - p_{i+} - p_{+i} + p_{ii}$   $1 - p_{i+}$
Total          $p_{+i}$              $1 - p_{+i}$                     $1$
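The collapsing step behind Table 1 can be sketched as follows; the function name and the $3 \times 3$ example table are illustrative (categories are 0-indexed in the code).

```python
# Collapse an m x m proportion table to the 2 x 2 table of Table 1
# for category i (0-based index); rows are the gold standard.
def collapse(p, i):
    m = len(p)
    p_ii = p[i][i]
    p_i_plus = sum(p[i])                        # row total p_i+
    p_plus_i = sum(p[k][i] for k in range(m))   # column total p_+i
    return [[p_ii, p_i_plus - p_ii],
            [p_plus_i - p_ii, 1 - p_i_plus - p_plus_i + p_ii]]

p = [[0.20, 0.05, 0.05],
     [0.05, 0.25, 0.00],
     [0.05, 0.10, 0.25]]
t = collapse(p, 0)
# The four cells of the collapsed table still sum to one.
assert abs(sum(t[0]) + sum(t[1]) - 1.0) < 1e-9
```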

There is a vast literature on association coefficients for $2 \times 2$ tables. Many of these coefficients are elements of $L_i$. We consider three parameter families.

Example 1.

Let $r \in [0,1]$ be a weight and consider for $i \in \{1, 2, \ldots, m\}$ the functions (5) $\phi_i(r) = \frac{p_{ii}}{r p_{i+} + (1-r) p_{+i}}$. Coefficient $\phi_i(1)$ is the sensitivity of category $i$, while $\phi_i(0)$ is the positive predictive value. The coefficient $\phi_i(1/2)$ is the coefficient proposed in Dice, a widely used coefficient in ecological biology.
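A minimal sketch of the family in (5); the category quantities below are hypothetical numbers, not data from the paper.

```python
# Eq. (5): phi_i(1) is the sensitivity, phi_i(0) the positive
# predictive value, and phi_i(1/2) the Dice coefficient.
def phi(p_ii, p_i_plus, p_plus_i, r):
    return p_ii / (r * p_i_plus + (1 - r) * p_plus_i)

# hypothetical category: p_ii = 0.20, p_i+ = 0.25, p_+i = 0.40
sens = phi(0.20, 0.25, 0.40, 1.0)   # 0.20 / 0.25 = 0.8
ppv  = phi(0.20, 0.25, 0.40, 0.0)   # 0.20 / 0.40 = 0.5
dice = phi(0.20, 0.25, 0.40, 0.5)   # 0.20 / 0.325
```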

Lemma 2 shows that for all $r$ the function $\phi_i(r)$ belongs to $L_i$, the coefficient space associated with category $i$.

Lemma 2.

One has $\phi_i(r) \in L_i$ for all $r \in [0,1]$.

Proof.

We first show that $\phi_i(r) \leq 1$ for all $r$. We have $p_{ii} \leq \min\{p_{i+}, p_{+i}\}$, since the value of $p_{ii}$ cannot exceed the marginal totals $p_{i+}$ and $p_{+i}$. Furthermore, note that for fixed $p_{i+}$ and $p_{+i}$ the set $\{r p_{i+} + (1-r) p_{+i}\}$ is convex. It consists of all values between $\min\{p_{i+}, p_{+i}\}$ and $\max\{p_{i+}, p_{+i}\}$. Since $p_{i+}$ and $p_{+i}$ are nonnegative, all elements of the convex set $\{r p_{i+} + (1-r) p_{+i}\}$ are larger than or equal to $\min\{p_{i+}, p_{+i}\}$. Hence, $p_{ii} \leq r p_{i+} + (1-r) p_{+i}$ for all $r$ and it follows that $\phi_i(r) \leq 1$ for all $r$.

Next, we can write $\phi_i(r)$ as $(p_{ii} + \lambda_i)/\mu_i$, where (6a) $\lambda_i = 0$ and (6b) $\mu_i = r p_{i+} + (1-r) p_{+i}$. Hence, $\phi_i(r) \in L_i$ for all $r \in [0,1]$.

Example 3.

Let $r, s \in [0,1]$ be weights and consider the function (7) $\psi_i(r,s) = \frac{p_{ii} + s(1 - p_{i+} - p_{+i})}{r p_{i+} + (1-r) p_{+i} + s(1 - p_{i+} - p_{+i})}$. This two-parameter family was first studied in Warrens. Note that $\psi_i(r,0) = \phi_i(r)$; that is, if $s = 0$ we obtain the functions from Example 1. Since $\phi_i(r) \leq 1$ for all $r$ (Lemma 2), we also have $\psi_i(r,s) \leq 1$ for all $r, s$. Furthermore, we can write $\psi_i(r,s)$ as $(p_{ii} + \lambda_i)/\mu_i$, where (8) $\lambda_i = s(1 - p_{i+} - p_{+i})$ and (9) $\mu_i = r p_{i+} + (1-r) p_{+i} + s(1 - p_{i+} - p_{+i})$. Hence, $\psi_i(r,s) \in L_i$ for all $r, s \in [0,1]$. Several additional coefficients from the literature are special cases of $\psi_i(r,s)$. Coefficient $\psi_i(1/2, 1/2)$ is the observed agreement of the collapsed $2 \times 2$ table associated with category $i$, while coefficients $\psi_i(0,1)$ and $\psi_i(1,1)$ are, respectively, the specificity and negative predictive value of category $i$.

Example 4.

For measuring validity in a $2 \times 2$ study, Bloch and Kraemer proposed the weighted kappa coefficient. The coefficient is based on an acknowledgment that the clinical consequences of a false negative may be quite different from the clinical consequences of a false positive. A false negative may delay treatment of a patient, while a false positive may result in unnecessary treatment. The Bloch-Kraemer weighted kappa is unique in that it requires that a real number $r \in [0,1]$ be specified a priori, indicating the relative importance of the false negatives to the false positives. For category $i$ the weighted kappa is defined as [26, page 273] (10) $\kappa_i(r) = \frac{p_{ii} - p_{i+} p_{+i}}{r p_{i+}(1 - p_{+i}) + (1-r)(1 - p_{i+}) p_{+i}}$. For all $r$, coefficient $\kappa_i(r)$ can be used in the context of the utility of association. Coefficient (10) is an asymmetric special case of the weighted kappa proposed in Cohen. The latter weighted kappa is widely used with agreement tables with three or more ordinal categories.

Coefficient $\kappa_i(1/2)$ is the ordinary Cohen’s kappa for the $2 \times 2$ table associated with category $i$. It is a standard tool in a $2 \times 2$ reliability study. It is sometimes called the reliability of category $i$ [13, 14]. Coefficient $\kappa_i(1)$ is the coefficient of conditional agreement proposed in Coleman (see [32, page 367], and [33, page 397]). This coefficient can be used if one is interested in the agreement between the observers for those subjects that the first observer assigned to category $i$.

Since (11) $r p_{i+}(1 - p_{+i}) + (1-r)(1 - p_{i+}) p_{+i} = r p_{i+} - r p_{i+} p_{+i} + (1-r) p_{+i} - p_{i+} p_{+i} + r p_{i+} p_{+i} = r p_{i+} + (1-r) p_{+i} - p_{i+} p_{+i}$, we can write (10) as (12) $\kappa_i(r) = \frac{p_{ii} - p_{i+} p_{+i}}{r p_{i+} + (1-r) p_{+i} - p_{i+} p_{+i}}$. We can write (12) as $(p_{ii} + \lambda_i)/\mu_i$, where (13a) $\lambda_i = -p_{i+} p_{+i}$ and (13b) $\mu_i = r p_{i+} + (1-r) p_{+i} - p_{i+} p_{+i}$. Hence, $\kappa_i(r) \in L_i$ for all $r \in [0,1]$.
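The simplified form (12) is easy to evaluate directly. In the sketch below the function name is illustrative; the example values are the Schizophrenia marginals of Table 2 in Section 6, for which $p_{i+} = p_{+i}$ makes the coefficient independent of $r$.

```python
# Bloch-Kraemer weighted kappa in the simplified form of Eq. (12).
def bk_kappa(p_ii, p_i_plus, p_plus_i, r):
    chance = p_i_plus * p_plus_i
    return (p_ii - chance) / (r * p_i_plus + (1 - r) * p_plus_i - chance)

# Schizophrenia in Table 2: p_ii = 40/223, p_i+ = p_+i = 65/223,
# so kappa_i(r) is the same for every r (about 0.457, cf. Table 3).
values = [bk_kappa(40 / 223, 65 / 223, 65 / 223, r) for r in (0.0, 0.5, 1.0)]
assert max(values) - min(values) < 1e-12
```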

Example 5.

For the $2 \times 2$ table associated with category $i$, the intraclass kappa [26, page 276] can be defined as (14) $\pi_i = \frac{p_{ii} - ((p_{i+} + p_{+i})/2)^2}{(p_{i+} + p_{+i})/2 - ((p_{i+} + p_{+i})/2)^2}$. The letter $\pi$ was originally used by Scott. Bloch and Kraemer showed that this coefficient can be used in the context of agreement. The intraclass kappa satisfies the classical definition of reliability [15, 18]. We can write (14) as $(p_{ii} + \lambda_i)/\mu_i$, where (15a) $\lambda_i = -\left(\frac{p_{i+} + p_{+i}}{2}\right)^2$ and (15b) $\mu_i = \frac{p_{i+} + p_{+i}}{2} - \left(\frac{p_{i+} + p_{+i}}{2}\right)^2$. Hence, $\pi_i \in L_i$.

2.3. Examples of Overall Coefficients

Coefficients in the sets $L_i$ for $i \in \{1, 2, \ldots, m\}$ only describe the information of one category at a time. Other association coefficients summarize the information in all categories at once. Let (16) $\lambda = \lambda(p_{1+}, \ldots, p_{m+}, p_{+1}, \ldots, p_{+m})$ and $\mu = \mu(p_{1+}, \ldots, p_{m+}, p_{+1}, \ldots, p_{+m})$ be functions of the marginal totals and define the set (17) $L = \{A \colon M \to \mathbb{R} \mid A = (\sum_i p_{ii} + \lambda)/\mu, A \leq 1\}$. Given fixed marginal totals the coefficient space $L$ consists of all linear transformations of the overall observed agreement $\sum_i p_{ii}$. Clearly, $\sum_i p_{ii}$ is an element of $L$. Other examples are Cohen’s kappa and Scott’s pi. The population value of Cohen’s kappa is defined as (18) $\kappa = \frac{\sum_i p_{ii} - \sum_i p_{i+} p_{+i}}{1 - \sum_i p_{i+} p_{+i}}$. The numerator of kappa is the difference between the actual probability of agreement and the probability of agreement in the case of statistical independence of the ratings. The denominator of kappa is the maximum possible value of the numerator. Kappa has value 1 when there is perfect agreement between the observers, 0 when agreement is equal to that expected by chance, and a negative value when agreement is less than that expected by chance. We can write kappa as $(\sum_i p_{ii} + \lambda)/\mu$, where (19a) $\lambda = -\sum_i p_{i+} p_{+i}$ and (19b) $\mu = 1 - \sum_i p_{i+} p_{+i}$. The population value of Scott’s pi is defined as [2, 9, 11] (20) $\pi = \frac{\sum_i p_{ii} - \sum_i ((p_{i+} + p_{+i})/2)^2}{1 - \sum_i ((p_{i+} + p_{+i})/2)^2}$. The differences in the definitions of agreement under chance are discussed in Examples 9 and 10 in the next section. We always have the inequality $\pi \leq \kappa$.
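Both overall coefficients in (18) and (20) can be computed in a few lines; the sketch below uses a hypothetical $3 \times 3$ table and also exhibits the inequality $\pi \leq \kappa$.

```python
# Cohen's kappa (Eq. 18) and Scott's pi (Eq. 20) for an m x m
# proportion table; the two differ only in the chance term.
def kappa_and_pi(p):
    m = len(p)
    row = [sum(p[i]) for i in range(m)]
    col = [sum(p[k][j] for k in range(m)) for j in range(m)]
    p_o = sum(p[i][i] for i in range(m))             # observed agreement
    e_kappa = sum(row[i] * col[i] for i in range(m))
    e_pi = sum(((row[i] + col[i]) / 2) ** 2 for i in range(m))
    return (p_o - e_kappa) / (1 - e_kappa), (p_o - e_pi) / (1 - e_pi)

p = [[0.20, 0.05, 0.05],   # hypothetical table
     [0.05, 0.25, 0.00],
     [0.05, 0.10, 0.25]]
k, s_pi = kappa_and_pi(p)
assert s_pi <= k   # Scott's pi never exceeds Cohen's kappa
```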

3. Correction for Chance

In this section we define the correction for chance function. The expectation $E(A)$ of a coefficient $A$ is conditional upon fixed marginal totals. The correction for chance function is denoted by $C$. For $A \in L_i$ it is defined as (21) $C \colon L_i \to L_i$, $A \mapsto \frac{A - E(A)}{1 - E(A)}$. For an association coefficient $A \in L$ the correction for chance function is defined as (22) $C \colon L \to L$, $A \mapsto \frac{A - E(A)}{1 - E(A)}$. The short formula is in both cases given by [3, 22, 35] (23) $C(A) = \frac{A - E(A)}{1 - E(A)}$. We assume in (23) that $E(A) < 1$ to avoid indeterminacy. Lemma 6 presents an alternative expression for $C(A)$ if $A \in L_i$.

Lemma 6.

Let $A \in L_i$ with $A = (p_{ii} + \lambda_i)/\mu_i$. One has (24) $C(A) = \frac{p_{ii} - E(p_{ii})}{\mu_i - \lambda_i - E(p_{ii})}$.

Proof.

Let $A \in L_i$ with $A = (p_{ii} + \lambda_i)/\mu_i$. Since $E$ is a linear operator we have (25) $E(A) = \frac{E(p_{ii}) + \lambda_i}{\mu_i}$. Using $A$ and $E(A)$ in (25) in (23) and multiplying all terms of the result by $\mu_i$, we obtain the expression in (24).

Lemma 7 presents an alternative expression for $C(A)$ if $A \in L$. The proof of Lemma 7 is similar to the proof of Lemma 6.

Lemma 7.

Let $A \in L$ with $A = (\sum_i p_{ii} + \lambda)/\mu$. One has (26) $C(A) = \frac{\sum_i (p_{ii} - E(p_{ii}))}{\mu - \lambda - \sum_i E(p_{ii})}$.

The function $C$ is a map from $L_i$ to $L_i$ if $L_i$ is closed under $C$. Lemma 8 shows that this is the case.

Lemma 8.

The spaces $L_i$ and $L$ are closed under $C$.

Proof.

We present the proof for $A \in L_i$ only. The proof for $A \in L$ follows from similar arguments.

Let $A \in L_i$ with $A = (p_{ii} + \lambda_i)/\mu_i$. The formula for $C(A)$ is presented in (24). Since $E(p_{ii})$ is a function of the marginal totals $p_{i+}$ and $p_{+i}$, we can write $C(A)$ as $(p_{ii} + \lambda_i^*)/\mu_i^*$, where (27a) $\lambda_i^* = -E(p_{ii})$ and (27b) $\mu_i^* = \mu_i - \lambda_i - E(p_{ii})$. Hence, $C(A) \in L_i$, and the result follows.

Formula (24) shows that elements of $L_i$ coincide after correction for chance if they have the same difference $\mu_i - \lambda_i$, regardless of the choice of $E(p_{ii})$. This suggests the following definition. Two coefficients $A_1, A_2 \in L_i$ are said to be equivalent with respect to (24), denoted by $A_1 \sim A_2$, if they have the same difference $\mu_i - \lambda_i$. It can be shown that $\sim$ is an equivalence relation on $L_i$. The equivalence relation $\sim$ divides the elements of $L_i$ into equivalence classes, one class for each value of the difference $\mu_i - \lambda_i$.

Different definitions of $E(p_{ii})$ provide different versions of the correction for chance formula. We consider two examples of $E(p_{ii})$. Additional examples can be found in [2, 3, 11, 22].

Example 9.

The expected value of $p_{ii}$ under statistical independence is given by (28) $E(p_{ii}) = p_{i+} p_{+i}$. In this case we assume that the data are a product of chance concerning two different frequency distributions.

Example 10.

Alternatively, we may assume that the data are a product of chance concerning a single frequency distribution [9, 11]. The common parameter is usually estimated by the arithmetic mean of the marginal totals $p_{i+}$ and $p_{+i}$. Hence, in this case we have (29) $E(p_{ii}) = \left(\frac{p_{i+} + p_{+i}}{2}\right)^2$.

Lemma 11 presents an application of the correction for chance function. In Lemma 11 the function is combined with Example 9. The result shows how the functions in Examples 1 and 3 are related to the function $\kappa_i(r)$ in Example 4.

Lemma 11.

Assume (28) holds. Then $C(\psi_i(r,s)) = \kappa_i(r)$ for all $r$ and $s$.

Proof.

Using $\lambda_i$ and $\mu_i$ in (8) and (9) we have (30) $\mu_i - \lambda_i = r p_{i+} + (1-r) p_{+i}$. Using (28) and (30) in (24) we obtain $\kappa_i(r)$ in (12).
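Lemma 11 can also be checked numerically. The sketch below uses hypothetical category quantities, and the helper names are illustrative.

```python
# Chance-correcting psi_i(r, s) with E(p_ii) = p_i+ p_+i (Eq. 28)
# recovers kappa_i(r), for every choice of s (Lemma 11).
def psi(p_ii, a, b, r, s):          # Eq. (7), with a = p_i+, b = p_+i
    extra = s * (1 - a - b)
    return (p_ii + extra) / (r * a + (1 - r) * b + extra)

p_ii, a, b, r = 0.20, 0.25, 0.40, 0.3    # hypothetical values
kappa_i = (p_ii - a * b) / (r * a + (1 - r) * b - a * b)   # Eq. (12)
for s in (0.0, 0.5, 1.0):
    value = psi(p_ii, a, b, r, s)
    expect = psi(a * b, a, b, r, s)  # E(psi): p_ii replaced by E(p_ii)
    corrected = (value - expect) / (1 - expect)            # Eq. (23)
    assert abs(corrected - kappa_i) < 1e-12
```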

4. Averaging over Categories

In this section we define a function that connects the association coefficients in the coefficient spaces $L_1, L_2, \ldots, L_m$ to the coefficients in the space $L$. For $i \in \{1, 2, \ldots, m\}$ let $A_i \in L_i$ with $A_i = (p_{ii} + \lambda_i)/\mu_i$. For these $m$ coefficients we define the function (31) $W \colon L_1 \times L_2 \times \cdots \times L_m \to L$, $(A_1, A_2, \ldots, A_m) \mapsto \frac{\sum_i (p_{ii} + \lambda_i)}{\sum_i \mu_i}$, or (32) $W(A_1, A_2, \ldots, A_m) = \frac{\sum_i (p_{ii} + \lambda_i)}{\sum_i \mu_i}$. Thus, $W(A_1, A_2, \ldots, A_m)$ is the weighted average of the $A_i$ using the denominators $\mu_i$ of the $A_i$ as weights. This weighted average is similar to the arithmetic mean of the category coefficients. In the calculation of the arithmetic mean each category coefficient contributes equally to the final average. In the calculation of $W$ some category coefficients contribute more than others. We check whether function (32) is well-defined.

Lemma 12.

Function (32) is well-defined.

Proof.

It must be shown that (33) $A = \frac{\sum_i p_{ii} + \sum_i \lambda_i}{\sum_i \mu_i}$ is an element of $L$. Since $\lambda_i$ and $\mu_i$ are each functions of the marginal totals $p_{i+}$ and $p_{+i}$, the sums $\sum_i \lambda_i$ and $\sum_i \mu_i$ are also functions of the marginal totals. Hence, we can write $A = (\sum_i p_{ii} + \sum_i \lambda_i)/\sum_i \mu_i$ as $A = (\sum_i p_{ii} + \lambda)/\mu$, where (34a) $\lambda = \sum_i \lambda_i$ and (34b) $\mu = \sum_i \mu_i$, from which the result follows.

In the remainder of this section we consider some results associated with the weighted average function in (32). If we fix $r$, then (5) provides $m$ association coefficients for describing the agreement between the observers, one for each category. Lemma 13 shows that a weighted average of these coefficients is equivalent to the overall observed agreement $\sum_i p_{ii}$, regardless of the value of $r$.

Lemma 13.

Let $r \in [0,1]$ be fixed. One has (35) $W(\phi_1(r), \phi_2(r), \ldots, \phi_m(r)) = \sum_i p_{ii}$.

Proof.

The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (6a) and (6b) we have (36) $\sum_i (p_{ii} + \lambda_i) = \sum_i p_{ii}$, and, using identity (2), (37) $\sum_i \mu_i = r \sum_i p_{i+} + (1-r) \sum_i p_{+i} = r + 1 - r = 1$.

If we fix $r$, then (12) provides us with $m$ Bloch-Kraemer weighted kappas for describing the agreement between the observers, one for each category. Lemma 14 shows that a weighted average of these coefficients is equivalent to Cohen’s kappa in (18), regardless of our choice of $r$.

Lemma 14.

Let $r \in [0,1]$ be fixed. One has (38) $W(\kappa_1(r), \kappa_2(r), \ldots, \kappa_m(r)) = \kappa$.

Proof.

The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (13a) and (13b) we have (39) $\sum_i (p_{ii} + \lambda_i) = \sum_i (p_{ii} - p_{i+} p_{+i})$, which is the numerator of $\kappa$, and, using identity (2), (40) $\sum_i \mu_i = \sum_i (r p_{i+} + (1-r) p_{+i} - p_{i+} p_{+i}) = r \sum_i p_{i+} + (1-r) \sum_i p_{+i} - \sum_i p_{i+} p_{+i} = r + 1 - r - \sum_i p_{i+} p_{+i} = 1 - \sum_i p_{i+} p_{+i}$, which is the denominator of $\kappa$.
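Lemma 14 is easy to confirm numerically; the sketch below uses a hypothetical $3 \times 3$ proportion table.

```python
# The mu_i-weighted average of the category kappas (Eq. 12) equals
# the overall Cohen's kappa (Eq. 18), for any choice of r (Lemma 14).
p = [[0.20, 0.05, 0.05],   # hypothetical table
     [0.05, 0.25, 0.00],
     [0.05, 0.10, 0.25]]
m = len(p)
row = [sum(p[i]) for i in range(m)]
col = [sum(p[k][j] for k in range(m)) for j in range(m)]
p_o = sum(p[i][i] for i in range(m))
e = sum(row[i] * col[i] for i in range(m))
kappa = (p_o - e) / (1 - e)                          # Eq. (18)
for r in (0.0, 1 / 3, 0.5, 2 / 3, 1.0):
    mus = [r * row[i] + (1 - r) * col[i] - row[i] * col[i]
           for i in range(m)]                        # Eq. (13b)
    kappas = [(p[i][i] - row[i] * col[i]) / mus[i] for i in range(m)]
    weighted = sum(mu * k for mu, k in zip(mus, kappas)) / sum(mus)
    assert abs(weighted - kappa) < 1e-12             # Lemma 14
```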

Lemma 15 shows that if we apply $W$ to the intraclass kappas $\pi_i$ in Example 5, then we obtain Scott’s pi.

Lemma 15.

One has (41) $W(\pi_1, \pi_2, \ldots, \pi_m) = \pi$.

Proof.

The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (15a) and (15b) we have (42) $\sum_i (p_{ii} + \lambda_i) = \sum_i p_{ii} - \sum_i \left(\frac{p_{i+} + p_{+i}}{2}\right)^2$, which is the numerator of $\pi$, and, using identity (2), (43) $\sum_i \mu_i = \sum_i \frac{p_{i+} + p_{+i}}{2} - \sum_i \left(\frac{p_{i+} + p_{+i}}{2}\right)^2 = 1 - \sum_i \left(\frac{p_{i+} + p_{+i}}{2}\right)^2$, which is the denominator of $\pi$.

5. Composition of Functions

In Sections 3 and 4 we studied the correction for chance function and the weighted average function separately. In this section we study the composition of the two functions. Lemma 16 shows that the two functions commute. Hence, changing the order of the functions does not change the result.

Lemma 16.

For $i \in \{1, 2, \ldots, m\}$ let $A_i \in L_i$ with $A_i = (p_{ii} + \lambda_i)/\mu_i$. One has (44) $W(C(A_1), C(A_2), \ldots, C(A_m)) = C(W(A_1, A_2, \ldots, A_m))$.

Proof.

We will show that both compositions are equivalent to (45) $\frac{\sum_i (p_{ii} - E(p_{ii}))}{\sum_i (\mu_i - \lambda_i) - \sum_i E(p_{ii})}$. The formula for the $C(A_i)$ is presented in (24). Adding the numerators of (24) we obtain the numerator of (45), and adding the denominators of (24) we obtain the denominator of (45). Hence, $W(C(A_1), C(A_2), \ldots, C(A_m))$ is equivalent to (45).

The formula for $W(A_1, A_2, \ldots, A_m)$ is presented in (32). The coefficient can be written as $(\sum_i p_{ii} + \lambda)/\mu$, where $\lambda$ and $\mu$ are presented in (34a) and (34b). Using this $\lambda$ and $\mu$ in (26) we also obtain (45).

Lemma 16 shows that we can either take the average of the chance-corrected versions of coefficients $A_1, A_2, \ldots, A_m$ or take a weighted average of the coefficients and then correct the overall coefficient for agreement due to chance. The result will be the same. Coefficient (45) contains two quantities that must be specified, namely, the expectation $E(p_{ii})$ and the sum of the differences $\mu_i - \lambda_i$. Using, for fixed $r$, $\lambda_i$ and $\mu_i$ in (6a) and (6b), (8) and (9), (13a) and (13b), or (15a) and (15b), we obtain (46) $\sum_i (\mu_i - \lambda_i) = 1$. Identity (46) shows that all coefficients discussed in Section 2 belong to a specific family of linear transformations. An example of a coefficient that does not belong to this family is the phi coefficient in (50).

Using identity (46) in (45) we obtain the overall coefficient (47) $\frac{\sum_i (p_{ii} - E(p_{ii}))}{1 - \sum_i E(p_{ii})}$. If we use $E(p_{ii})$ in (28) in (47) we obtain Cohen’s kappa, whereas if we use $E(p_{ii})$ in (29) in (47) we obtain Scott’s pi. The overall kappa is not a weighted average of phi coefficients.

6. A Numerical Illustration

In this section we present a numerical illustration of Lemma 14, which shows that for fixed $r$ Cohen’s kappa is a weighted average of the Bloch-Kraemer weighted kappas associated with each category. Let $n_{ij}$ denote the observed number of subjects that are classified into category $i$ by the first observer and into category $j$ by the second observer. Assuming a multinomial sampling model with the total number of subjects $n$ fixed, the maximum likelihood estimate of the cell probability $p_{ij}$ is given by $\hat{p}_{ij} = n_{ij}/n$. We obtain the maximum likelihood estimates $\hat{\kappa}_i(r)$ and $\hat{\kappa}$ by replacing the cell probabilities $p_{ij}$ by the $\hat{p}_{ij}$ in the Bloch-Kraemer weighted kappas in (12) and Cohen’s kappa in (18) [33, page 396]. Let (48) $\theta_1 = \sum_{i=1}^m \hat{p}_{ii}$, $\theta_2 = \sum_{i=1}^m \hat{p}_{i+} \hat{p}_{+i}$, $\theta_3 = \sum_{i=1}^m \hat{p}_{ii}(\hat{p}_{i+} + \hat{p}_{+i})$, and $\theta_4 = \sum_{i=1}^m \sum_{j=1}^m \hat{p}_{ij}(\hat{p}_{+i} + \hat{p}_{j+})^2$. The approximate large sample variance of $\hat{\kappa}$ [33, 34, 36] is given by (49) $\sigma^2(\hat{\kappa}) = \frac{1}{n}\left[\frac{\theta_1(1-\theta_1)}{(1-\theta_2)^2} + \frac{2(1-\theta_1)(2\theta_1\theta_2 - \theta_3)}{(1-\theta_2)^3} + \frac{(1-\theta_1)^2(\theta_4 - 4\theta_2^2)}{(1-\theta_2)^4}\right]$. The product-moment correlation coefficient or phi coefficient for the $2 \times 2$ table associated with category $i$ is given by (50) $\rho_i = \frac{\hat{p}_{ii} - \hat{p}_{i+} \hat{p}_{+i}}{\sqrt{\hat{p}_{i+}(1-\hat{p}_{i+}) \hat{p}_{+i}(1-\hat{p}_{+i})}}$. The asymptotic variance [26, page 279] of $\hat{\kappa}_i(r)$ is given by (51) $\sigma^2(\hat{\kappa}_i(r)) = \frac{\hat{p}_{i+}(1-\hat{p}_{i+}) \hat{p}_{+i}(1-\hat{p}_{+i})}{n[r \hat{p}_{i+}(1-\hat{p}_{+i}) + (1-r)(1-\hat{p}_{i+}) \hat{p}_{+i}]^2} \cdot V$, where (52) $V = 1 + 4 U_{i+} U_{+i} \rho_i - (1 + 3U_{i+}^2 + 3U_{+i}^2)\rho_i^2 + 2 U_{i+} U_{+i} \rho_i^3$, with $U_{i+} = \frac{(1/2) - \hat{p}_{i+}}{\sqrt{\hat{p}_{i+}(1-\hat{p}_{i+})}}$ and $U_{+i} = \frac{(1/2) - \hat{p}_{+i}}{\sqrt{\hat{p}_{+i}(1-\hat{p}_{+i})}}$.
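As a check on (48) and (49), the sketch below evaluates the large-sample variance of $\hat{\kappa}$ on the counts of Table 2; it reproduces the interval (0.341–0.522) reported in this section. Variable names are illustrative.

```python
import math

# Large-sample variance of kappa-hat, Eqs. (48)-(49), evaluated on
# the 4 x 4 counts of Table 2 (223 psychotic patients).
counts = [[40, 6, 4, 15],
          [4, 25, 1, 5],
          [4, 2, 21, 9],
          [17, 13, 12, 45]]
n = sum(map(sum, counts))                                     # 223
m = len(counts)
p = [[c / n for c in row] for row in counts]
R = [sum(p[i]) for i in range(m)]                             # p_i+
C = [sum(p[k][j] for k in range(m)) for j in range(m)]        # p_+i
t1 = sum(p[i][i] for i in range(m))
t2 = sum(R[i] * C[i] for i in range(m))
t3 = sum(p[i][i] * (R[i] + C[i]) for i in range(m))
t4 = sum(p[i][j] * (C[i] + R[j]) ** 2
         for i in range(m) for j in range(m))
var = (t1 * (1 - t1) / (1 - t2) ** 2
       + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
       + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4) / n
kappa = (t1 - t2) / (1 - t2)
lo, hi = kappa - 1.96 * math.sqrt(var), kappa + 1.96 * math.sqrt(var)
print(f"{kappa:.3f} ({lo:.3f}-{hi:.3f})")   # 0.432 (0.341-0.522)
```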

To illustrate Lemma 14 we consider the data in Table 2, taken from Fennig et al. These authors investigated the accuracy of clinical diagnosis in psychotic patients. As a gold standard they used the ratings of two project psychiatrists, called the research diagnosis. Table 2 presents the cross-classification of the research and clinical diagnoses. The estimate of the overall kappa for these data is $\hat{\kappa} = 0.432$ with 95% confidence interval (0.341–0.522), indicating a moderate overall level of agreement. Table 3 presents the estimates of the Bloch-Kraemer weighted kappas for the four categories, labeled S, B, D, and O, for five distinct values of $r$. The table also presents the associated 95% confidence intervals in parentheses.

Research and clinical diagnoses of disorders in 223 psychotic patients.

Research diagnosis Clinical diagnosis
Schizophrenia Bipolar disorder Depression Other Total
Schizophrenia 40 6 4 15 65
Bipolar disorder 4 25 1 5 35
Depression 4 2 21 9 36
Other 17 13 12 45 87

Total 65 46 38 74 223

Bloch-Kraemer weighted kappas for categories Schizophrenia, Bipolar disorder, Depression, and Other, for $r \in \{0, 1/3, 1/2, 2/3, 1\}$.

$r$    Schizophrenia          Bipolar disorder       Depression             Other
       $\hat{\kappa}_S(r)$ (95% CI)   $\hat{\kappa}_B(r)$ (95% CI)   $\hat{\kappa}_D(r)$ (95% CI)   $\hat{\kappa}_O(r)$ (95% CI)
0      0.457 (0.330–0.585)    0.458 (0.339–0.578)    0.467 (0.318–0.616)    0.357 (0.213–0.503)
1/3    0.457 (0.330–0.585)    0.506 (0.375–0.639)    0.476 (0.325–0.629)    0.326 (0.194–0.459)
1/2    0.457 (0.330–0.585)    0.534 (0.396–0.674)    0.482 (0.328–0.636)    0.312 (0.186–0.440)
2/3    0.457 (0.330–0.585)    0.565 (0.419–0.713)    0.487 (0.332–0.643)    0.300 (0.178–0.422)
1      0.457 (0.330–0.585)    0.640 (0.474–0.807)    0.498 (0.339–0.657)    0.277 (0.165–0.390)

The statistics for category Schizophrenia in Table 3 are equivalent for all values of $r$ because $\hat{p}_{S+} = \hat{p}_{+S} = 65/223 = 0.291$. We have $\hat{\kappa}_S(r) = 0.457$ with 95% confidence interval (0.330–0.585), indicating a moderate level of agreement on Schizophrenia. The level of agreement on the other categories depends on the value of $r$. The agreement on categories Bipolar disorder and Depression is higher than that of Schizophrenia for all values of $r$, while the agreement on category Other is lowest for all values of $r$. Finally, recall that, for fixed $r$, the overall kappa is a weighted average of the Bloch-Kraemer weighted kappas. For example, for $r = 0$ we have (53) $\frac{(0.207)(0.457) + (0.174)(0.458) + (0.143)(0.467) + (0.202)(0.357)}{0.207 + 0.174 + 0.143 + 0.202} = 0.432$, and for $r = 2/3$ we have (54) $\frac{(0.207)(0.457) + (0.141)(0.565) + (0.137)(0.487) + (0.241)(0.300)}{0.207 + 0.141 + 0.137 + 0.241} = 0.432$.
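The computation behind (53) can be sketched end to end from the raw counts of Table 2; only the choice $r = 0$ is fixed here, and the rounded values agree with Table 3.

```python
# Category kappas (Eq. 12) and weights mu_i (Eq. 13b) for Table 2
# at r = 0; their weighted average is the overall kappa-hat = 0.432.
counts = [[40, 6, 4, 15],
          [4, 25, 1, 5],
          [4, 2, 21, 9],
          [17, 13, 12, 45]]
n = sum(map(sum, counts))
m = len(counts)
R = [sum(counts[i]) / n for i in range(m)]                     # p_i+
C = [sum(counts[k][j] for k in range(m)) / n for j in range(m)]  # p_+i
r = 0.0
mus, kappas = [], []
for i in range(m):
    mu = r * R[i] + (1 - r) * C[i] - R[i] * C[i]               # Eq. (13b)
    mus.append(mu)
    kappas.append((counts[i][i] / n - R[i] * C[i]) / mu)       # Eq. (12)
overall = sum(mu * k for mu, k in zip(mus, kappas)) / sum(mus)
# category kappas round to 0.457, 0.458, 0.467, 0.357 (Table 3, r = 0)
print([round(k, 3) for k in kappas], round(overall, 3))
```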

The data in Tables 2 and 3 show that if we use the same category coefficient for all categories, then the coefficients in general produce different values. This observation holds for almost all real life data. Table 4 presents a hypothetical data set with three nominal categories. Table 5 presents the corresponding estimates of the Bloch-Kraemer weighted kappas for the three categories, labeled A, B, and C, for five distinct values of $r$, together with the associated 95% confidence intervals. The statistics for category B in Table 5 are equivalent for all values of $r$ because $\hat{p}_{B+} = \hat{p}_{+B} = 120/174 = 0.690$. The estimate of the overall kappa for these data is $\hat{\kappa} = 0.356$ with 95% confidence interval (0.229–0.482). Furthermore, all the estimates of the category kappas $\hat{\kappa}_i(1/2)$ have the same value 0.356. Thus, in this hypothetical case the overall kappa is a perfect summary coefficient of the three category kappas. Due to Lemma 14, we know that the overall kappa also roughly summarizes the other Bloch-Kraemer weighted kappas. However, these weighted kappas have quite distinct values. These data illustrate that, while the overall kappa is always a summary coefficient of all types of Bloch-Kraemer category kappas, it may summarize one particular type of weighted kappa perfectly and yet be a poor summary coefficient for other types of category coefficients.

Hypothetical diagnoses of three disorders in 174 psychotic patients.

                    Diagnosis II
Diagnosis I      A      B      C    Total
Type A          12      0      6       18
Type B          24     96      0      120
Type C           0     24     12       36
Total           36    120     18      174

Bloch-Kraemer weighted kappas for categories A, B, and C, for r ∈ {0, 1/3, 1/2, 2/3, 1}.

  r     κ̂_A(r)  95% CI           κ̂_B(r)  95% CI           κ̂_C(r)  95% CI
  0      0.256  (0.139–0.374)     0.356  (0.208–0.504)     0.580  (0.315–0.846)
 1/3     0.315  (0.171–0.460)     0.356  (0.208–0.504)     0.408  (0.221–0.596)
 1/2     0.356  (0.193–0.519)     0.356  (0.208–0.504)     0.356  (0.193–0.519)
 2/3     0.408  (0.221–0.596)     0.356  (0.208–0.504)     0.315  (0.171–0.460)
  1      0.580  (0.315–0.846)     0.356  (0.208–0.504)     0.256  (0.139–0.374)
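The point estimates in Table 5 and the overall kappa can be reproduced from the counts in Table 4. The sketch below assumes that the Bloch-Kraemer weighted kappa of (12) takes the form κ_i(r) = (p_ii − p_i+ p_+i) / (r p_i+(1 − p_+i) + (1 − r) p_+i(1 − p_i+)); this form is an assumption here, but it is consistent with all tabled values:

```python
# Reproduce the point estimates of Table 5 from the counts of Table 4.
# Assumed form of the Bloch-Kraemer weighted kappa for category i:
#   kappa_i(r) = (p_ii - p_i+ * p_+i) /
#                (r * p_i+ * (1 - p_+i) + (1 - r) * p_+i * (1 - p_i+)).

counts = [[12, 0, 6],    # Diagnosis I: Type A
          [24, 96, 0],   # Type B
          [0, 24, 12]]   # Type C
n = sum(sum(row) for row in counts)  # 174 patients

def category_kappa(i, r):
    """Bloch-Kraemer weighted kappa for category i with weight r."""
    p_ii = counts[i][i] / n
    p_row = sum(counts[i]) / n                 # p_i+
    p_col = sum(row[i] for row in counts) / n  # p_+i
    num = p_ii - p_row * p_col
    den = r * p_row * (1 - p_col) + (1 - r) * p_col * (1 - p_row)
    return num / den

def overall_kappa():
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    p_o = sum(counts[i][i] for i in range(3)) / n
    p_e = sum((sum(counts[i]) / n) * (sum(row[i] for row in counts) / n)
              for i in range(3))
    return (p_o - p_e) / (1 - p_e)

assert round(category_kappa(0, 0), 3) == 0.256    # kappa_A(0)
assert round(category_kappa(0, 1), 3) == 0.580    # kappa_A(1)
assert round(category_kappa(1, 0.5), 3) == 0.356  # kappa_B(r), any r
assert round(overall_kappa(), 3) == 0.356         # overall kappa
```

Since p̂_B+ = p̂_+B for category B, its denominator reduces to p̂_B+(1 − p̂_+B) regardless of r, which is why the B column of Table 5 is constant.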
7. Conclusion

Cohen’s kappa is a commonly used association coefficient for summarizing agreement between two observers on a nominal scale. The coefficient reduces the ratings of the two observers to a single real number, which in general leads to a substantial loss of information. A more complete picture of the interobserver agreement is obtained by assessing the degree of agreement on the individual categories. There are various association coefficients that can be used to describe the information for each category separately, for example, the sensitivity and specificity of a category, the positive predictive value, the negative predictive value, and the Bloch-Kraemer weighted kappa. Once we have selected a category coefficient, we have multiple coefficients describing the agreement between the observers, one for each category. If one is interested in a single number that roughly summarizes the agreement between the observers, which overall coefficient should be used? The results derived in this paper show that the overall observed agreement, Cohen’s kappa, and Scott’s pi are proper overall coefficients: each is a weighted average of certain category coefficients, and therefore its value lies somewhere between the minimum and the maximum of the category coefficients. We enumerate some of the new interpretations that were found.

Suppose that each category coefficient is the same special case of the function in (5). Examples are the sensitivity, the positive predictive value, and the Dice coefficient. The observed agreement is then a weighted average of the category coefficients (Lemma 13).

Suppose that each category coefficient is the same Bloch-Kraemer weighted kappa in (12). Then Cohen’s kappa is a weighted average of the weighted kappas (Lemma 14).

Suppose that each category coefficient is the intraclass kappa in (14). Then Scott’s pi is a weighted average of the intraclass kappas (Lemma 15).

Suppose that the value of a coefficient under chance is the value under statistical independence. Furthermore, suppose that each category coefficient is the same special case of the general function in (7). Examples are the sensitivity, specificity, positive predictive value, negative predictive value, the observed agreement, and the Dice coefficient. Then Cohen’s kappa is both a weighted average of the chance-corrected category coefficients and a chance-corrected version of a weighted average of the category coefficients (Lemma 16).
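The first interpretation above (Lemma 13) can be checked concretely for the Dice coefficient case on the counts of Table 4. The weights (p_i+ + p_+i)/2 used below are our assumption; with this choice the weighted average collapses exactly to the observed agreement, since the Dice coefficient of category i equals 2p_ii/(p_i+ + p_+i):

```python
# Illustration of the weighted-average interpretation of the observed
# agreement (Lemma 13) for the Dice coefficient case, using Table 4.
# Assumed weights: w_i = (p_i+ + p_+i) / 2. Then
#   sum(w_i * Dice_i) / sum(w_i) = sum(p_ii) = observed agreement,
# because Dice_i = 2 * p_ii / (p_i+ + p_+i) and the weights sum to 1.

counts = [[12, 0, 6], [24, 96, 0], [0, 24, 12]]
n = sum(sum(row) for row in counts)

dice, weights = [], []
for i in range(3):
    p_ii = counts[i][i] / n
    p_row = sum(counts[i]) / n                 # p_i+
    p_col = sum(row[i] for row in counts) / n  # p_+i
    dice.append(2 * p_ii / (p_row + p_col))
    weights.append((p_row + p_col) / 2)

weighted_avg = sum(w * d for w, d in zip(weights, dice)) / sum(weights)
observed_agreement = sum(counts[i][i] for i in range(3)) / n  # 120/174

assert abs(weighted_avg - observed_agreement) < 1e-12
```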

An illustration of Lemma 14 was presented in Section 6. The lemmas presented in this paper show that there is an abundance of category coefficients of which the observed agreement and Cohen’s kappa are summary coefficients. The results provide a basis for using these overall coefficients if one is only interested in a single number that roughly summarizes the agreement between the observers. If, on the other hand, one is interested in understanding the patterns of agreement and disagreement, one can report various category coefficients for the individual categories, or consider log-linear or latent class models for modeling the agreement.

Conflict of Interests

The author declares that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This research is part of Veni Project 451-11-026 funded by The Netherlands Organisation for Scientific Research.

References

[1] L. M. Hsu and R. Field, “Interrater agreement measures: comments on Kappa n, Cohen’s Kappa, Scott’s π, and Aickin’s α,” Understanding Statistics, vol. 2, pp. 205–219, 2003.
[2] K. Krippendorff, “Reliability in content analysis: some common misconceptions and recommendations,” Human Communication Research, vol. 30, no. 3, pp. 411–433, 2004.
[3] M. J. Warrens, “Inequalities between kappa and kappa-like statistics for K × K tables,” Psychometrika, vol. 75, no. 1, pp. 176–185, 2010.
[4] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
[5] J. A. Hanley, “Standard error of the kappa statistic,” Psychological Bulletin, vol. 102, no. 2, pp. 315–321, 1987.
[6] M. Maclure and W. C. Willett, “Misinterpretation and misuse of the Kappa statistic,” American Journal of Epidemiology, vol. 126, no. 2, pp. 161–169, 1987.
[7] M. J. Warrens, “On the equivalence of Cohen’s kappa and the Hubert-Arabie adjusted Rand index,” Journal of Classification, vol. 25, no. 2, pp. 177–183, 2008.
[8] M. J. Warrens, “Cohen’s kappa can always be increased and decreased by combining categories,” Statistical Methodology, vol. 7, no. 6, pp. 673–677, 2010.
[9] W. A. Scott, “Reliability of content analysis: the case of nominal scale coding,” Public Opinion Quarterly, vol. 19, no. 3, pp. 321–325, 1955.
[10] K. Krippendorff, Content Analysis: An Introduction to Its Methodology, Sage, Thousand Oaks, Calif, USA, 2nd edition, 2004.
[11] K. Krippendorff, “Association, agreement, and equity,” Quality and Quantity, vol. 21, no. 2, pp. 109–123, 1987.
[12] J. L. Fleiss, “Measuring agreement between two judges on the presence or absence of a trait,” Biometrics, vol. 31, no. 3, pp. 651–659, 1975.
[13] J. L. Fleiss, Statistical Methods for Rates and Proportions, Wiley, New York, NY, USA, 1981.
[14] J. L. Fleiss, B. Levin, and M. C. Paik, Statistical Methods for Rates and Proportions, John Wiley & Sons, New York, NY, USA, 3rd edition, 2003.
[15] H. C. Kraemer, “Ramifications of a population model for κ as a coefficient of reliability,” Psychometrika, vol. 44, no. 4, pp. 461–472, 1979.
[16] S. Vanbelle and A. Albert, “Agreement between two independent groups of raters,” Psychometrika, vol. 74, no. 3, pp. 477–491, 2009.
[17] M. J. Warrens, “Cohen’s kappa is a weighted average,” Statistical Methodology, vol. 8, no. 6, pp. 473–484, 2011.
[18] H. C. Kraemer, V. S. Periyakoil, and A. Noda, “Kappa coefficients in medical research,” Statistics in Medicine, vol. 21, no. 14, pp. 2109–2129, 2002.
[19] A. Agresti, “Modelling patterns of agreement and disagreement,” Statistical Methods in Medical Research, vol. 1, no. 2, pp. 201–218, 1992.
[20] S. Fennig, T. J. Craig, M. Tanenberg-Karant, and E. J. Bromet, “Comparison of facility and research diagnoses in first-admission psychotic patients,” The American Journal of Psychiatry, vol. 151, no. 10, pp. 1423–1429, 1994.
[21] F. B. Baulieu, “A classification of presence/absence based dissimilarity coefficients,” Journal of Classification, vol. 6, no. 2, pp. 233–246, 1989.
[22] M. J. Warrens, “On similarity coefficients for 2 × 2 tables and correction for chance,” Psychometrika, vol. 73, no. 3, pp. 487–502, 2008.
[23] M. J. Warrens, “On association coefficients for 2 × 2 tables and properties that do not depend on the marginal distributions,” Psychometrika, vol. 73, no. 4, pp. 777–789, 2008.
[24] M. J. Warrens, “Chance-corrected measures for 2 × 2 tables that coincide with weighted kappa,” British Journal of Mathematical and Statistical Psychology, vol. 64, no. 2, pp. 355–365, 2011.
[25] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, pp. 297–302, 1945.
[26] D. A. Bloch and H. C. Kraemer, “2 × 2 kappa coefficients: measures of agreement or association,” Biometrics, vol. 45, no. 1, pp. 269–287, 1989.
[27] J. Cohen, “Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit,” Psychological Bulletin, vol. 70, no. 4, pp. 213–220, 1968.
[28] M. J. Warrens, “Some paradoxical results for the quadratically weighted kappa,” Psychometrika, vol. 77, no. 2, pp. 315–323, 2012.
[29] M. J. Warrens, “Equivalences of weighted kappas for multiple raters,” Statistical Methodology, vol. 9, no. 3, pp. 407–422, 2012.
[30] M. J. Warrens, “Conditional inequalities between Cohen’s kappa and weighted kappas,” Statistical Methodology, vol. 10, pp. 14–22, 2013.
[31] J. S. Coleman, Measures of Concordance or Consensus between Members of Social Groups, Johns Hopkins University, 1966.
[32] R. J. Light, “Measures of response agreement for qualitative data: some generalizations and alternatives,” Psychological Bulletin, vol. 76, no. 5, pp. 365–377, 1971.
[33] Y. M. M. Bishop, S. E. Fienberg, and P. W. Holland, Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, UK, 1975.
[34] A. Agresti, Categorical Data Analysis, Wiley-Interscience, Hoboken, NJ, USA, 2nd edition, 2002.
[35] A. N. Albatineh, M. Niewiadomska-Bugaj, and D. Mihalko, “On similarity indices and correction for chance agreement,” Journal of Classification, vol. 23, no. 2, pp. 301–313, 2006.
[36] J. L. Fleiss, J. Cohen, and B. S. Everitt, “Large sample standard errors of kappa and weighted kappa,” Psychological Bulletin, vol. 72, no. 5, pp. 323–327, 1969.