Cohen’s kappa is a widely used association coefficient for summarizing interrater agreement on a nominal scale. Kappa reduces the ratings of the two observers to a single number. With three or more categories it is more informative to summarize the ratings by category coefficients that describe the information for each category separately. Examples of category coefficients are the sensitivity or specificity of a category or the Bloch-Kraemer weighted kappa. However, in many research studies one is often only interested in a single overall number that roughly summarizes the agreement. It is shown that both the overall observed agreement and Cohen’s kappa are weighted averages of various category coefficients and thus can be used to summarize these category coefficients.
1. Introduction
In various fields of science it is frequently required that an observer classifies a set of subjects into three or more nominal categories that are defined in advance. The observer may be a clinician who classifies children on the severity of a disease, a pathologist that rates the severity of lesions from scans, or a coder that transcribes interviews. If the observer did not fully understand what he or she was asked to interpret, or if the definition of the categories is ambiguous, the reliability of the rating system is at stake. To assess the reliability of the system researchers typically ask two or more observers to rate the same set of subjects independently. An analysis of the agreement between the observers can then be used as an indicator of the quality of the category definitions and the raters’ ability to apply them. High agreement between the ratings would indicate consensus in the diagnosis and interchangeability of the ratings.
There are several association coefficients that can be used for summarizing agreement between two observers [1–3]. In biomedical and behavioral science research the most widely used coefficient for summarizing agreement on a scale with two or more nominal categories is Cohen’s kappa [4–8]. The coefficient has been applied in thousands of research studies and is also frequently used for summarizing agreement if we have n observers of one type paired with n observers of a second type, and each of the 2n observers assigns a subject to one of m categories. A closely related coefficient is Scott’s pi [9]. The latter coefficient is commonly used in the field of content analysis [2, 10]. The two coefficients have similar formulas and differ in how agreement under chance is defined [3, 11].
Cohen’s kappa reduces the ratings of the two observers to a single real number. To provide a proper interpretation of the coefficient one must first understand its meaning. There are two descriptions of kappa in the literature. The observed or raw agreement is the proportion of subjects that is classified into the same nominal categories by both observers. Several authors have argued that the overall observed agreement is artificially high and should be corrected for agreement due to chance [4, 6, 12]. Kappa can be described as a chance-corrected version of the observed agreement. The second interpretation of kappa involves the 2×2 tables that are obtained by combining all the categories of the agreement table other than the one of current interest into a single category. If we have m categories, there are m associated 2×2 tables, one for each category. For each 2×2 table we may calculate the kappa value. The value of a category kappa is a measure of the agreement between the observers on the particular category [13, 14]. The overall kappa is a weighted average of the m category kappas [15–17].
The interpretation of the overall kappa as an average of the category kappas has two consequences. On the one hand, if the category kappas are quite different, for example, high agreement on one category but low agreement on another category, the overall kappa cannot fully reflect the complexity of the agreement between the observers [18]. If a researcher is interested in understanding the patterns of agreement and disagreement, it would be good practice to report (various) category coefficients for the individual categories, since this provides substantially more information than reporting only a single number. Alternatively, one can use log-linear or latent class models for modeling agreement [19]. On the other hand, since the overall kappa is a weighted average, its value lies somewhere between the minimum and maximum of the category kappas. The overall kappa thus in a sense summarizes the agreement on the categories. If one is interested in a single number that roughly summarizes the agreement between the observers, which appears to be the case in many applications of Cohen’s kappa, then kappa can be used.
In this paper we present several new interpretations of the overall observed agreement, Cohen’s kappa, and Scott’s pi. The results presented here can be seen as support for the use of these coefficients as summary coefficients of the information on the categories. The paper is organized as follows. In Section 2 we present definitions of various category coefficients and three overall coefficients. The new interpretations are based on the correction for chance function and weighted averaging function of category coefficients. The domains and codomains of these functions are coefficient spaces. These spaces are also defined in Section 2. In Section 3 we define the correction for chance function, study some of its properties, and present an application. In Section 4 we define the weighted averaging function and study some of its properties. As an application of this function it is shown that Cohen’s kappa is an average of Bloch-Kraemer weighted kappas. A numerical illustration of this result is presented in Section 6. Finally, in Section 5 the composition of the correction for chance function and the averaging function is studied. It is shown that the functions commute under composition. It then follows that Cohen’s kappa and Scott’s pi are both averages of chance-corrected category coefficients, as well as chance-corrected versions of a weighted average of the category coefficients. The category coefficients include the sensitivity, specificity, and the positive and negative predictive values of the categories. Section 7 contains a conclusion.
2. Association Coefficients
2.1. Coefficient Spaces
For a population of $n$ subjects, let $p_{ij}$ denote the proportion classified into category $i$ by the first observer and into category $j$ by the second observer, where $1 \le i, j \le m$. The $m$ categories are nominal. Define
$$p_{i+} = \sum_{j} p_{ij}, \qquad p_{+i} = \sum_{j} p_{ji}. \tag{1}$$
The quantities $p_{i+}$ and $p_{+i}$ are the marginal totals of the table $\{p_{ij}\}$. They satisfy
$$\sum_{i} p_{i+} = \sum_{i} p_{+i} = 1. \tag{2}$$
For a fixed number of categories m≥2, association coefficients are here defined as functions from the set of all m×m tables with proportions into the real numbers. The domain of the functions is defined as
$$M = \Bigl\{ \{p_{ij}\} \;\Big|\; 0 \le p_{ij} \le 1,\ \sum_{i,j} p_{ij} = 1 \Bigr\}. \tag{3}$$
An association coefficient A is then a function A:M→R that assigns a real number to a contingency table. For many association coefficients the codomain is either the closed interval [0,1] or the interval [-1,1]. For notational convenience we will assume in this paper that all association coefficients have maximum value unity (A≤1).
The set of all association coefficients is given by $\{A : M \to \mathbb{R}\}$. For most theoretical studies this set is too big. It turns out that the association coefficients used in real-life data-analytic applications belong to specific subsets of $\{A : M \to \mathbb{R}\}$. For example, some association coefficients only describe the information for a particular category $i$. For category $i$ all information is summarized in the element $p_{ii}$ and the totals $p_{i+}$ and $p_{+i}$. The diagonal element $p_{ii}$ denotes the proportion of subjects classified into category $i$ by both raters. It indicates how often the raters agreed on category $i$. The marginal totals $p_{i+}$ and $p_{+i}$ indicate how often category $i$ was used by the raters. Let $\lambda_i = \lambda_i(p_{i+}, p_{+i})$ and $\mu_i = \mu_i(p_{i+}, p_{+i})$ be functions of the marginal totals $p_{i+}$ and $p_{+i}$. For category $i \in \{1, 2, \ldots, m\}$ we define the set
$$L_i = \Bigl\{ A : M \to \mathbb{R} \;\Big|\; A = \frac{p_{ii} + \lambda_i}{\mu_i},\ A \le 1 \Bigr\}. \tag{4}$$
Given fixed marginal totals pi+ and p+i, the coefficient space Li consists of all linear transformations of pii. In the context of a validity study, examples of coefficients in Li are the sensitivity pii/pi+, the positive predictive value pii/p+i, and the specificity and the negative predictive value of category i. Additional examples of elements in Li are presented in the next section.
2.2. Examples of Category Coefficients
Since we are only interested in the quantities pii and pi+ and p+i associated with category i, we can collapse the m×m contingency table {pij} into a 2×2 table by combining all categories except category i. Table 1 presents the collapsed 2×2 table for category i. A 2×2 table can be the result of a reliability study involving two observers but also of a validity study. In the latter case a new test is usually compared to a “more-or-less gold standard.” For example, in a medical test evaluation one has a gold standard evaluation of the presence/absence or type of a disease against which a new test can be assessed. In this paper the rows of the contingency tables are associated with the gold standard, while the columns are associated with the new test.
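Table 1: Collapsed 2×2 table for category $i$.

| Observer 1 \ Observer 2 | $i$ | All others | Total |
|---|---|---|---|
| $i$ | $p_{ii}$ | $p_{i+} - p_{ii}$ | $p_{i+}$ |
| All others | $p_{+i} - p_{ii}$ | $1 - p_{i+} - p_{+i} + p_{ii}$ | $1 - p_{i+}$ |
| Total | $p_{+i}$ | $1 - p_{+i}$ | $1$ |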
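The collapsing step described above can be sketched in a few lines of Python. This is only an illustration: the function name `collapse` and the list-of-lists layout are our own choices, and the counts are the ones reported later for the Fennig et al. data (Table 2), with rows for the first observer and columns for the second.

```python
def collapse(table, i):
    """Collapse an m x m agreement table into the 2 x 2 table for category i.

    Rows belong to the first observer, columns to the second.
    Works for counts as well as proportions.
    """
    total = sum(sum(row) for row in table)
    row_total = sum(table[i])                  # p_{i+} (times n for counts)
    col_total = sum(row[i] for row in table)   # p_{+i} (times n for counts)
    both = table[i][i]                         # p_{ii}
    return [
        [both, row_total - both],
        [col_total - both, total - row_total - col_total + both],
    ]


# Counts of the 4 x 4 table reported in Table 2 (order: S, B, D, O).
table2 = [
    [40, 6, 4, 15],
    [4, 25, 1, 5],
    [4, 2, 21, 9],
    [17, 13, 12, 45],
]
schizophrenia = collapse(table2, 0)
bipolar = collapse(table2, 1)
print(schizophrenia)
print(bipolar)
```

Every collapsed table keeps the grand total of the original table, so dividing its cells by that total gives the proportions of Table 1.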
There is a vast literature on association coefficients for 2×2 tables [21–24]. Many of these coefficients are elements of Li. We consider three parameter families.
Example 1.
Let r∈[0,1] be a weight and consider for i∈{1,2,…,m} the functions
$$\phi_i(r) = \frac{p_{ii}}{r\,p_{i+} + (1-r)\,p_{+i}}. \tag{5}$$
Coefficient ϕi(1) is the sensitivity of category i, while ϕi(0) is the positive predictive value. The coefficient ϕi(1/2) is the coefficient proposed in Dice [25], a widely used coefficient in ecological biology.
Lemma 2 shows that for all r the function ϕi(r) belongs to Li, the coefficient space associated with category i.
Lemma 2.
One has ϕi(r)∈Li for all r∈[0,1].
Proof.
We first show that $\phi_i(r) \le 1$ for all $r$. We have $p_{ii} \le \min\{p_{i+}, p_{+i}\}$, since the value of $p_{ii}$ cannot exceed the marginal totals $p_{i+}$ and $p_{+i}$. Furthermore, note that for fixed $p_{i+}$ and $p_{+i}$ the set $\{r\,p_{i+} + (1-r)\,p_{+i} : r \in [0,1]\}$ is convex. It consists of all values between $\min\{p_{i+}, p_{+i}\}$ and $\max\{p_{i+}, p_{+i}\}$. Since $p_{i+}$ and $p_{+i}$ are nonnegative, all elements in this convex set are larger than or equal to $\min\{p_{i+}, p_{+i}\}$. Hence, $p_{ii} \le r\,p_{i+} + (1-r)\,p_{+i}$ for all $r$ and it follows that $\phi_i(r) \le 1$ for all $r$.
Next, we can write $\phi_i(r)$ as $(p_{ii} + \lambda_i)/\mu_i$, where
$$\lambda_i = 0, \tag{6a}$$
$$\mu_i = r\,p_{i+} + (1-r)\,p_{+i}. \tag{6b}$$
Hence, ϕi(r)∈Li for all r∈[0,1].
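As a small illustration of Example 1, the following Python sketch evaluates the family in (5) at the three weights mentioned above. The function name `phi` and the proportions used are hypothetical values chosen for the example, not data from the paper.

```python
def phi(p_ii, p_row, p_col, r):
    """Coefficient (5): p_ii / (r * p_{i+} + (1 - r) * p_{+i})."""
    return p_ii / (r * p_row + (1 - r) * p_col)


# Hypothetical proportions for one category (assumed for illustration only).
p_ii, p_row, p_col = 0.20, 0.25, 0.40

sensitivity = phi(p_ii, p_row, p_col, 1.0)  # r = 1: p_ii / p_{i+}
ppv = phi(p_ii, p_row, p_col, 0.0)          # r = 0: p_ii / p_{+i}
dice = phi(p_ii, p_row, p_col, 0.5)         # r = 1/2: 2 p_ii / (p_{i+} + p_{+i})
print(sensitivity, ppv, dice)
```

Because $p_{ii}$ cannot exceed either marginal total, each of these values is at most 1, in line with Lemma 2.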
Example 3.
Let r,s∈[0,1] be weights and consider the function
$$\psi_i(r,s) = \frac{p_{ii} + s\,(1 - p_{i+} - p_{+i})}{r\,p_{i+} + (1-r)\,p_{+i} + s\,(1 - p_{i+} - p_{+i})}. \tag{7}$$
This two-parameter family was first studied in Warrens [24]. Note that ψi(r,0)=ϕi(r); that is, if s=0 we obtain the functions from Example 1. Since ϕi(r)≤1 for all r (Lemma 2), we also have ψi(r,s)≤1 for all r,s. Furthermore, we can write ψi(r,s) as (pii+λi)/μi, where
$$\lambda_i = s\,(1 - p_{i+} - p_{+i}), \tag{8}$$
$$\mu_i = r\,p_{i+} + (1-r)\,p_{+i} + s\,(1 - p_{i+} - p_{+i}). \tag{9}$$
Hence, ψi(r,s)∈Li for all r,s∈[0,1]. Several additional coefficients from the literature are special cases of ψi(r,s). Coefficient ψi(1/2,1/2) is the observed agreement of the collapsed 2×2 table associated with category i, while coefficients ψi(0,1) and ψi(1,1) are, respectively, the specificity and negative predictive value of category i.
Example 4.
For measuring validity in a 2×2 study, Bloch and Kraemer [26] proposed the weighted kappa coefficient. The coefficient is based on an acknowledgment that the clinical consequences of a false negative may be quite different from the clinical consequences of a false positive. A false negative may delay treatment of a patient, while a false positive may result in unnecessary treatment. The Bloch-Kraemer weighted kappa is unique in that it requires that a real number r∈[0,1] must be specified a priori indicating the relative importance of the false negatives to the false positives. For category i the weighted kappa is defined as [26, page 273]:
$$\kappa_i(r) = \frac{p_{ii} - p_{i+}p_{+i}}{r\,p_{i+}(1 - p_{+i}) + (1-r)(1 - p_{i+})\,p_{+i}}. \tag{10}$$
For all $r$, coefficient $\kappa_i(r)$ can be used in the context of the utility of association [26]. Coefficient (10) is an asymmetric special case of the weighted kappa proposed in Cohen [27]. The latter weighted kappa is widely used with agreement tables with three or more ordinal categories [28–30].
Coefficient κi(1/2) is the ordinary Cohen’s kappa for the 2×2 table associated with category i. It is a standard tool in a 2×2 reliability study. It is sometimes called the reliability of category i [13, 14]. Coefficient κi(1) is the coefficient of conditional agreement proposed in Coleman [31] (see [32, page 367], and [33, page 397]). This coefficient can be used if one is interested in the agreement between the observers for those subjects which the first observer assigned to category i.
Since
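$$r\,p_{i+}(1 - p_{+i}) + (1-r)(1 - p_{i+})\,p_{+i} = r\,p_{i+} - r\,p_{i+}p_{+i} + (1-r)\,p_{+i} - p_{i+}p_{+i} + r\,p_{i+}p_{+i} = r\,p_{i+} + (1-r)\,p_{+i} - p_{i+}p_{+i}, \tag{11}$$
we can write (10) as
$$\kappa_i(r) = \frac{p_{ii} - p_{i+}p_{+i}}{r\,p_{i+} + (1-r)\,p_{+i} - p_{i+}p_{+i}}. \tag{12}$$
We can write (12) as $(p_{ii} + \lambda_i)/\mu_i$, where
$$\lambda_i = -p_{i+}p_{+i}, \tag{13a}$$
$$\mu_i = r\,p_{i+} + (1-r)\,p_{+i} - p_{i+}p_{+i}. \tag{13b}$$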
Hence, κi(r)∈Li for all r∈[0,1].
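The equivalence of (10) and (12) can be checked numerically. The sketch below is a minimal Python illustration with function names of our own choosing. It uses the Bipolar disorder cell and marginal counts from Table 2 (25, 35 and 46 out of 223) and compares the results with the values listed for that category in Table 3.

```python
def bk_kappa_10(p_ii, p_row, p_col, r):
    """Bloch-Kraemer weighted kappa in the original form (10)."""
    num = p_ii - p_row * p_col
    den = r * p_row * (1 - p_col) + (1 - r) * (1 - p_row) * p_col
    return num / den


def bk_kappa_12(p_ii, p_row, p_col, r):
    """Equivalent form (12), obtained with identity (11)."""
    return (p_ii - p_row * p_col) / (r * p_row + (1 - r) * p_col - p_row * p_col)


n = 223
p_ii, p_row, p_col = 25 / n, 35 / n, 46 / n   # Bipolar disorder, Table 2
weights = (0, 1 / 3, 1 / 2, 2 / 3, 1)
values = [bk_kappa_10(p_ii, p_row, p_col, r) for r in weights]
for r, value in zip(weights, values):
    print(round(r, 3), round(value, 3))
```

Rounded to three decimals, the five values agree with the Bipolar disorder column of Table 3.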
Example 5.
For the 2×2 table associated with category i, the intraclass kappa [26, page 276] can be defined as
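$$\pi_i = \frac{p_{ii} - \bigl((p_{i+} + p_{+i})/2\bigr)^2}{(p_{i+} + p_{+i})/2 - \bigl((p_{i+} + p_{+i})/2\bigr)^2}. \tag{14}$$
The letter $\pi$ was originally used by Scott [9]. Bloch and Kraemer [26] showed that this coefficient can be used in the context of agreement. The intraclass kappa satisfies the classical definition of reliability [15, 18]. We can write (14) as $(p_{ii} + \lambda_i)/\mu_i$, where
$$\lambda_i = -\Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2, \tag{15a}$$
$$\mu_i = \frac{p_{i+} + p_{+i}}{2} - \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2. \tag{15b}$$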
Hence, πi∈Li.
2.3. Examples of Overall Coefficients
Coefficients in the sets Li for i∈{1,2,…,m} only describe the information of one category at a time. Other association coefficients summarize the information in all categories at once. Let
$$\lambda = \lambda(p_{1+}, \ldots, p_{m+}, p_{+1}, \ldots, p_{+m}), \qquad \mu = \mu(p_{1+}, \ldots, p_{m+}, p_{+1}, \ldots, p_{+m}) \tag{16}$$
be functions of the marginal totals and define the set
$$L = \Bigl\{ A : M \to \mathbb{R} \;\Big|\; A = \frac{\sum_i p_{ii} + \lambda}{\mu},\ A \le 1 \Bigr\}. \tag{17}$$
Given fixed marginal totals the coefficient space $L$ consists of all linear transformations of the overall observed agreement $\sum_i p_{ii}$. Clearly, $\sum_i p_{ii}$ is an element of $L$. Other examples are Cohen’s kappa and Scott’s pi. The population value of Cohen’s kappa is defined as [34]
$$\kappa = \frac{\sum_i p_{ii} - \sum_i p_{i+}p_{+i}}{1 - \sum_i p_{i+}p_{+i}}. \tag{18}$$
The numerator of kappa is the difference between the actual probability of agreement and the probability of agreement in the case of statistical independence of the ratings. The denominator of kappa is the maximum possible value of the numerator. Kappa has value 1 when there is perfect agreement between the observers, 0 when agreement is equal to that expected by chance, and a negative value when agreement is less than that expected by chance. We can write kappa as $(\sum_i p_{ii} + \lambda)/\mu$, where
$$\lambda = -\sum_i p_{i+}p_{+i}, \tag{19a}$$
$$\mu = 1 - \sum_i p_{i+}p_{+i}. \tag{19b}$$
The population value of Scott’s pi is defined as [2, 9, 11]
$$\pi = \frac{\sum_i p_{ii} - \sum_i \bigl((p_{i+} + p_{+i})/2\bigr)^2}{1 - \sum_i \bigl((p_{i+} + p_{+i})/2\bigr)^2}. \tag{20}$$
The differences in the definitions of agreement under chance are discussed in Examples 9 and 10 in the next section. We always have the inequality κ≥π [3].
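For concreteness, formulas (18) and (20) can be evaluated with the short Python sketch below. The helper names are our own, and the counts are those of the Fennig et al. table reported in Table 2. The sketch only illustrates the two definitions of chance agreement and is not a reference implementation.

```python
def proportions(table):
    """Return observed agreement and the two marginal distributions."""
    m = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) / n for i in range(m)]                        # p_{i+}
    cols = [sum(table[i][j] for i in range(m)) / n for j in range(m)]   # p_{+j}
    observed = sum(table[i][i] for i in range(m)) / n
    return observed, rows, cols


def cohen_kappa(table):
    """Formula (18): chance agreement uses both marginal distributions."""
    observed, rows, cols = proportions(table)
    chance = sum(r * c for r, c in zip(rows, cols))
    return (observed - chance) / (1 - chance)


def scott_pi(table):
    """Formula (20): chance agreement uses the averaged marginals."""
    observed, rows, cols = proportions(table)
    chance = sum(((r + c) / 2) ** 2 for r, c in zip(rows, cols))
    return (observed - chance) / (1 - chance)


table2 = [
    [40, 6, 4, 15],
    [4, 25, 1, 5],
    [4, 2, 21, 9],
    [17, 13, 12, 45],
]
kappa = cohen_kappa(table2)
pi = scott_pi(table2)
print(round(kappa, 3), round(pi, 3))
```

For these data kappa is slightly larger than pi, in agreement with the inequality $\kappa \ge \pi$.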
3. Correction for Chance
In this section we define the correction for chance function. The expectation $E(A)$ of a coefficient $A$ is taken conditionally on fixed marginal totals. The correction for chance function is denoted by $C$. For $A \in L_i$ it is defined as
$$C : L_i \to L_i, \qquad A \mapsto \frac{A - E(A)}{1 - E(A)}. \tag{21}$$
For an association coefficient $A \in L$ the correction for chance function is defined as
$$C : L \to L, \qquad A \mapsto \frac{A - E(A)}{1 - E(A)}. \tag{22}$$
The short formula is in both cases given by [3, 22, 35]
$$C(A) = \frac{A - E(A)}{1 - E(A)}. \tag{23}$$
We assume in (23) that E(A)<1 to avoid indeterminacy. Lemma 6 presents an alternative expression for C(A) if A∈Li.
Lemma 6.
Let A∈Li with A=(pii+λi)/μi. One has
$$C(A) = \frac{p_{ii} - E(p_{ii})}{\mu_i - \lambda_i - E(p_{ii})}. \tag{24}$$
Proof.
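Let $A \in L_i$ with $A = (p_{ii} + \lambda_i)/\mu_i$. Since $E$ is a linear operator and $\lambda_i$ and $\mu_i$ are fixed given the marginal totals, we have
$$E(A) = \frac{E(p_{ii}) + \lambda_i}{\mu_i}. \tag{25}$$
Substituting $A$ and the expression for $E(A)$ in (25) into (23), and multiplying the numerator and denominator of the result by $\mu_i$, we obtain the expression in (24).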
Lemma 7 presents an alternative expression for C(A) if A∈L. The proof of Lemma 7 is similar to the proof of Lemma 6.
Lemma 7.
Let A∈L with A=(∑ipii+λ)/μ. One has
$$C(A) = \frac{\sum_i \bigl(p_{ii} - E(p_{ii})\bigr)}{\mu - \lambda - \sum_i E(p_{ii})}. \tag{26}$$
The function C is a map from Li to Li if Li is closed under C. Lemma 8 shows that this is the case.
Lemma 8.
The spaces Li and L are closed under C.
Proof.
We present the proof for A∈Li only. The proof for A∈L follows from using similar arguments.
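Let $A \in L_i$ with $A = (p_{ii} + \lambda_i)/\mu_i$. The formula for $C(A)$ is presented in (24). Since $E(p_{ii})$ is a function of the marginal totals $p_{i+}$ and $p_{+i}$ we can write $C(A)$ as $(p_{ii} + \lambda_i^*)/\mu_i^*$, where
$$\lambda_i^* = -E(p_{ii}), \tag{27a}$$
$$\mu_i^* = \mu_i - \lambda_i - E(p_{ii}). \tag{27b}$$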
Hence, C(A)∈Li, and the result follows.
Formula (24) shows that elements of Li coincide after correction for chance if they have the same difference μi-λi, regardless of the choice of E(pii). This suggests the following definition. Two coefficients A1,A2∈Li are said to be equivalent with respect to (24), denoted by A1~A2, if they have the same difference μi-λi. It can be shown that ~ is an equivalence relation on Li. The equivalence relation ~ divides the elements of Li into equivalence classes, one class for each value of the difference μi-λi.
Different definitions of E(pii) provide different versions of the correction for chance formula. We consider two examples of E(pii). Additional examples can be found in [2, 3, 11, 22].
Example 9.
The expected value of pii under statistical independence is given by
$$E(p_{ii}) = p_{i+}p_{+i}. \tag{28}$$
In this case we assume that the data are a product of chance concerning two different frequency distributions.
Example 10.
Alternatively, we may assume that the data are a product of chance concerning a single frequency distribution [9, 11]. The common parameter is usually estimated by the arithmetic mean of the marginal totals $p_{i+}$ and $p_{+i}$. Hence, in this case we have
$$E(p_{ii}) = \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2. \tag{29}$$
Lemma 11 presents an application of the correction for chance function. In Lemma 11 the function is combined with Example 9. The result shows how the functions in Examples 1 and 3 are related to the function κi(r) in Example 4.
Lemma 11.
Assume (28) holds. Then C(ψi(r,s))=κi(r) for all r and s.
Proof.
Using λi and μi in (8) and (9) we have
$$\mu_i - \lambda_i = r\,p_{i+} + (1-r)\,p_{+i}. \tag{30}$$
Using (28) and (30) in (24) we obtain κi(r) in (12).
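Lemma 11 can also be verified numerically. The Python sketch below applies formula (23) to the two-parameter family (7), with $E(A)$ obtained by replacing $p_{ii}$ with its expectation under independence, and compares the result with (12). The function names and the proportions are hypothetical choices made for this illustration.

```python
def psi(p_ii, p_row, p_col, r, s):
    """Two-parameter family (7)."""
    extra = s * (1 - p_row - p_col)
    return (p_ii + extra) / (r * p_row + (1 - r) * p_col + extra)


def correct_for_chance(value, expected):
    """Formula (23): (A - E(A)) / (1 - E(A))."""
    return (value - expected) / (1 - expected)


def bk_kappa(p_ii, p_row, p_col, r):
    """Bloch-Kraemer weighted kappa in the form (12)."""
    return (p_ii - p_row * p_col) / (r * p_row + (1 - r) * p_col - p_row * p_col)


# Hypothetical proportions for one category (assumed for illustration only).
p_ii, p_row, p_col = 0.20, 0.25, 0.40
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

max_gap = 0.0
for r in grid:
    for s in grid:
        a = psi(p_ii, p_row, p_col, r, s)
        # E(A) under (28): substitute E(p_ii) = p_{i+} p_{+i} for p_ii.
        e = psi(p_row * p_col, p_row, p_col, r, s)
        gap = abs(correct_for_chance(a, e) - bk_kappa(p_ii, p_row, p_col, r))
        max_gap = max(max_gap, gap)
print(max_gap)
```

The largest gap over the grid is at the level of floating-point rounding, so the value of $s$ has no effect after correction for chance.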
4. Averaging over Categories
In this section we define a function that connects the association coefficients in the coefficient spaces L1,L2,…,Lm to the coefficients in the space L. For i∈{1,2,…,m} let Ai∈Li with Ai=(pii+λi)/μi. For these m coefficients we define the function
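$$W : L_1 \times L_2 \times \cdots \times L_m \to L, \qquad (A_1, A_2, \ldots, A_m) \mapsto \frac{\sum_i (p_{ii} + \lambda_i)}{\sum_i \mu_i}, \tag{31}$$
or
$$W(A_1, A_2, \ldots, A_m) = \frac{\sum_i (p_{ii} + \lambda_i)}{\sum_i \mu_i}. \tag{32}$$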
Thus, W(A1,A2,…,Am) is the weighted average of the Ai using the denominators μi of the Ai as weights. This weighted average is similar to the arithmetic mean of the category coefficients. In the calculation of the arithmetic mean each category coefficient contributes equally to the final average. In the calculation of W some category coefficients contribute more than others. We check whether function (32) is well-defined.
Lemma 12.
Function (32) is well-defined.
Proof.
It must be shown that
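$$A = \frac{\sum_i p_{ii} + \sum_i \lambda_i}{\sum_i \mu_i} \tag{33}$$
is an element of $L$. Since $\lambda_i$ and $\mu_i$ each are functions of the marginal totals $p_{i+}$ and $p_{+i}$, the sums $\sum_i \lambda_i$ and $\sum_i \mu_i$ are also functions of the marginal totals. Hence, we can write $A = (\sum_i p_{ii} + \sum_i \lambda_i)/\sum_i \mu_i$ as $A = (\sum_i p_{ii} + \lambda)/\mu$, where
$$\lambda = \sum_i \lambda_i, \tag{34a}$$
$$\mu = \sum_i \mu_i, \tag{34b}$$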
from which the result follows.
In the remainder of this section we consider some results associated with the weighted average function in (32). If we fix r, then (5) provides m association coefficients for describing the agreement between the observers, one for each category. Lemma 13 shows that a weighted average of these coefficients is equivalent to the overall observed agreement ∑ipii, regardless of the value of r.
Lemma 13.
Let r∈[0,1] be fixed. One has
$$W(\phi_1(r), \phi_2(r), \ldots, \phi_m(r)) = \sum_i p_{ii}. \tag{35}$$
Proof.
The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (6a) and (6b) we have
$$\sum_i (p_{ii} + \lambda_i) = \sum_i p_{ii}, \tag{36}$$
and, using identity (2),
$$\sum_i \mu_i = r \sum_i p_{i+} + (1-r) \sum_i p_{+i} = r + 1 - r = 1. \tag{37}$$
If we fix r, then (12) provides us with m Bloch-Kraemer weighted kappas for describing the agreement between the observers, one for each category. Lemma 14 shows that a weighted average of these coefficients is equivalent to Cohen’s kappa in (18), regardless of our choice of r.
Lemma 14.
Let r∈[0,1] be fixed. One has
$$W(\kappa_1(r), \kappa_2(r), \ldots, \kappa_m(r)) = \kappa. \tag{38}$$
Proof.
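The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (13a) and (13b) we have
$$\sum_i (p_{ii} + \lambda_i) = \sum_i (p_{ii} - p_{i+}p_{+i}), \tag{39}$$
which is the numerator of $\kappa$, and, using identity (2),
$$\sum_i \mu_i = \sum_i \bigl(r\,p_{i+} + (1-r)\,p_{+i} - p_{i+}p_{+i}\bigr) = r \sum_i p_{i+} + (1-r) \sum_i p_{+i} - \sum_i p_{i+}p_{+i} = 1 - \sum_i p_{i+}p_{+i}, \tag{40}$$
which is the denominator of $\kappa$.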
Lemma 15 shows that if we apply W to the intraclass kappas πi in Example 5 then we obtain Scott’s pi.
Lemma 15.
One has
$$W(\pi_1, \pi_2, \ldots, \pi_m) = \pi. \tag{41}$$
Proof.
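The formula of $W$ is presented in (32). Using $\lambda_i$ and $\mu_i$ in (15a) and (15b) we have
$$\sum_i (p_{ii} + \lambda_i) = \sum_i p_{ii} - \sum_i \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2, \tag{42}$$
which is the numerator of $\pi$, and, using identity (2),
$$\sum_i \mu_i = \sum_i \frac{p_{i+} + p_{+i}}{2} - \sum_i \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2 = 1 - \sum_i \Bigl(\frac{p_{i+} + p_{+i}}{2}\Bigr)^2, \tag{43}$$
which is the denominator of $\pi$.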
5. Composition of Functions
In Sections 3 and 4 we studied the correction for chance function and the weighted average function separately. In this section we study the composition of the two functions. Lemma 16 shows that the two functions commute. Hence, changing the order of the functions does not change the result.
Lemma 16.
For i∈{1,2,…,m} let Ai∈Li with Ai=(pii+λi)/μi. One has
$$W(C(A_1), C(A_2), \ldots, C(A_m)) = C(W(A_1, A_2, \ldots, A_m)). \tag{44}$$
Proof.
We will show that both compositions are equivalent to
$$\frac{\sum_i \bigl(p_{ii} - E(p_{ii})\bigr)}{\sum_i (\mu_i - \lambda_i) - \sum_i E(p_{ii})}. \tag{45}$$
The formula for the C(Ai) is presented in (24). Adding the numerators of (24) we obtain the numerator of (45) and adding the denominators of (24) we obtain the denominator of (45). Hence, W(C(A1),C(A2),…,C(Am)) is equivalent to (45).
The formula for W(A1,A2,…,Am) is presented in (32). The coefficient can be written as (∑ipii+λ)/μ, where λ and μ are presented in (34a) and (34b). Using this λ and μ in (26) we also obtain (45).
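The commutation in Lemma 16 can be illustrated numerically. The Python sketch below is a hedged example under the following assumptions: the category coefficients are taken from the family (7) with arbitrarily chosen weights $r = 0.3$ and $s = 0.7$, the expectation is the one under independence in (28), and the counts are those of Table 2. All variable names are our own.

```python
table2 = [
    [40, 6, 4, 15],
    [4, 25, 1, 5],
    [4, 2, 21, 9],
    [17, 13, 12, 45],
]
m = len(table2)
n = sum(sum(row) for row in table2)
p = [[x / n for x in row] for row in table2]
rows = [sum(p[i]) for i in range(m)]                        # p_{i+}
cols = [sum(p[i][j] for i in range(m)) for j in range(m)]   # p_{+j}
e_pii = [rows[i] * cols[i] for i in range(m)]               # E(p_ii), (28)


def C(a, e):
    """Correction for chance, formula (23)."""
    return (a - e) / (1 - e)


r, s = 0.3, 0.7  # arbitrary illustrative weights
lam = [s * (1 - rows[i] - cols[i]) for i in range(m)]            # (8)
mu = [r * rows[i] + (1 - r) * cols[i] + lam[i] for i in range(m)]  # (9)

# Route 1: correct every category coefficient, then average with W.
corrected = [
    C((p[i][i] + lam[i]) / mu[i], (e_pii[i] + lam[i]) / mu[i]) for i in range(m)
]
weights = [mu[i] - lam[i] - e_pii[i] for i in range(m)]  # denominators (27b)
w_of_c = sum(w * c for w, c in zip(weights, corrected)) / sum(weights)

# Route 2: average first with W, then correct the overall coefficient.
overall = (sum(p[i][i] for i in range(m)) + sum(lam)) / sum(mu)
overall_expected = (sum(e_pii) + sum(lam)) / sum(mu)
c_of_w = C(overall, overall_expected)

print(round(w_of_c, 4), round(c_of_w, 4))
```

Both routes give the same number, which for this choice of expectation is the overall Cohen's kappa of the table.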
Lemma 16 shows that we can either take the average of the chance-corrected versions of coefficients A1,A2,…,Am or take a weighted average of coefficients and then correct the overall coefficient for agreement due to chance. The result will be the same. Coefficient (45) contains two quantities that must be specified, namely, the expectation E(pii) and the sum of the differences μi-λi. Using, for fixed r, λi and μi in (6a) and (6b), (8) and (9), (13a) and (13b), or (15a) and (15b) we obtain
$$\sum_i (\mu_i - \lambda_i) = 1. \tag{46}$$
Identity (46) shows that all coefficients discussed in Section 2 belong to a specific family of linear transformations. An example of a coefficient that does not belong to this family is the phi coefficient in (50). For other examples, see [22].
Using identity (46) in (45) we obtain the overall coefficient
$$\frac{\sum_i \bigl(p_{ii} - E(p_{ii})\bigr)}{1 - \sum_i E(p_{ii})}. \tag{47}$$
If we use E(pii) in (28) in (47) we obtain Cohen’s kappa, whereas if we use E(pii) in (29) in (47) we obtain Scott’s pi. The overall kappa is not a weighted average of phi coefficients.
6. A Numerical Illustration
In this section we present a numerical illustration of Lemma 14, which shows that for fixed $r$ Cohen’s kappa is a weighted average of the Bloch-Kraemer weighted kappas associated with each category. Let $n_{ij}$ denote the observed number of subjects that are classified into category $i$ by the first observer and into category $j$ by the second observer. Assuming a multinomial sampling model with the total number of subjects $n$ fixed, the maximum likelihood estimate of the cell probability $p_{ij}$ is given by $\hat{p}_{ij} = n_{ij}/n$. We obtain the maximum likelihood estimates $\hat{\kappa}_i(r)$ and $\hat{\kappa}$ by replacing the cell probabilities $p_{ij}$ by the $\hat{p}_{ij}$ in the Bloch-Kraemer weighted kappas in (12) and Cohen’s kappa in (18) [33, page 396]. Let
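$$\theta_1 = \sum_{i=1}^{m} \hat{p}_{ii}, \qquad \theta_2 = \sum_{i=1}^{m} \hat{p}_{i+}\hat{p}_{+i}, \qquad \theta_3 = \sum_{i=1}^{m} \hat{p}_{ii}\,(\hat{p}_{i+} + \hat{p}_{+i}), \qquad \theta_4 = \sum_{i=1}^{m}\sum_{j=1}^{m} \hat{p}_{ij}\,(\hat{p}_{+i} + \hat{p}_{j+})^2. \tag{48}$$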
The approximate large sample variance of κ^ [33, 34, 36] is given by
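$$\sigma^2(\hat{\kappa}) = \frac{1}{n}\left[\frac{\theta_1(1-\theta_1)}{(1-\theta_2)^2} + \frac{2(1-\theta_1)(2\theta_1\theta_2 - \theta_3)}{(1-\theta_2)^3} + \frac{(1-\theta_1)^2(\theta_4 - 4\theta_2^2)}{(1-\theta_2)^4}\right]. \tag{49}$$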
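The quantities in (48) and the variance in (49) can be sketched in Python as follows. Two assumptions are made here and should be read as such: the bracket in (49) is taken to contain three terms (with denominators $(1-\theta_2)^2$, $(1-\theta_2)^3$ and $(1-\theta_2)^4$, each appearing once), and the confidence interval is a normal-approximation interval with $z = 1.96$. The function name `kappa_with_ci` is our own, and the counts are those of Table 2.

```python
from math import sqrt


def kappa_with_ci(table, z=1.96):
    """Estimate Cohen's kappa with a large-sample interval based on (48)-(49)."""
    m = len(table)
    n = sum(sum(row) for row in table)
    p = [[x / n for x in row] for row in table]
    rows = [sum(p[i]) for i in range(m)]                        # p_{i+}
    cols = [sum(p[i][j] for i in range(m)) for j in range(m)]   # p_{+j}

    t1 = sum(p[i][i] for i in range(m))
    t2 = sum(rows[i] * cols[i] for i in range(m))
    t3 = sum(p[i][i] * (rows[i] + cols[i]) for i in range(m))
    t4 = sum(
        p[i][j] * (cols[i] + rows[j]) ** 2 for i in range(m) for j in range(m)
    )

    kappa = (t1 - t2) / (1 - t2)
    variance = (
        t1 * (1 - t1) / (1 - t2) ** 2
        + 2 * (1 - t1) * (2 * t1 * t2 - t3) / (1 - t2) ** 3
        + (1 - t1) ** 2 * (t4 - 4 * t2 ** 2) / (1 - t2) ** 4
    ) / n
    half_width = z * sqrt(variance)   # assumes a normal approximation
    return kappa, kappa - half_width, kappa + half_width


table2 = [
    [40, 6, 4, 15],
    [4, 25, 1, 5],
    [4, 2, 21, 9],
    [17, 13, 12, 45],
]
kappa, lower, upper = kappa_with_ci(table2)
print(round(kappa, 3), round(lower, 3), round(upper, 3))
```

Under these assumptions the estimate and interval are close to the values reported in the text for the Fennig et al. data.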
The product-moment correlation coefficient or phi coefficient for the 2×2 table associated with category i is given by
$$\rho_i = \frac{\hat{p}_{ii} - \hat{p}_{i+}\hat{p}_{+i}}{\sqrt{\hat{p}_{i+}(1 - \hat{p}_{i+})\,\hat{p}_{+i}(1 - \hat{p}_{+i})}}. \tag{50}$$
The asymptotic variance [26, page 279] of κ^i(r) is given by
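$$\sigma^2(\hat{\kappa}_i(r)) = \frac{\hat{p}_{i+}(1 - \hat{p}_{i+})\,\hat{p}_{+i}(1 - \hat{p}_{+i})}{n\,\bigl[r\,\hat{p}_{i+}(1 - \hat{p}_{+i}) + (1-r)(1 - \hat{p}_{i+})\,\hat{p}_{+i}\bigr]^2} \cdot V, \tag{51}$$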
where
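$$V = 1 + 4U_{i+}U_{+i}\rho_i - (1 + 3U_{i+}^2 + 3U_{+i}^2)\rho_i^2 + 2U_{i+}U_{+i}\rho_i^3, \qquad U_{i+} = \frac{(1/2) - \hat{p}_{i+}}{\sqrt{\hat{p}_{i+}(1 - \hat{p}_{i+})}}, \qquad U_{+i} = \frac{(1/2) - \hat{p}_{+i}}{\sqrt{\hat{p}_{+i}(1 - \hat{p}_{+i})}}. \tag{52}$$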
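A Python sketch of (50)–(52) is given below. It rests on assumptions that the reader should check against [26]: the denominators of $\rho_i$, $U_{i+}$ and $U_{+i}$ are read as square roots, and the interval is a normal-approximation interval with $z = 1.96$. The function name `bk_kappa_ci` is hypothetical. The example uses the Schizophrenia counts from Table 2 (40, 65 and 65 out of 223).

```python
from math import sqrt


def bk_kappa_ci(p_ii, p_row, p_col, r, n, z=1.96):
    """Bloch-Kraemer weighted kappa with an interval based on (50)-(52)."""
    q_row, q_col = 1 - p_row, 1 - p_col
    spread = p_row * q_row * p_col * q_col
    den = r * p_row * q_col + (1 - r) * q_row * p_col
    kappa = (p_ii - p_row * p_col) / den

    rho = (p_ii - p_row * p_col) / sqrt(spread)   # phi coefficient, (50)
    u_row = (0.5 - p_row) / sqrt(p_row * q_row)   # U_{i+}, square root assumed
    u_col = (0.5 - p_col) / sqrt(p_col * q_col)   # U_{+i}, square root assumed
    v = (
        1
        + 4 * u_row * u_col * rho
        - (1 + 3 * u_row ** 2 + 3 * u_col ** 2) * rho ** 2
        + 2 * u_row * u_col * rho ** 3
    )
    variance = spread / (n * den ** 2) * v        # (51)
    half_width = z * sqrt(variance)
    return kappa, kappa - half_width, kappa + half_width


n = 223
kappa_s, lower_s, upper_s = bk_kappa_ci(40 / n, 65 / n, 65 / n, 0.5, n)
print(round(kappa_s, 3), round(lower_s, 3), round(upper_s, 3))
```

With equal marginal totals the result does not depend on $r$, and under the stated assumptions the interval is close to the one listed for Schizophrenia in Table 3.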
To illustrate Lemma 14 we consider the data in Table 2 taken from Fennig et al. [20]. These authors investigated the accuracy of clinical diagnosis in psychotic patients. As a gold standard they used the ratings of two project psychiatrists, called the research diagnosis. Table 2 presents the cross-classification of the research and clinical diagnoses. The estimate of the overall kappa for these data is κ^=0.432 with 95% confidence interval (0.341–0.522), indicating a moderate overall level of agreement. Table 3 presents the estimates of the Bloch-Kraemer weighted kappas for the four categories, labeled S, B, D, and O, for five distinct values of r. The table also presents the associated 95% confidence intervals between parentheses.
Table 2: Research and clinical diagnoses of disorders in 223 psychotic patients [20].

| Research diagnosis \ Clinical diagnosis | Schizophrenia | Bipolar disorder | Depression | Other | Total |
|---|---|---|---|---|---|
| Schizophrenia | 40 | 6 | 4 | 15 | 65 |
| Bipolar disorder | 4 | 25 | 1 | 5 | 35 |
| Depression | 4 | 2 | 21 | 9 | 36 |
| Other | 17 | 13 | 12 | 45 | 87 |
| Total | 65 | 46 | 38 | 74 | 223 |
Table 3: Bloch-Kraemer weighted kappas for categories Schizophrenia (S), Bipolar disorder (B), Depression (D), and Other (O), for $r \in \{0, 1/3, 1/2, 2/3, 1\}$.

| $r$ | $\hat{\kappa}_S(r)$ | 95% CI | $\hat{\kappa}_B(r)$ | 95% CI | $\hat{\kappa}_D(r)$ | 95% CI | $\hat{\kappa}_O(r)$ | 95% CI |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.457 | (0.330–0.585) | 0.458 | (0.339–0.578) | 0.467 | (0.318–0.616) | 0.357 | (0.213–0.503) |
| 1/3 | 0.457 | (0.330–0.585) | 0.506 | (0.375–0.639) | 0.476 | (0.325–0.629) | 0.326 | (0.194–0.459) |
| 1/2 | 0.457 | (0.330–0.585) | 0.534 | (0.396–0.674) | 0.482 | (0.328–0.636) | 0.312 | (0.186–0.440) |
| 2/3 | 0.457 | (0.330–0.585) | 0.565 | (0.419–0.713) | 0.487 | (0.332–0.643) | 0.300 | (0.178–0.422) |
| 1 | 0.457 | (0.330–0.585) | 0.640 | (0.474–0.807) | 0.498 | (0.339–0.657) | 0.277 | (0.165–0.390) |
The statistics for category Schizophrenia in Table 3 are equivalent for all values of r because p^S+=p^+S=65/223=0.291. We have κ^S(r)=0.457 with 95% confidence interval (0.330–0.585), indicating a moderate level of agreement on Schizophrenia. The level of agreement on the other categories depends on the value of r. The agreement on categories Bipolar disorder and Depression is higher than that of Schizophrenia for all values of r, while the agreement on category Other is lowest for all values of r. Finally, recall that, for fixed r, the overall kappa is a weighted average of the Bloch-Kraemer weighted kappas. For example, for r=0 we have
$$\frac{(0.207)(0.457) + (0.174)(0.458) + (0.143)(0.467) + (0.202)(0.357)}{0.207 + 0.174 + 0.143 + 0.202} = 0.432, \tag{53}$$
and for r=2/3 we have
$$\frac{(0.207)(0.457) + (0.141)(0.565) + (0.137)(0.487) + (0.241)(0.300)}{0.207 + 0.141 + 0.137 + 0.241} = 0.432. \tag{54}$$
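The weighted averages in (53) and (54) can be reproduced from the counts in Table 2. In the Python sketch below the weights are the denominators $\mu_i$ of the category kappas, as in (32). The function name is our own, and the code is meant as an illustration of Lemma 14 rather than a general-purpose routine.

```python
def weighted_average_of_category_kappas(table, r):
    """Return the weights mu_i, the category kappas (10) and their W-average."""
    m = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) / n for i in range(m)]                        # p_{i+}
    cols = [sum(table[i][j] for i in range(m)) / n for j in range(m)]   # p_{+j}
    weights, kappas = [], []
    for i in range(m):
        mu = r * rows[i] * (1 - cols[i]) + (1 - r) * (1 - rows[i]) * cols[i]
        weights.append(mu)
        kappas.append((table[i][i] / n - rows[i] * cols[i]) / mu)
    average = sum(w * k for w, k in zip(weights, kappas)) / sum(weights)
    return weights, kappas, average


table2 = [
    [40, 6, 4, 15],
    [4, 25, 1, 5],
    [4, 2, 21, 9],
    [17, 13, 12, 45],
]
weights0, kappas0, average0 = weighted_average_of_category_kappas(table2, 0)
averages = [
    weighted_average_of_category_kappas(table2, r)[2]
    for r in (0, 1 / 3, 1 / 2, 2 / 3, 1)
]
print([round(w, 3) for w in weights0])
print([round(a, 3) for a in averages])
```

The weights for $r = 0$ match those used in (53), and the weighted average is the same for every $r$, as Lemma 14 states.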
The data in Tables 2 and 3 show that if we use the same category coefficients for all categories, then the coefficients in general produce different values. This observation holds for almost all real life data. Table 4 presents a hypothetical data set with three nominal categories. Table 5 presents the corresponding estimates of the Bloch-Kraemer weighted kappas for the three categories, labeled A, B, and C, for five distinct values of r and the associated 95% confidence intervals. The statistics for category B in Table 5 are equivalent for all values of r because p^B+=p^+B=120/174=0.690. The estimate of the overall kappa for these data is κ^=0.356 with 95% confidence interval (0.229–0.482). Furthermore, all the estimates of the category kappas κ^i(1/2) have the same value 0.356. Thus, in this hypothetical case the overall kappa is a perfect summary coefficient of the three category kappas. Due to Lemma 14, we know that the overall kappa also roughly summarizes the other Bloch-Kraemer weighted kappas. However, these weighted kappas have quite distinct values. These data illustrate that while the overall kappa is always a summary coefficient of all types of Bloch-Kraemer category kappas, it can be a perfect summary coefficient for a particular type of weighted kappas. On the contrary, while the overall kappa may summarize one type of category coefficients perfectly, it can still be a poor summary coefficient for other types of category coefficients.
Table 4: Hypothetical diagnoses of three disorders in 174 psychotic patients.

| Diagnosis I \ Diagnosis II | A | B | C | Total |
|---|---|---|---|---|
| Type A | 12 | 0 | 6 | 18 |
| Type B | 24 | 96 | 0 | 120 |
| Type C | 0 | 24 | 12 | 36 |
| Total | 36 | 120 | 18 | 174 |
Table 5: Bloch-Kraemer weighted kappas for categories A, B, and C, for $r \in \{0, 1/3, 1/2, 2/3, 1\}$.

| $r$ | $\hat{\kappa}_A(r)$ | 95% CI | $\hat{\kappa}_B(r)$ | 95% CI | $\hat{\kappa}_C(r)$ | 95% CI |
|---|---|---|---|---|---|---|
| 0 | 0.256 | (0.139–0.374) | 0.356 | (0.208–0.504) | 0.580 | (0.315–0.846) |
| 1/3 | 0.315 | (0.171–0.460) | 0.356 | (0.208–0.504) | 0.408 | (0.221–0.596) |
| 1/2 | 0.356 | (0.193–0.519) | 0.356 | (0.208–0.504) | 0.356 | (0.193–0.519) |
| 2/3 | 0.408 | (0.221–0.596) | 0.356 | (0.208–0.504) | 0.315 | (0.171–0.460) |
| 1 | 0.580 | (0.315–0.846) | 0.356 | (0.208–0.504) | 0.256 | (0.139–0.374) |
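The point estimates for the hypothetical data can be recomputed with the Python sketch below, which uses the counts of Table 4 and function names of our own choosing. It shows the two situations discussed above: at $r = 1/2$ all category kappas coincide with the overall kappa, while at $r = 0$ they differ and the overall kappa lies between their minimum and maximum.

```python
def cohen_kappa(table):
    """Overall kappa, formula (18)."""
    m = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) / n for i in range(m)]
    cols = [sum(table[i][j] for i in range(m)) / n for j in range(m)]
    observed = sum(table[i][i] for i in range(m)) / n
    chance = sum(a * b for a, b in zip(rows, cols))
    return (observed - chance) / (1 - chance)


def category_kappas(table, r):
    """Bloch-Kraemer weighted kappas (12), one per category."""
    m = len(table)
    n = sum(sum(row) for row in table)
    rows = [sum(table[i]) / n for i in range(m)]
    cols = [sum(table[i][j] for i in range(m)) / n for j in range(m)]
    return [
        (table[i][i] / n - rows[i] * cols[i])
        / (r * rows[i] + (1 - r) * cols[i] - rows[i] * cols[i])
        for i in range(m)
    ]


# Counts of the hypothetical 3 x 3 table (Table 4).
table4 = [
    [12, 0, 6],
    [24, 96, 0],
    [0, 24, 12],
]
overall = cohen_kappa(table4)
at_half = category_kappas(table4, 0.5)
at_zero = category_kappas(table4, 0.0)
print(round(overall, 3))
print([round(k, 3) for k in at_half])
print([round(k, 3) for k in at_zero])
```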
7. Conclusion
Cohen’s kappa is a commonly used association measure for summarizing agreement between two observers on a nominal scale. The coefficient reduces the ratings of the two observers to a single real number. In general, this leads to a substantial loss of information. A more complete picture of the interobserver agreement is obtained by assessing the degree of agreement on the individual categories [18]. There are various association coefficients that can be used to describe the information for each category separately. Examples are the sensitivity and specificity of a category, the positive predictive value, negative predictive value, and the Bloch-Kraemer weighted kappa. Once we have selected a category coefficient we have multiple coefficients describing the agreement between the observers, one for each category. If one is interested in a single number that roughly summarizes the agreement between the observers, what overall coefficient should be used? The results derived in this paper show that the overall observed agreement, Cohen’s kappa, and Scott’s pi are proper overall coefficients. Each coefficient is a weighted average of certain category coefficients and therefore its value lies somewhere between the minimum and maximum of the category coefficients. We enumerate some of the new interpretations that were found.
(i) Suppose each category coefficient is the same special case of the function in (5). Examples are the sensitivity, positive predictive value, and the Dice coefficient. Then the observed agreement is a weighted average of the category coefficients (Lemma 13).
(ii) Suppose that each category coefficient is the same Bloch-Kraemer weighted kappa in (12). Then Cohen’s kappa is a weighted average of the weighted kappas (Lemma 14).
(iii) Suppose that each category coefficient is the intraclass kappa in (14). Then Scott’s pi is a weighted average of the intraclass kappas (Lemma 15).
(iv) Suppose that the value of a coefficient under chance is the value under statistical independence. Furthermore, suppose that each category coefficient is the same special case of the general function in (7). Examples are the sensitivity, specificity, positive predictive value, negative predictive value, the observed agreement, and the Dice coefficient. Then Cohen’s kappa is both a weighted average of the chance-corrected category coefficients and a chance-corrected version of a weighted average of the category coefficients (Lemma 16).
An illustration of Lemma 14 was presented in Section 6. The lemmas presented in this paper show that there is an abundance of category coefficients of which the observed agreement and Cohen’s kappa are summary coefficients. The results provide a basis for using these overall coefficients if one is only interested in a single number that roughly summarizes the agreement between the observers. If, on the other hand, one is interested in understanding the patterns of agreement and disagreement, one can report various category coefficients for the individual categories or consider log-linear or latent class models that can be used to model the agreement [19].
Conflict of Interests
The author declares that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research is part of Veni Project 451-11-026 funded by The Netherlands Organisation for Scientific Research.
References
[1] Hsu L. M., Field R. Interrater agreement measures: comments on Kappan, Cohen's Kappa, Scott's π, and Aickin's α. 2003;2:205–219.
[2] Krippendorff K. Reliability in content analysis: some common misconceptions and recommendations. 2004;30(3):411–433. doi:10.1093/hcr/30.3.411
[3] Warrens M. J. Inequalities between kappa and kappa-like statistics for K x K tables. 2010;75(1):176–185. doi:10.1007/s11336-009-9138-8
[4] Cohen J. A coefficient of agreement for nominal scales. 1960;20:37–46. doi:10.1177/001316446002000104
[5] Hanley J. A. Standard error of the kappa statistic. 1987;102(2):315–321. doi:10.1037/0033-2909.102.2.315
[6] Maclure M., Willett W. C. Misinterpretation and misuse of the Kappa statistic. 1987;126(2):161–169.
[7] Warrens M. J. On the equivalence of Cohen's kappa and the Hubert-Arabie adjusted Rand index. 2008;25(2):177–183. doi:10.1007/s00357-008-9023-7
[8] Warrens M. J. Cohen's kappa can always be increased and decreased by combining categories. 2010;7(6):673–677. doi:10.1016/j.stamet.2010.05.003
[9] Scott W. A. Reliability of content analysis: the case of nominal scale coding. 1955;19(3):321–325. doi:10.1086/266577
[10] Krippendorff K. 2nd ed. Thousand Oaks, Calif, USA: Sage; 2004.
[11] Krippendorff K. Association, agreement, and equity. 1987;21(2):109–123. doi:10.1007/BF00167603
[12] Fleiss J. L. Measuring agreement between two judges on the presence or absence of a trait. 1975;31(3):651–659.
[13] Fleiss J. L. New York, NY, USA: Wiley; 1981.
[14] Fleiss J. L., Levin B., Paik M. C. 3rd ed. New York, NY, USA: John Wiley & Sons; 2003. doi:10.1002/0471445428
[15] Kraemer H. C. Ramifications of a population model for κ as a coefficient of reliability. 1979;44(4):461–472. doi:10.1007/BF02296208
[16] Vanbelle S., Albert A. Agreement between two independent groups of raters. 2009;74(3):477–491. doi:10.1007/s11336-009-9116-1
[17] Warrens M. J. Cohen's kappa is a weighted average. 2011;8(6):473–484. doi:10.1016/j.stamet.2011.06.002
[18] Kraemer H. C., Periyakoil V. S., Noda A. Kappa coefficients in medical research. 2002;21(14):2109–2129. doi:10.1002/sim.1180
[19] Agresti A. Modelling patterns of agreement and disagreement. 1992;1(2):201–218. doi:10.1177/096228029200100205
[20] Fennig S., Craig T. J., Tanenberg-Karant M., Bromet E. J. Comparison of facility and research diagnoses in first-admission psychotic patients. 1994;151(10):1423–1429.
[21] Baulieu F. B. A classification of presence/absence based dissimilarity coefficients. 1989;6(2):233–246. doi:10.1007/BF01908601
[22] Warrens M. J. On similarity coefficients for 2×2 tables and correction for chance. 2008;73(3):487–502. doi:10.1007/s11336-008-9059-y
[23] Warrens M. J. On association coefficients for 2×2 tables and properties that do not depend on the marginal distributions. 2008;73(4):777–789. doi:10.1007/s11336-008-9070-3
[24] Warrens M. J. Chance-corrected measures for 2 × 2 tables that coincide with weighted kappa. 2011;64(2):355–365. doi:10.1348/2044-8317.002001
[25] Dice L. R. Measures of the amount of ecologic association between species. 1945;26:297–302. doi:10.2307/1932409
[26] Bloch D. A., Kraemer H. C. 2 x 2 kappa coefficients: measures of agreement or association. 1989;45(1):269–287. doi:10.2307/2532052
[27] Cohen J. Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. 1968;70(4):213–220. doi:10.1037/h0026256
[28] Warrens M. J. Some paradoxical results for the quadratically weighted kappa. 2012;77(2):315–323. doi:10.1007/s11336-012-9258-4
[29] Warrens M. J. Equivalences of weighted kappas for multiple raters. 2012;9(3):407–422. doi:10.1016/j.stamet.2011.11.001
[30] Warrens M. J. Conditional inequalities between Cohen's kappa and weighted kappas. 2013;10:14–22. doi:10.1016/j.stamet.2012.05.004
[31] Coleman J. S. Measures of concordance or consensus between members of social groups. Johns Hopkins University, 1966.
[32] Light R. J. Measures of response agreement for qualitative data: some generalizations and alternatives. 1971;76(5):365–377. doi:10.1037/h0031643
[33] Bishop Y. M. M., Fienberg S. E., Holland P. W. Cambridge, UK: MIT Press; 1975.
[34] Agresti A. 2nd ed. Hoboken, NJ, USA: Wiley-Interscience; 2002. Wiley Series in Probability and Statistics. doi:10.1002/0471249688
[35] Albatineh A. N., Niewiadomska-Bugaj M., Mihalko D. On similarity indices and correction for chance agreement. 2006;23(2):301–313. doi:10.1007/s00357-006-0017-z
[36] Fleiss J. L., Cohen J., Everitt B. S. Large sample standard errors of kappa and weighted kappa. 1969;72(5):323–327. doi:10.1037/h0028106