Cohen’s kappa is a popular descriptive statistic for summarizing agreement between the classifications of two raters on a nominal scale. With m≥3 raters there are several views in the literature on how to define agreement. The concept of g-agreement (g∈{2,3,…,m}) refers to the situation in which it is decided that there is agreement if g out of m raters assign an object to the same category. Given m≥2 raters we can formulate m−1 multirater kappas, one based on 2-agreement, one based on 3-agreement, and so on, up to one based on m-agreement. It is shown that if the scale consists of only two categories the multirater kappas based on 2-agreement and 3-agreement are identical.
1. Introduction
In social sciences and medical research it is frequently required that a group of objects is rated on a nominal scale with two or more categories. The raters may be pathologists that rate the severity of lesions from scans, clinicians who classify children on asthma severity, or competing diagnostic devices that classify the extent of disease in patients. Because there is often no gold standard, analysis of the interrater data provides a useful means of assessing the reliability of the rating system. Therefore, researchers often require that the classification task is performed by m≥2 raters. A standard tool for the analysis of agreement in a reliability study with m=2 raters is Cohen’s kappa [5, 28, 34], denoted by κ [2, 12]. The value of Cohen’s κ is 1 when perfect agreement between the two raters occurs, 0 when agreement is equal to that expected under independence, and negative when agreement is less than expected by chance. A value ≥.60 may indicate good agreement, whereas a value ≥.80 may even indicate excellent agreement [4, 16]. A variety of extensions of Cohen’s κ have been developed [19]. These include kappas for groups of raters [24, 25], kappas for multiple raters [15, 29], and weighted kappas [26, 30, 31]. This paper focuses on kappas for m≥2 raters making judgments on a binary scale.
With multiple raters there are several views on how to define agreement [13, 21, 22]. One may decide that there is only agreement if all m raters assign a subject to the same category (see, e.g., [27]). This type of agreement is referred to as simultaneous agreement, m-agreement, or DeMoivre’s definition of agreement [13]. Since a single deviating rating of a subject already leads to the conclusion that there is no agreement with respect to that subject, m-agreement is especially useful when the researcher’s demands are extremely high [22]. Alternatively, a researcher may decide that there is already agreement if any two raters categorize an object consistently. In this case we speak of pairwise agreement or 2-agreement. Conger [6] argued that agreement among raters can be considered an arbitrary choice along a continuum ranging from 2-agreement to m-agreement. The concept of g-agreement with g∈{2,3,…,m} refers to the situation in which it is decided that there is agreement if g out of m raters assign an object to the same category [6].
Given m≥2 raters we can formulate m−1 multirater kappas, one based on 2-agreement, one based on 3-agreement, and so on, up to one based on m-agreement. Although all these kappas can be defined from a mathematical perspective, the multirater kappas in general produce different values (see, e.g., [32, 33]). The difficulty for a researcher is to decide which form of g-agreement should be used when one is looking for agreement between ratings of raters that are assumed to be equally skilled. Popping [22] notes that in a considerable part of the literature multirater kappas based on 2-agreement are used. Conger [6] notes that coefficients based on 3-agreement may be useful when the researcher’s demands are slightly higher. Stronger forms of g-agreement may in many practical situations be too demanding. However, it turns out that with ratings on a dichotomous scale the multirater kappas based on 2-agreement and 3-agreement are equivalent. This fact is proved in Section 3. First, Section 2 is used to introduce notation and present definitions of 2-, 3-, and 4-agreement. The multirater kappas and the main result are then presented in Section 3. Section 4 contains a discussion.
2. 2-, 3-, and 4-Agreement
In this section we consider quantities of g-agreement for g∈{2,3,4}. Suppose that m≥2 observers each rate the same set of n objects (individuals or observations) on a dichotomous scale. The two categories are labeled 0 and 1, meaning, for example, presence and absence of a trait or a symptom. So, the data consist of m binary variables $X_1,\dots,X_m$ of length n. Let $a,b,c,d\in\{0,1\}$, let $i,j,k,\ell\in\{1,2,\dots,m\}$, and let $f_i^a$ denote the number of times rater i used category a. Furthermore, let $f_{ij}^{ab}$ denote the number of times rater i assigned an object to category a while rater j assigned the same object to category b. The quantities $f_{ijk}^{abc}$ and $f_{ijk\ell}^{abcd}$ are defined analogously. For notational convenience we will work with the relative frequencies $p_i^a=f_i^a/n$, $p_{ij}^{ab}=f_{ij}^{ab}/n$, $p_{ijk}^{abc}=f_{ijk}^{abc}/n$, and $p_{ijk\ell}^{abcd}=f_{ijk\ell}^{abcd}/n$.
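These relative frequencies are straightforward to compute from data. The following sketch (plain Python; the function names p1 and p2 are ours, not the paper's) illustrates the notation on a small toy data set:

```python
# Sketch of the notation above: ratings is an n x m list of 0/1 rows.
# Raters are indexed 0..m-1 here, whereas the paper uses 1..m.
from fractions import Fraction

def p1(ratings, i, a):
    """p_i^a: relative frequency with which rater i used category a."""
    return Fraction(sum(row[i] == a for row in ratings), len(ratings))

def p2(ratings, i, j, a, b):
    """p_ij^ab: relative frequency of the joint pattern (a, b) for raters i, j."""
    return Fraction(sum(row[i] == a and row[j] == b for row in ratings),
                    len(ratings))

# Toy data: 4 objects rated by 2 raters.
ratings = [(1, 1), (1, 0), (0, 0), (0, 0)]
print(p1(ratings, 0, 1))        # rater 1 used category 1 for 2 of 4 objects -> 1/2
print(p2(ratings, 0, 1, 1, 1))  # both raters chose category 1 once -> 1/4
```

Exact rational arithmetic (fractions) is used so that the quantities match the paper's fractions exactly rather than being floating-point approximations.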
For illustrating the concepts and results presented in this paper we use the study presented in O’Malley et al. [20]. In this study four pathologists (raters 1, 3, 5, and 8 in Figure 6 in [20]) examined images from 30 columnar cell lesions of the breast with low-grade/monomorphic-type cytologic atypia. The pathologists were instructed to categorize each lesion as either “Flat Epithelial Atypia” (coded 1) or “Not Atypical” (coded 0). The results for each rater for all 30 cases are presented in Table 1. The four columns labeled 1 to 4 of Table 1 contain the ratings of the pathologists. The frequencies in the first column of Table 1 indicate how many times, out of a total of 30 cases, a certain pattern of ratings occurred. Only five of all $2^4=16$ theoretically possible patterns of 1s and 0s are observed in these data. Values of various multirater kappas for these data are presented on the right-hand side of the table. The formulas of the multirater kappas are presented in Section 3.
Table 1: Ratings by 4 pathologists for 30 cases, where 1 = Flat Epithelial Atypia and 0 = Not Atypical.

Freq.   Rater 1   Rater 2   Rater 3   Rater 4
 10        1         1         1         1      κ(4,2) ≈ .802479
  2        1         0         1         0
  2        1         0         0         0      κ(4,3) ≈ .802479
  1        0         0         0         1
 15        0         0         0         0      κ(4,4) ≈ .802076
We can think of the four proportions $p_{ij}^{00}$, $p_{ij}^{01}$, $p_{ij}^{10}$, and $p_{ij}^{11}$ as the elements of a 2×2 table that summarizes the 2-agreement between raters i and j [10]. Proportions $p_{ij}^{00}$, $p_{ij}^{01}$, $p_{ij}^{10}$, and $p_{ij}^{11}$ are quantities of 2-agreement, because they describe information on a pair of raters. In general we have
(2.1) $p_{ij}^{00}+p_{ij}^{01}+p_{ij}^{10}+p_{ij}^{11}=1$.
Summing over the rows of this 2×2 table we obtain the marginal totals $p_i^0$ and $p_i^1$ corresponding to rater i.
Example 2.1.
For raters 1 and 2 in Table 1 we have
(2.2) $p_{12}^{00}=\dfrac{15+1}{30}=\dfrac{8}{15}$, $p_{12}^{01}=0$, $p_{12}^{10}=\dfrac{2+2}{30}=\dfrac{2}{15}$, $p_{12}^{11}=\dfrac{10}{30}=\dfrac{1}{3}$, and $p_{12}^{00}+p_{12}^{01}+p_{12}^{10}+p_{12}^{11}=\dfrac{8}{15}+\dfrac{2}{15}+\dfrac{1}{3}=1$,
illustrating identity (2.1). The marginal totals
(2.3) $p_1^0=\dfrac{8}{15}$, $p_1^1=\dfrac{2}{15}+\dfrac{1}{3}=\dfrac{7}{15}$, $p_2^0=\dfrac{8}{15}+\dfrac{2}{15}=\dfrac{2}{3}$, $p_2^1=\dfrac{1}{3}$
indicate how often raters 1 and 2 used the categories 0 and 1.
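Example 2.1 can be replayed in code. In the sketch below (plain Python), Table 1 is encoded as pattern/frequency pairs of our own making and exact arithmetic is used:

```python
# Table 1 as (pattern, frequency) pairs, expanded to a 30 x 4 matrix of ratings.
from fractions import Fraction

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n = len(ratings)  # 30 cases

def p2(i, j, a, b):
    """p_ij^ab for raters i, j (0-based columns of Table 1)."""
    return Fraction(sum(r[i] == a and r[j] == b for r in ratings), n)

# The four cell proportions of the 2 x 2 table for raters 1 and 2:
cells = {(a, b): p2(0, 1, a, b) for a in (0, 1) for b in (0, 1)}
print(cells[0, 0], cells[0, 1], cells[1, 0], cells[1, 1])  # 8/15 0 2/15 1/3
print(sum(cells.values()))  # identity (2.1): 1
```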
We can think of the eight proportions $p_{ijk}^{000},p_{ijk}^{001},\dots,p_{ijk}^{110},p_{ijk}^{111}$ as the elements of a 2×2×2 table that summarizes the 3-agreement between raters i, j, and k. We have
(2.4) $p_{ijk}^{000}+p_{ijk}^{001}+p_{ijk}^{010}+p_{ijk}^{100}+p_{ijk}^{011}+p_{ijk}^{101}+p_{ijk}^{110}+p_{ijk}^{111}=1$.
Summing over the direction corresponding to rater k, the 2×2×2 table collapses into the 2×2 table for raters i and j.
Example 2.2.
For raters 1, 2, and 3 in Table 1 we have
(2.5) $p_{123}^{000}=\dfrac{8}{15}$, $p_{123}^{100}=\dfrac{1}{15}$, $p_{123}^{101}=\dfrac{1}{15}$, $p_{123}^{111}=\dfrac{1}{3}$,
and $p_{123}^{001}=p_{123}^{010}=p_{123}^{011}=p_{123}^{110}=0$. Furthermore, we have
(2.6) $p_{123}^{000}+p_{123}^{100}+p_{123}^{101}+p_{123}^{111}=\dfrac{8}{15}+\dfrac{1}{15}+\dfrac{1}{15}+\dfrac{1}{3}=1$,
illustrating identity (2.4).
The 2-agreement and 3-agreement quantities are related in the following way. For $a,b\in\{0,1\}$ we have the identities
(2.7a) $p_{ij}^{ab}=p_{ijk}^{ab0}+p_{ijk}^{ab1}$,
(2.7b) $p_{ik}^{ab}=p_{ijk}^{a0b}+p_{ijk}^{a1b}$,
(2.7c) $p_{jk}^{ab}=p_{ijk}^{0ab}+p_{ijk}^{1ab}$.
For example, we have $p_{12}^{10}=p_{123}^{100}+p_{123}^{101}=1/15+1/15=2/15$. Moreover, we have an analogous set of identities for products of the marginal totals. That is, for $a,b\in\{0,1\}$ we have the identities
(2.8a) $p_i^ap_j^b=p_i^ap_j^bp_k^0+p_i^ap_j^bp_k^1$,
(2.8b) $p_i^ap_k^b=p_i^ap_j^0p_k^b+p_i^ap_j^1p_k^b$,
(2.8c) $p_j^ap_k^b=p_i^0p_j^ap_k^b+p_i^1p_j^ap_k^b$.
Using the relations between the 2-agreement and 3-agreement quantities in (2.7a), (2.7b), and (2.7c) and (2.8a), (2.8b), and (2.8c) we may derive the following identities. Proposition 2.3 is used in the proof of the theorem in Section 3.
Proposition 2.3.
Consider three raters i, j, and k. One has
(2.9) $p_{ij}^{00}+p_{ij}^{11}+p_{ik}^{00}+p_{ik}^{11}+p_{jk}^{00}+p_{jk}^{11}=2\left(p_{ijk}^{000}+p_{ijk}^{111}\right)+1$,
(2.10) $p_i^0p_j^0+p_i^1p_j^1+p_i^0p_k^0+p_i^1p_k^1+p_j^0p_k^0+p_j^1p_k^1=2\left(p_i^0p_j^0p_k^0+p_i^1p_j^1p_k^1\right)+1$.
Proof.
We can express the sum of the 2-agreement quantities
(2.11) $p_{ij}^{00}+p_{ij}^{11}+p_{ik}^{00}+p_{ik}^{11}+p_{jk}^{00}+p_{jk}^{11}$
in terms of 3-agreement quantities using the identities in (2.7a), (2.7b), and (2.7c). Doing this we obtain
(2.12) $3p_{ijk}^{000}+p_{ijk}^{001}+p_{ijk}^{010}+p_{ijk}^{100}+p_{ijk}^{011}+p_{ijk}^{101}+p_{ijk}^{110}+3p_{ijk}^{111}$.
Applying identity (2.4) to (2.12) we obtain identity (2.9). Using the identities in (2.8a), (2.8b), and (2.8c), identity (2.10) is obtained in a similar way.
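As a numerical sanity check, the sketch below (plain Python; the pattern encoding of Table 1 is ours) verifies identities (2.9) and (2.10) for raters 1, 2, and 3 of Table 1:

```python
# Verify identities (2.9) and (2.10) for raters 1, 2, 3 of Table 1.
from fractions import Fraction

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n = len(ratings)

def p(raters, cats):
    """Joint relative frequency for the given raters and categories."""
    return Fraction(sum(all(r[i] == a for i, a in zip(raters, cats))
                        for r in ratings), n)

i, j, k = 0, 1, 2
# Identity (2.9): pairwise 00/11 proportions vs. triple 000/111 proportions.
lhs = sum(p((u, v), (a, a)) for u, v in [(i, j), (i, k), (j, k)] for a in (0, 1))
rhs = 2 * (p((i, j, k), (0, 0, 0)) + p((i, j, k), (1, 1, 1))) + 1
assert lhs == rhs == Fraction(41, 15)

# Identity (2.10): products of the marginal totals.
marg = lambda u, a: p((u,), (a,))
lhs2 = sum(marg(u, a) * marg(v, a) for u, v in [(i, j), (i, k), (j, k)] for a in (0, 1))
rhs2 = 2 * (marg(i, 0) * marg(j, 0) * marg(k, 0) + marg(i, 1) * marg(j, 1) * marg(k, 1)) + 1
assert lhs2 == rhs2
print("identities (2.9) and (2.10) hold for raters 1, 2, 3")
```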
We can think of the sixteen proportions $p_{ijk\ell}^{0000},p_{ijk\ell}^{0001},\dots,p_{ijk\ell}^{1110},p_{ijk\ell}^{1111}$ as the elements of a 2×2×2×2 table that summarizes the 4-agreement between raters i, j, k, and ℓ. We have
(2.13) $p_{ijk\ell}^{0000}+p_{ijk\ell}^{0001}+\cdots+p_{ijk\ell}^{1110}+p_{ijk\ell}^{1111}=1$.
Example 2.4.
For raters 1, 2, 3, and 4 in Table 1 we have
(2.14) $p_{1234}^{0000}=\dfrac{1}{2}$, $p_{1234}^{1000}=\dfrac{1}{15}$, $p_{1234}^{0001}=\dfrac{1}{30}$, $p_{1234}^{1010}=\dfrac{1}{15}$, $p_{1234}^{1111}=\dfrac{1}{3}$.
The remaining 4-agreement quantities are zero. Furthermore, we have
(2.15) $p_{1234}^{0000}+p_{1234}^{1000}+p_{1234}^{0001}+p_{1234}^{1010}+p_{1234}^{1111}=\dfrac{1}{2}+\dfrac{1}{15}+\dfrac{1}{30}+\dfrac{1}{15}+\dfrac{1}{3}=1$,
illustrating identity (2.13).
The 3-agreement and 4-agreement quantities are related in the following way. For $a,b,c\in\{0,1\}$ we have the identities
(2.16a) $p_{ijk}^{abc}=p_{ijk\ell}^{abc0}+p_{ijk\ell}^{abc1}$,
(2.16b) $p_{ij\ell}^{abc}=p_{ijk\ell}^{ab0c}+p_{ijk\ell}^{ab1c}$,
(2.16c) $p_{ik\ell}^{abc}=p_{ijk\ell}^{a0bc}+p_{ijk\ell}^{a1bc}$,
(2.16d) $p_{jk\ell}^{abc}=p_{ijk\ell}^{0abc}+p_{ijk\ell}^{1abc}$.
For example, we have $p_{123}^{000}=p_{1234}^{0000}+p_{1234}^{0001}=1/2+1/30=8/15$. There is also an analogous set of identities for products of the marginal totals.
The identities in (2.16a), (2.16b), (2.16c), and (2.16d) do not lead to a result analogous to Proposition 2.3. However, we have the following less general result.
Proposition 2.5.
Consider four raters i, j, k, and ℓ. Suppose
(2.17) $p_{ijk\ell}^{1100}=p_{ijk\ell}^{1010}=p_{ijk\ell}^{1001}=p_{ijk\ell}^{0110}=p_{ijk\ell}^{0101}=p_{ijk\ell}^{0011}=0$.
One has
(2.18) $p_{ijk}^{000}+p_{ijk}^{111}+p_{ij\ell}^{000}+p_{ij\ell}^{111}+p_{ik\ell}^{000}+p_{ik\ell}^{111}+p_{jk\ell}^{000}+p_{jk\ell}^{111}=3\left(p_{ijk\ell}^{0000}+p_{ijk\ell}^{1111}\right)+1$.
Proof.
We can express the sum of the 3-agreement quantities
(2.19) $p_{ijk}^{000}+p_{ijk}^{111}+p_{ij\ell}^{000}+p_{ij\ell}^{111}+p_{ik\ell}^{000}+p_{ik\ell}^{111}+p_{jk\ell}^{000}+p_{jk\ell}^{111}$,
in terms of 4-agreement quantities using the identities in (2.16a), (2.16b), (2.16c), and (2.16d). Doing this we obtain
(2.20) $4p_{ijk\ell}^{0000}+p_{ijk\ell}^{0001}+p_{ijk\ell}^{0010}+p_{ijk\ell}^{0100}+p_{ijk\ell}^{1000}+p_{ijk\ell}^{1110}+p_{ijk\ell}^{1101}+p_{ijk\ell}^{1011}+p_{ijk\ell}^{0111}+4p_{ijk\ell}^{1111}$.
Combining (2.13) and (2.17) we obtain the identity
(2.21) $p_{ijk\ell}^{0000}+p_{ijk\ell}^{0001}+p_{ijk\ell}^{0010}+p_{ijk\ell}^{0100}+p_{ijk\ell}^{1000}+p_{ijk\ell}^{1110}+p_{ijk\ell}^{1101}+p_{ijk\ell}^{1011}+p_{ijk\ell}^{0111}+p_{ijk\ell}^{1111}=1$.
Applying (2.21) to (2.20) we obtain identity (2.18).
The 4-agreement quantities $p_i^1p_j^1p_k^0p_\ell^0$, $p_i^1p_j^0p_k^1p_\ell^0$, $p_i^1p_j^0p_k^0p_\ell^1$, $p_i^0p_j^1p_k^1p_\ell^0$, $p_i^0p_j^1p_k^0p_\ell^1$, and $p_i^0p_j^0p_k^1p_\ell^1$ are in general not zero. Hence, even if we required that condition (2.17) holds, we would not obtain an identity similar to (2.18) for the products of the marginal totals.
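Proposition 2.5 can be checked numerically on any data set in which no object receives exactly two 1s, so that condition (2.17) holds. The small data set below is a hypothetical construction of our own:

```python
# Check identity (2.18) on hypothetical data satisfying condition (2.17):
# no rating pattern with exactly two 1s occurs.
from fractions import Fraction
from itertools import combinations

patterns = [((0, 0, 0, 0), 5), ((0, 0, 0, 1), 2), ((0, 1, 1, 1), 1),
            ((1, 1, 1, 1), 4)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n = len(ratings)

def p(raters, cats):
    return Fraction(sum(all(r[i] == a for i, a in zip(raters, cats))
                        for r in ratings), n)

# Condition (2.17): all patterns with exactly two 1s have proportion zero.
assert all(p((0, 1, 2, 3), c) == 0
           for c in [(1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1),
                     (0, 1, 1, 0), (0, 1, 0, 1), (0, 0, 1, 1)])

# Identity (2.18): sum of triple 000/111 proportions over the four triples.
lhs = sum(p(t, (a, a, a)) for t in combinations(range(4), 3) for a in (0, 1))
rhs = 3 * (p((0, 1, 2, 3), (0, 0, 0, 0)) + p((0, 1, 2, 3), (1, 1, 1, 1))) + 1
assert lhs == rhs
print(lhs)  # 13/4
```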
3. Kappas Based on 2-, 3-, and 4-Agreement
In this section we present the main result. We introduce Cohen’s κ [5] and three multirater kappas, one based on 2-agreement, one based on 3-agreement, and one based on 4-agreement. For two raters i and j Cohen’s κ is defined as
(3.1) $\kappa=\kappa(2,2)=\dfrac{p_{ij}^{00}+p_{ij}^{11}-p_i^0p_j^0-p_i^1p_j^1}{1-p_i^0p_j^0-p_i^1p_j^1}$.
Example 3.1.
For raters 1 and 2 in Table 1 we have
(3.2) $\kappa=\dfrac{8/15+1/3-(8/15)(2/3)-(7/15)(1/3)}{1-(8/15)(2/3)-(7/15)(1/3)}=\dfrac{16/45}{22/45}=\dfrac{8}{11}\approx.727$.
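Formula (3.1) can be evaluated with exact arithmetic for raters 1 and 2 of Table 1 (plain Python sketch; the pattern encoding of Table 1 is ours):

```python
# Cohen's kappa (3.1) for raters 1 and 2 (columns 0 and 1) of Table 1.
from fractions import Fraction

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n = len(ratings)

def p(raters, cats):
    return Fraction(sum(all(r[i] == a for i, a in zip(raters, cats))
                        for r in ratings), n)

observed = p((0, 1), (0, 0)) + p((0, 1), (1, 1))  # 8/15 + 1/3
chance = (p((0,), (0,)) * p((1,), (0,))           # (8/15)(2/3)
          + p((0,), (1,)) * p((1,), (1,)))        # (7/15)(1/3)
kappa = (observed - chance) / (1 - chance)
print(kappa)  # 8/11
```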
There are several ways to generalize Cohen’s κ to the case of multiple raters. A kappa for m raters based on 2-agreement between the raters is given by
(3.3) $\kappa(m,2)=\dfrac{\sum_{i<j}^{m}\left(p_{ij}^{00}+p_{ij}^{11}-p_i^0p_j^0-p_i^1p_j^1\right)}{\binom{m}{2}-\sum_{i<j}^{m}\left(p_i^0p_j^0+p_i^1p_j^1\right)}$.
The m in κ(m,2) denotes that this coefficient is a measure for m raters. The 2 in κ(m,2) denotes that the coefficient is a measure of 2-agreement, since the $p_{ij}^{00}$ and $p_{ij}^{11}$ describe information on pairs of raters.
Coefficient κ(m,2) is a special case of a multicategorical kappa that was first considered by Hubert [13] and independently proposed by Conger [6]. Hubert’s kappa is also discussed in Davies and Fleiss [7], Popping [21], and Heuvelmans and Sanders [11]. Furthermore, Hubert’s kappa is a special case of the descriptive statistics discussed in Berry and Mielke [3] and Janson and Olsson [14]. Standard errors for κ(m,2) can be found in Hubert [13].
Example 3.2.
For the four raters in Table 1 we have
(3.4) $\sum_{i<j}^{4}\left(p_{ij}^{00}+p_{ij}^{11}\right)=\dfrac{163}{30}$, $\sum_{i<j}^{4}\left(p_i^0p_j^0+p_i^1p_j^1\right)=\dfrac{1409}{450}$, and hence $\kappa(4,2)=\dfrac{163/30-1409/450}{6-1409/450}=\dfrac{1036}{1291}\approx.802479$.
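A direct implementation of (3.3) reproduces these numbers (plain Python sketch; the pattern encoding of Table 1 is ours):

```python
# Hubert's kappa (3.3) for the four raters of Table 1.
from fractions import Fraction
from itertools import combinations
from math import comb

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n, m = len(ratings), 4

def p(raters, cats):
    return Fraction(sum(all(r[i] == a for i, a in zip(raters, cats))
                        for r in ratings), n)

observed = sum(p((i, j), (a, a)) for i, j in combinations(range(m), 2)
               for a in (0, 1))
chance = sum(p((i,), (a,)) * p((j,), (a,))
             for i, j in combinations(range(m), 2) for a in (0, 1))
kappa42 = (observed - chance) / (comb(m, 2) - chance)
print(observed, chance, kappa42)  # 163/30 1409/450 1036/1291
```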
A kappa for m raters based on 3-agreement between the raters is given by
(3.5) $\kappa(m,3)=\dfrac{\sum_{i<j<k}^{m}\left(p_{ijk}^{000}+p_{ijk}^{111}-p_i^0p_j^0p_k^0-p_i^1p_j^1p_k^1\right)}{\binom{m}{3}-\sum_{i<j<k}^{m}\left(p_i^0p_j^0p_k^0+p_i^1p_j^1p_k^1\right)}$.
For m=3 raters we have the special case
(3.6) $\kappa(3,3)=\dfrac{p_{ijk}^{000}+p_{ijk}^{111}-p_i^0p_j^0p_k^0-p_i^1p_j^1p_k^1}{1-p_i^0p_j^0p_k^0-p_i^1p_j^1p_k^1}$.
Coefficient κ(3,3) was first considered in Von Eye and Mun [8]. It is also a special case of the weighted kappa proposed in Mielke et al. [17, 18]. The coefficient is a measure of simultaneous agreement [18]. Standard errors for κ(3,3) can be found in [17, 18].
Example 3.3.
For the four raters in Table 1 we have
(3.7) $\sum_{i<j<k}^{4}\left(p_{ijk}^{000}+p_{ijk}^{111}\right)=\dfrac{103}{30}$, $\sum_{i<j<k}^{4}\left(p_i^0p_j^0p_k^0+p_i^1p_j^1p_k^1\right)=\dfrac{509}{450}$, and hence $\kappa(4,3)=\dfrac{103/30-509/450}{4-509/450}=\dfrac{1036}{1291}\approx.802479$.
Interestingly, we have κ(4,2)=κ(4,3) (Example 3.2).
Examples 3.2 and 3.3 show that the multirater kappas based on 2-agreement and 3-agreement produce identical values for the data in Table 1. This equivalence is formalized in the following result.
Theorem 3.4.
$\kappa(m,2)=\kappa(m,3)$ for all $m\geq3$.
Proof.
Given m raters, a fixed pair of raters i and j occurs in exactly m−2 triples of raters. Hence, summing identities (2.9) and (2.10) over all triples we have
(3.8) $(m-2)\sum_{i<j}^{m}\left(p_{ij}^{00}+p_{ij}^{11}\right)=\sum_{i<j<k}^{m}\left[2\left(p_{ijk}^{000}+p_{ijk}^{111}\right)+1\right]$,
$(m-2)\sum_{i<j}^{m}\left(p_i^0p_j^0+p_i^1p_j^1\right)=\sum_{i<j<k}^{m}\left[2\left(p_i^0p_j^0p_k^0+p_i^1p_j^1p_k^1\right)+1\right]$.
Multiplying all terms in κ(m,2) by m-2, and using identities (3.8) in the result, we obtain
(3.9) $\dfrac{2\sum_{i<j<k}^{m}\left(p_{ijk}^{000}+p_{ijk}^{111}-p_i^0p_j^0p_k^0-p_i^1p_j^1p_k^1\right)}{(m-2)\binom{m}{2}-2\sum_{i<j<k}^{m}\left(p_i^0p_j^0p_k^0+p_i^1p_j^1p_k^1\right)-\binom{m}{3}}$.
Since
(3.10) $(m-2)\binom{m}{2}-\binom{m}{3}=2\cdot\dfrac{m(m-1)(m-2)}{6}=2\binom{m}{3}$
in the denominator of (3.9), dividing the numerator and denominator of (3.9) by 2 shows that coefficient (3.9) is equivalent to $\kappa(m,3)$.
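The theorem can also be checked empirically. The sketch below (plain Python; kappa_g is our own name for a general g-agreement implementation of the κ(m,g) family) compares g = 2 and g = 3 on randomly generated binary ratings using exact rational arithmetic:

```python
# Empirical check of Theorem 3.4: kappa(m,2) == kappa(m,3) for random 0/1 data.
import random
from fractions import Fraction
from itertools import combinations
from math import comb, prod

def kappa_g(ratings, g):
    """Multirater kappa based on g-agreement for dichotomous ratings."""
    n, m = len(ratings), len(ratings[0])
    marg = [[Fraction(sum(r[i] == a for r in ratings), n) for a in (0, 1)]
            for i in range(m)]
    observed = sum(Fraction(sum(all(r[i] == a for i in sub) for r in ratings), n)
                   for sub in combinations(range(m), g) for a in (0, 1))
    chance = sum(prod(marg[i][a] for i in sub)
                 for sub in combinations(range(m), g) for a in (0, 1))
    return (observed - chance) / (comb(m, g) - chance)

random.seed(1)
for m in (3, 4, 5, 6):
    ratings = [tuple(random.randint(0, 1) for _ in range(m)) for _ in range(20)]
    assert kappa_g(ratings, 2) == kappa_g(ratings, 3)  # exact, via fractions
print("kappa(m,2) == kappa(m,3) for all generated data sets")
```

Because fractions are used, the two values are compared exactly rather than up to rounding error.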
Finally, a kappa for m raters based on 4-agreement between the raters is given by
(3.11) $\kappa(m,4)=\dfrac{\sum_{i<j<k<\ell}^{m}\left(p_{ijk\ell}^{0000}+p_{ijk\ell}^{1111}-p_i^0p_j^0p_k^0p_\ell^0-p_i^1p_j^1p_k^1p_\ell^1\right)}{\binom{m}{4}-\sum_{i<j<k<\ell}^{m}\left(p_i^0p_j^0p_k^0p_\ell^0+p_i^1p_j^1p_k^1p_\ell^1\right)}$.
The special case κ(4,4) extends the kappa proposed in Von Eye and Mun [8] and Mielke et al. [17, 18].
Example 3.5.
For the four raters in Table 1 we have
(3.12) $p_{1234}^{0000}+p_{1234}^{1111}=\dfrac{5}{6}$ and $p_1^0p_2^0p_3^0p_4^0+p_1^1p_2^1p_3^1p_4^1=\dfrac{533}{3375}$, and hence $\kappa(4,4)=\dfrac{5/6-533/3375}{1-533/3375}=\dfrac{4559}{5684}\approx.802076$.
Note that for these data we have κ(4,2)=κ(4,3)≠κ(4,4) (Examples 3.2 and 3.3), although the difference between the values of the multirater kappas is negligible.
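Formula (3.11) can likewise be evaluated directly for Table 1 (plain Python sketch; the pattern encoding is ours):

```python
# kappa(4,4), formula (3.11), for the four raters of Table 1.
from fractions import Fraction
from math import prod

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n = len(ratings)

marg = [[Fraction(sum(r[i] == a for r in ratings), n) for a in (0, 1)]
        for i in range(4)]
observed = sum(Fraction(sum(r == (a,) * 4 for r in ratings), n) for a in (0, 1))
chance = sum(prod(marg[i][a] for i in range(4)) for a in (0, 1))
kappa44 = (observed - chance) / (1 - chance)
print(observed, chance, kappa44)  # 5/6 533/3375 4559/5684
```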
4. Discussion
Cohen’s kappa is a standard tool for summarizing agreement between the ratings of two observers on a nominal scale. Cohen’s kappa can only be used for comparing m=2 raters at a time. Various authors have therefore proposed extensions of Cohen’s kappa for m≥2 raters. The concept of g-agreement with g∈{2,3,…,m} refers to the situation in which it is decided that there is agreement if g out of m raters assign an object to the same category [6, 22]. Given m≥2 raters we can formulate m−1 multirater kappas, one based on 2-agreement, one based on 3-agreement, and so on, up to one based on m-agreement. Although all these kappas can be defined from a mathematical perspective, the multirater kappas in general produce different values (see, e.g., [32, 33]). In this paper we considered multirater kappas based on 2-, 3-, and 4-agreement for dichotomous ratings.
As the main result of the paper it was shown (Theorem 3.4, Section 3) that the popular concept of 2-agreement and the slightly more demanding but reasonable alternative concept of 3-agreement coincide for dichotomous (binary) scores, that is, the multirater kappas based on 2-agreement and 3-agreement are identical. Hence, for ratings on a dichotomous scale the problem of which form of agreement to use does not occur. The key properties for this equivalence are the relations between the 2-agreement and 3-agreement quantities in Proposition 2.3 (Section 2). The O’Malley et al. data in Table 1 and the hypothetical data in Table 2 show that 2/3-agreement is not equivalent to 4-agreement. This is because there is no result analogous to Proposition 2.3 between 2/3-agreement and 4-agreement quantities. The data examples in, for example, Warrens [32, 33] show that the equivalence also does not hold for multirater kappas for more than two categories. Furthermore, the data examples in Table 2 show that the 2/3-agreement and 4-agreement kappas can produce quite different values.
Table 2: Two hypothetical data sets with dichotomous judgments by 4 raters for 15 cases.

Data set 1:
Freq.   Rater 1   Rater 2   Rater 3   Rater 4
  6        1         1         1         1      κ(4,2) ≈ .645
  5        0         1         1         1      κ(4,3) ≈ .645
  4        0         0         0         0      κ(4,4) ≈ .599

Data set 2:
Freq.   Rater 1   Rater 2   Rater 3   Rater 4
  6        1         1         1         1      κ(4,2) ≈ .564
  5        1         0         1         0      κ(4,3) ≈ .564
  4        0         0         0         0      κ(4,4) ≈ .625
Another statistic that is often regarded as a generalization of Cohen’s κ is the multirater statistic proposed in Fleiss [9]. Artstein and Poesio [1], however, showed that this statistic is actually a multirater extension of Scott’s pi [23] (see also [22]). Indeed, using $(p_i^a+p_j^a)^2/4$ instead of $p_i^ap_j^a$ in $\kappa(m,2)$ we obtain, for two categories, a special case of the coefficient in Fleiss [9], whereas $\kappa(m,2)$ itself is a special case of Hubert’s kappa [6, 13, 29]. It is possible to formulate an analogous multirater pi coefficient based on 3-agreement. This pi coefficient is equivalent to the pi coefficient based on 2-agreement.
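To illustrate the difference between the two chance terms, the sketch below (plain Python; all names are ours) computes κ(4,2) and the corresponding Fleiss/Scott-type pi coefficient, which replaces $p_i^ap_j^a$ by $(p_i^a+p_j^a)^2/4$, for the Table 1 data:

```python
# Hubert-type kappa(4,2) vs. a Fleiss/Scott-type pi(4,2) for Table 1:
# pi replaces the chance term p_i^a * p_j^a by ((p_i^a + p_j^a) / 2) ** 2.
from fractions import Fraction
from itertools import combinations

patterns = [((1, 1, 1, 1), 10), ((1, 0, 1, 0), 2), ((1, 0, 0, 0), 2),
            ((0, 0, 0, 1), 1), ((0, 0, 0, 0), 15)]
ratings = [row for row, freq in patterns for _ in range(freq)]
n, m = len(ratings), 4

marg = [[Fraction(sum(r[i] == a for r in ratings), n) for a in (0, 1)]
        for i in range(m)]
observed = sum(Fraction(sum(r[i] == a and r[j] == a for r in ratings), n)
               for i, j in combinations(range(m), 2) for a in (0, 1))
chance_kappa = sum(marg[i][a] * marg[j][a]
                   for i, j in combinations(range(m), 2) for a in (0, 1))
chance_pi = sum(((marg[i][a] + marg[j][a]) / 2) ** 2
                for i, j in combinations(range(m), 2) for a in (0, 1))
pairs = m * (m - 1) // 2
kappa = (observed - chance_kappa) / (pairs - chance_kappa)
pi = (observed - chance_pi) / (pairs - chance_pi)
print(kappa, pi)  # 1036/1291 4109/5129
```

For these data the pi variant is slightly smaller than the kappa variant, reflecting the averaged marginals in its chance term.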
Acknowledgment
This paper is a part of project 451-11-026 funded by The Netherlands Organisation for Scientific Research.
References

[1] R. Artstein and M. Poesio, “Kappa3 = Alpha (or Beta),” Report 05-1, University of Essex, 2005.
[2] M. Banerjee, M. Capozzoli, L. McSweeney, and D. Sinha, “Beyond kappa: a review of interrater agreement measures,” The Canadian Journal of Statistics, vol. 27, pp. 3–23, 1999.
[3] K. J. Berry and P. W. Mielke, “A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters,” Educational and Psychological Measurement, vol. 48, pp. 921–933, 1988.
[4] D. Cicchetti, R. Bronen, S. Spencer, S. Haut, A. Berg, P. Oliver, and P. Tyrer, “Rating scales, scales of measurement, issues of reliability: resolving some critical issues for clinicians and researchers,” The Journal of Nervous and Mental Disease, vol. 194, no. 8, pp. 557–564, 2006.
[5] J. Cohen, “A coefficient of agreement for nominal scales,” Educational and Psychological Measurement, vol. 20, pp. 37–46, 1960.
[6] A. J. Conger, “Integration and generalization of kappas for multiple raters,” Psychological Bulletin, vol. 88, no. 2, pp. 322–328, 1980.
[7] M. Davies and J. L. Fleiss, “Measuring agreement for multinomial data,” Biometrics, vol. 38, pp. 1047–1051, 1982.
[8] A. von Eye and E. Y. Mun, Analyzing Rater Agreement: Manifest Variable Methods, Lawrence Erlbaum Associates, 2006.
[9] J. L. Fleiss, “Measuring nominal scale agreement among many raters,” Psychological Bulletin, vol. 76, no. 5, pp. 378–382, 1971.
[10] J. L. Fleiss, “Measuring agreement between two judges on the presence or absence of a trait,” Biometrics, vol. 31, pp. 651–659, 1975.
[11] A. P. J. M. Heuvelmans and P. F. Sanders, “Beoordelaarsovereenstemming,” in P. F. Sanders and T. J. H. M. Eggen, Eds., pp. 443–470, Cito Instituut voor Toetsontwikkeling, Arnhem, The Netherlands, 1993.
[12] L. M. Hsu and R. Field, “Interrater agreement measures: comments on kappan, Cohen’s kappa, Scott’s π and Aickin’s α,” Understanding Statistics, vol. 2, pp. 205–219, 2003.
[13] L. Hubert, “Kappa revisited,” Psychological Bulletin, vol. 84, no. 2, pp. 289–297, 1977.
[14] H. Janson and U. Olsson, “A measure of agreement for interval or nominal multivariate observations,” Educational and Psychological Measurement, vol. 61, no. 2, pp. 277–289, 2001.
[15] J. R. Landis and G. G. Koch, “An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers,” Biometrics, vol. 33, pp. 363–374, 1977.
[16] J. R. Landis and G. G. Koch, “The measurement of observer agreement for categorical data,” Biometrics, vol. 33, pp. 159–174, 1977.
[17] P. W. Mielke, K. J. Berry, and J. E. Johnston, “The exact variance of weighted kappa with multiple raters,” Psychological Reports, vol. 101, no. 2, pp. 655–660, 2007.
[18] P. W. Mielke, K. J. Berry, and J. E. Johnston, “Resampling probability values for weighted kappa with multiple raters,” Psychological Reports, vol. 102, no. 2, pp. 606–613, 2008.
[19] J. C. Nelson and M. S. Pepe, “Statistical description of interrater variability in ordinal ratings,” Statistical Methods in Medical Research, vol. 9, pp. 475–496, 2000.
[20] F. P. O’Malley, S. K. Mohsin, S. Badve, S. Bose, L. C. Collins, M. Ennis, C. G. Kleer, S. E. Pinder, and S. J. Schnitt, “Interobserver reproducibility in the diagnosis of flat epithelial atypia of the breast,” Modern Pathology, vol. 19, no. 2, pp. 172–179, 2006.
[21] R. Popping, Rijksuniversiteit Groningen, Groningen, The Netherlands, 1983.
[22] R. Popping, “Some views on agreement to be used in content analysis studies,” Quality & Quantity, vol. 44, no. 6, pp. 1067–1078, 2010.
[23] W. A. Scott, “Reliability of content analysis: the case of nominal scale coding,” Public Opinion Quarterly, vol. 19, no. 3, pp. 321–325, 1955.
[24] S. Vanbelle and A. Albert, “Agreement between an isolated rater and a group of raters,” Statistica Neerlandica, vol. 63, no. 1, pp. 82–100, 2009.
[25] S. Vanbelle and A. Albert, “Agreement between two independent groups of raters,” Psychometrika, vol. 74, no. 3, pp. 477–491, 2009.
[26] S. Vanbelle and A. Albert, “A note on the linearly weighted kappa coefficient for ordinal scales,” Statistical Methodology, vol. 6, no. 2, pp. 157–163, 2009.
[27] M. J. Warrens, “k-adic similarity coefficients for binary (presence/absence) data,” Journal of Classification, vol. 26, no. 2, pp. 227–245, 2009.
[28] M. J. Warrens, “Inequalities between kappa and kappa-like statistics for k×k tables,” Psychometrika, vol. 75, no. 1, pp. 176–185, 2010.
[29] M. J. Warrens, “Inequalities between multi-rater kappas,” Advances in Data Analysis and Classification, vol. 4, pp. 271–286, 2010.
[30] M. J. Warrens, “Cohen’s linearly weighted kappa is a weighted average of 2×2 kappas,” Psychometrika, vol. 76, no. 3, pp. 471–486, 2011.
[31] M. J. Warrens, “Weighted kappa is higher than Cohen’s kappa for tridiagonal agreement tables,” Statistical Methodology, vol. 8, no. 2, pp. 268–272, 2011.
[32] M. J. Warrens, “Equivalences of weighted kappas for multiple raters,” Statistical Methodology, vol. 9, no. 3, pp. 407–422, 2012.
[33] M. J. Warrens, “A family of multi-rater kappas that can always be increased and decreased by combining categories,” Statistical Methodology, vol. 9, no. 3, pp. 330–340, 2012.
[34] M. J. Warrens, “Conditional inequalities between Cohen’s kappa and weighted kappas,” Statistical Methodology, vol. 10, pp. 14–22, 2013.