Another hypothesis of interest is the degree to which two different examiners agree in their evaluations. This has important applications in medicine, where two physicians may be called upon to evaluate the same group of patients for further treatment.
Cohen's kappa statistic (or simply kappa) is intended to measure agreement between two raters who classify the same items into the same set of categories.
Recall the example on movie ratings from the introduction. Do the two movie critics, in this case, Siskel and Ebert, classify the same movies into the same categories; do they really agree?
| Siskel \ Ebert | con | mixed | pro | Total |
| --- | --- | --- | --- | --- |
| con | 24 | 8 | 13 | 45 |
| mixed | 8 | 13 | 11 | 32 |
| pro | 10 | 9 | 64 | 83 |
| Total | 42 | 30 | 88 | 160 |
In the (necessarily) square table above, the main diagonal counts represent movies where both raters agreed. Let \(\pi_{ij}\) denote the probability that Siskel classifies a movie in category \(i\) and Ebert classifies the same movie in category \(j\). For example, \(\pi_{13}\) is the probability that Siskel rates a movie as "con", but Ebert rates it as "pro".
The term \(\sum_i \pi_{ii}\) is then the total probability of agreement. The extreme case in which all observations are classified on the main diagonal is known as "perfect agreement".
Stop and Think! Is it possible to define perfect disagreement? Kappa measures agreement: perfect agreement occurs when all of the counts fall on the main diagonal of the table, in which case the probability of agreement equals 1.
To define perfect disagreement, the ratings of the movies would have to be opposite one another, ideally at the extremes. In a \(2 \times 2\) table it is possible to define perfect disagreement, because each positive rating can be paired with one specific negative rating (e.g., "love it" vs. "hate it"). But what about a \(3 \times 3\) or larger square table? There are then more ways to disagree, and it quickly becomes harder to disagree perfectly. Perfect disagreement would require an arrangement that minimizes agreement in every combination, and in larger tables this forces some cells to be empty, because it is impossible to have maximal disagreement across all combinations at the same time.
In a \(3 \times 3\) table, here are two options that would provide no agreement at all (# indicates a nonzero count):
| | 1 | 2 | 3 |
| --- | --- | --- | --- |
| 1 | 0 | 0 | # |
| 2 | 0 | 0 | 0 |
| 3 | # | 0 | 0 |

| | 1 | 2 | 3 |
| --- | --- | --- | --- |
| 1 | 0 | # | 0 |
| 2 | # | 0 | 0 |
| 3 | 0 | 0 | 0 |
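To make this concrete, here is a minimal R sketch that puts a hypothetical count of 50 in each of the # cells of the first layout and computes kappa directly from its definition (given in the next section): observed agreement minus chance agreement, scaled by its maximum possible value.

```r
# Hypothetical counts: 50 movies in each "#" cell of the first
# no-agreement layout above (rows = rater 1, columns = rater 2)
disagree <- matrix(c( 0, 0, 50,
                      0, 0,  0,
                     50, 0,  0),
                   nrow = 3, byrow = TRUE)

n     <- sum(disagree)
p_obs <- sum(diag(disagree)) / n                           # observed agreement = 0
p_exp <- sum(rowSums(disagree) * colSums(disagree)) / n^2  # chance agreement = 0.5

(p_obs - p_exp) / (1 - p_exp)
# -1 here, but only because the middle category is never used; in general the
# minimum of kappa depends on the marginals (see the note at the end of this section)
```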
Cohen's kappa is a single summary index that describes the strength of inter-rater agreement.
For \(I \times I\) tables, it is equal to

\(\kappa = \dfrac{\sum_i \pi_{ii} - \sum_i \pi_{i+}\pi_{+i}}{1 - \sum_i \pi_{i+}\pi_{+i}}\)
This statistic compares the observed agreement to the expected agreement, computed assuming the ratings are independent.
The null hypothesis that the ratings are independent is, therefore, equivalent to

\(H_0\colon \kappa = 0\)

If the observed agreement is due to chance only, i.e., if the ratings are completely independent, then each diagonal element is the product of the two corresponding marginals, \(\pi_{ii} = \pi_{i+}\pi_{+i}\).
Since the total probability of agreement is \(\sum_i \pi_{ii}\), the probability of agreement under the null hypothesis equals \(\sum_i \pi_{i+}\pi_{+i}\). Note also that \(\sum_i \pi_{ii} = 0\) means no agreement, and \(\sum_i \pi_{ii} = 1\) indicates perfect agreement. The kappa statistic is defined so that a larger value implies stronger agreement. Furthermore, \(\kappa = 1\) when there is perfect agreement, \(\kappa = 0\) when the observed agreement equals that expected by chance alone, and \(\kappa < 0\) when the observed agreement is weaker than expected by chance.
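As a sanity check, here is a minimal R sketch that estimates \(\kappa\) by plugging the sample proportions from the Siskel-Ebert table into the formula above; it reproduces the value reported by SAS and by vcd further below.

```r
# Siskel (rows) x Ebert (columns) counts from the table above
critic <- matrix(c(24,  8, 13,
                    8, 13, 11,
                   10,  9, 64),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(siskel = c("con", "mixed", "pro"),
                                 ebert  = c("con", "mixed", "pro")))

n     <- sum(critic)
p_obs <- sum(diag(critic)) / n                          # estimate of sum_i pi_ii
p_exp <- sum(rowSums(critic) * colSums(critic)) / n^2   # estimate of sum_i pi_i+ pi_+i

(kappa_hat <- (p_obs - p_exp) / (1 - p_exp))            # about 0.3888
```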
Note! Notice that strong agreement implies strong association, but strong association does not necessarily imply strong agreement. For example, if Siskel puts most of the movies into the "con" category while Ebert puts them into the "pro" category, the association might be strong, but there is certainly no agreement. You may also think of the situation where one examiner is tougher than the other: the first consistently gives one grade less than the more lenient one. In this case, too, the association is very strong, but the agreement may be negligible.
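The tough-versus-lenient grader scenario is easy to check numerically. The R sketch below uses hypothetical counts on a three-point scale in which the tough grader (rows) gives one grade less than the lenient grader (columns) whenever possible; the relationship is nearly deterministic, so the association is strong, yet kappa is slightly negative.

```r
# Hypothetical counts on a 1-3 scale: the tough grader (rows) gives one grade
# less than the lenient grader (columns), except when the lenient grade is
# already the lowest
grades <- matrix(c(20, 50,  0,    # tough grade 1
                    0,  0, 50,    # tough grade 2
                    0,  0,  0),   # tough grade 3 (never given)
                 nrow = 3, byrow = TRUE)

n     <- sum(grades)
p_obs <- sum(diag(grades)) / n
p_exp <- sum(rowSums(grades) * colSums(grades)) / n^2

(p_obs - p_exp) / (1 - p_exp)
# about -0.14: knowing one grade essentially determines the other (strong
# association), yet agreement is actually below what chance alone would produce
```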
Under multinomial sampling, the sample estimate \(\hat{\kappa}\) has a large-sample normal distribution. Thus, we can rely on the asymptotic 95% confidence interval.
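For instance, a Wald-type 95% interval is simply the estimate plus or minus 1.96 standard errors. The short R sketch below applies this to the estimate and asymptotic standard error reported in the output further down; any small differences from the reported limits are just rounding.

```r
# Wald-type 95% confidence interval for kappa from the estimate and its
# asymptotic standard error (values as reported in the output below)
kappa_hat <- 0.3888
ase       <- 0.0598

kappa_hat + c(-1, 1) * qnorm(0.975) * ase   # roughly (0.2716, 0.5060)
```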
In SAS, use the option AGREE as shown below and in the SAS program MovieCritics.sas.
```
data critic;
  input siskel $ ebert $ count;
  datalines;
con con 24
con mixed 8
con pro 13
mixed con 8
mixed mixed 13
mixed pro 11
pro con 10
pro mixed 9
pro pro 64
;
run;

proc freq;
  weight count;
  tables siskel*ebert / agree chisq;
run;
```
From the output below, we can see that the "Simple Kappa" gives the estimated kappa value of 0.3888 with its asymptotic standard error (ASE) of 0.0598. The difference between observed agreement and expected under independence is about 40% of the maximum possible difference. Based on the reported 95% confidence interval, \(\kappa\) falls somewhere between 0.2716 and 0.5060 indicating only a moderate agreement between Siskel and Ebert.
Kappa Statistics

| Statistic | Estimate | Standard Error | 95% Lower Confidence Limit | 95% Upper Confidence Limit |
| --- | --- | --- | --- | --- |
| Simple Kappa | 0.3888 | 0.0598 | 0.2716 | 0.5060 |
| Weighted Kappa | 0.4269 | 0.0635 | 0.3024 | 0.5513 |
Sample Size = 160
In R, we can use the Kappa function in the vcd package. The following are from the script MovieCritics.R.
```r
critic = matrix(c(24, 8, 10, 8, 13, 9, 13, 11, 64), nr = 3,
                dimnames = list("siskel" = c("con", "mixed", "pro"),
                                "ebert"  = c("con", "mixed", "pro")))
critic

# chi-square test for independence between raters
result = chisq.test(critic)
result

# kappa coefficient for agreement
library(vcd)
kappa = Kappa(critic)
kappa
confint(kappa)   # asymptotic 95% confidence intervals
```
From the output below, we can see that the "unweighted" statistic gives the estimated kappa value of 0.389 with its asymptotic standard error (ASE) of 0.063. The difference between observed agreement and expected under independence is about 40% of the maximum possible difference. Based on the reported values, the 95% confidence interval for \(\kappa\) ranges from 0.27 to 0.51, indicating only a moderate agreement between Siskel and Ebert.
```
> kappa
            value     ASE     z  Pr(>|z|)
Unweighted 0.3888 0.05979 6.503 7.870e-11
Weighted   0.4269 0.06350 6.723 1.781e-11

> confint(kappa)
Kappa              lwr       upr
  Unweighted 0.2716461 0.5060309
  Weighted   0.3024256 0.5513224
```
Kappa strongly depends on the marginal distributions. That is, the same rating process applied to populations with different proportions of cases in the various categories can give very different \(\kappa\) values. This is one reason why the minimum value of \(\kappa\) depends on the marginal distributions and why the minimum possible value of \(-1\) is not always attainable.
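The classic illustration uses two hypothetical \(2 \times 2\) tables with the same observed agreement (90% of cases on the diagonal) but different marginal distributions; the sketch below shows that their \(\kappa\) values nonetheless differ substantially.

```r
# Two hypothetical 2x2 rating tables, each with 90% of cases on the diagonal
balanced <- matrix(c(45,  5,
                      5, 45), nrow = 2, byrow = TRUE)  # balanced marginals
skewed   <- matrix(c(85,  5,
                      5,  5), nrow = 2, byrow = TRUE)  # skewed marginals

kappa_by_hand <- function(tab) {
  n     <- sum(tab)
  p_obs <- sum(diag(tab)) / n
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_by_hand(balanced)  # 0.80
kappa_by_hand(skewed)    # about 0.44
```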
Modeling agreement (e.g., via log-linear or other models) is typically a more informative approach.
Weighted kappa is a version of kappa used for measuring agreement on ordered variables, where certain disagreements (e.g., lowest versus highest) can be weighted as more or less important.
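As an illustration of how the weighting works, the R sketch below computes a weighted kappa for the Siskel-Ebert table using linear ("equal-spacing") agreement weights \(w_{ij} = 1 - |i-j|/(I-1)\), so that adjacent-category disagreements receive half credit and con-versus-pro disagreements receive none. With these weights it matches the weighted value of about 0.427 reported in the output above; other weighting schemes, such as Fleiss-Cohen weights, would give different values.

```r
# Weighted kappa for the Siskel (rows) x Ebert (columns) table using linear
# ("equal-spacing") agreement weights: 1 for exact agreement, 1/2 for
# adjacent categories, 0 for con vs. pro
critic <- matrix(c(24,  8, 13,
                    8, 13, 11,
                   10,  9, 64),
                 nrow = 3, byrow = TRUE)

ncat <- nrow(critic)
w    <- 1 - abs(outer(1:ncat, 1:ncat, "-")) / (ncat - 1)   # weight matrix

p       <- critic / sum(critic)
p_obs_w <- sum(w * p)                              # weighted observed agreement
p_exp_w <- sum(w * outer(rowSums(p), colSums(p)))  # weighted chance agreement

(p_obs_w - p_exp_w) / (1 - p_exp_w)                # about 0.4269
```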