Human Inter-Annotator Agreement (IAA) Measure


A Human Inter-Annotator Agreement (IAA) Measure is an agreement measure for a multi-classifier classification task performed by two or more human annotators or raters.



References

2021

  • (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Cohen's_kappa#Definition Retrieved:2021-8-1.
    • Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The definition of [math]\displaystyle{ \kappa }[/math] is:

      [math]\displaystyle{ \kappa \equiv \dfrac{p_o - p_e}{1 - p_e} = 1- \dfrac{1 - p_o}{1 - p_e}, }[/math]

      where [math]\displaystyle{ p_o }[/math] is the relative observed agreement among raters, and [math]\displaystyle{ p_e }[/math] is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly seeing each category. If the raters are in complete agreement then [math]\displaystyle{ \kappa=1 }[/math]. If there is no agreement among the raters other than what would be expected by chance (as given by [math]\displaystyle{ p_e }[/math]), [math]\displaystyle{ \kappa=0 }[/math]. It is possible for the statistic to be negative, which implies that there is no effective agreement between the two raters or the agreement is worse than random.

      For [math]\displaystyle{ k }[/math] categories, [math]\displaystyle{ N }[/math] observations to categorize and [math]\displaystyle{ n_{ki} }[/math] the number of times rater [math]\displaystyle{ i }[/math] predicted category [math]\displaystyle{ k }[/math]:

      [math]\displaystyle{ p_e = \dfrac{1}{N^2} \sum_k n_{k1}n_{k2} }[/math]

      This is derived from the following construction:

      [math]\displaystyle{ p_e = \sum_k \widehat{p_{k12}} = \sum_k \widehat{p_{k1}}\widehat{p_{k2}} = \sum_k \dfrac{n_{k1}}{N}\dfrac{n_{k2}}{N} = \dfrac{1}{N^2} \sum_k n_{k1}n_{k2} }[/math]

      Where [math]\displaystyle{ \widehat{p_{k12}} }[/math] is the estimated probability that both rater 1 and rater 2 will classify the same item as k, while [math]\displaystyle{ \widehat{p_{k1}} }[/math] is the estimated probability that rater 1 will classify an item as k (and similarly for rater 2).

      The relation [math]\displaystyle{ \widehat{p_{k12}} = \widehat{p_{k1}}\widehat{p_{k2}} }[/math] is based on the assumption that the ratings of the two raters are independent. The term [math]\displaystyle{ \widehat{p_{k1}} }[/math] is estimated by using the number of items classified as k by rater 1 ([math]\displaystyle{ n_{k1} }[/math] ) divided by the total items to classify ([math]\displaystyle{ N }[/math] ): [math]\displaystyle{ \widehat{p_{k1}}= \dfrac{n_{k1}}{N} }[/math] (and similarly for rater 2).
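
The following is a minimal Python sketch of the definition above (not part of the quoted Wikipedia text): it computes [math]\displaystyle{ p_o }[/math] and [math]\displaystyle{ p_e }[/math] directly from two annotation lists and cross-checks the result against scikit-learn's cohen_kappa_score. The rater lists and variable names are hypothetical examples.

  from collections import Counter
  from sklearn.metrics import cohen_kappa_score  # for cross-checking only

  def cohens_kappa(rater1, rater2):
      """Cohen's kappa for two raters who each classify the same N items."""
      assert len(rater1) == len(rater2)
      n = len(rater1)
      # p_o: relative observed agreement among the raters.
      p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
      # p_e = (1 / N^2) * sum_k n_k1 * n_k2: chance agreement estimated from
      # each rater's observed per-category counts.
      counts1, counts2 = Counter(rater1), Counter(rater2)
      p_e = sum(counts1[k] * counts2[k] for k in counts1) / n ** 2
      return (p_o - p_e) / (1 - p_e)

  # Hypothetical annotations of 10 items into two categories.
  rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
  rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"]

  print(cohens_kappa(rater1, rater2))        # 0.4 (p_o = 0.7, p_e = 0.5)
  print(cohen_kappa_score(rater1, rater2))   # should match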

2020

2017

$K =\dfrac{Pr(a) - Pr(e)}{1- Pr(e)}$

2014

  • https://corpuslinguisticmethods.wordpress.com/2014/01/15/what-is-inter-annotator-agreement/
    • QUOTE: … There are basically two ways of calculating inter-annotator agreement. The first approach is nothing more than a percentage of overlapping choices between the annotators. This approach is somewhat biased, because it might be sheer luck that there is a high overlap. Indeed, this might be the case if there are only a very limited number of category levels (only yes versus no, or so), so the chance of having the same annotation is a priori already 1 out of 2. Also, it might be possible that the majority of observations belong to one of the levels of the category, so that the a priori overlap is already potentially high.

      Therefore, an inter-annotator measure has been devised that takes such a priori overlaps into account. That measure is known as Cohen's Kappa. To calculate inter-annotator agreement with Cohen's Kappa, we need an additional package for R, called “irr”. Install it as follows:
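
The R installation and usage snippet from the quoted post is not reproduced above. As an illustrative stand-in (not the post's code), the following minimal Python sketch makes the same point using scikit-learn's cohen_kappa_score instead of the R "irr" package: raw percentage agreement can look high while the chance-corrected kappa stays near zero when one category dominates. The annotator lists are hypothetical.

  # Hypothetical, skewed binary annotation task: 95 of 100 items are "no".
  from sklearn.metrics import cohen_kappa_score

  rater1 = ["no"] * 90 + ["yes"] * 5 + ["no"] * 5
  rater2 = ["no"] * 90 + ["no"] * 5 + ["yes"] * 5

  # Raw percentage agreement: 90/100 = 0.90, which looks high.
  percent_agreement = sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)
  print(percent_agreement)                  # 0.9

  # Chance-corrected agreement: roughly -0.05, i.e. no better than chance.
  print(cohen_kappa_score(rater1, rater2))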

2012a

2012b

2008

2006

1960

The test of agreement comes then with regard to the $1 - p_e$ of the units for which the hypothesis of no association would predict disagreement between the judges. This term will serve as the denominator.

To the extent to which nonchance factors are operating in the direction of agreement, $p_o$ will exceed $p_e$; their difference, $p_o - p_e$, represents the proportion of the cases in which beyond-chance agreement occurred and is the numerator of the coefficient.

The coefficient $\kappa$ is simply the proportion of chance-expected disagreements which do not occur, or alternatively, it is the proportion of agreement after chance agreement is removed from consideration:

$\kappa=\dfrac{p_o-p_e}{1-p_e} \qquad (1)$
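
As a worked numerical illustration of equation (1) (the numbers are hypothetical, not from Cohen's paper): if two judges agree on 70% of the units ($p_o = 0.70$) while chance alone would predict 50% agreement ($p_e = 0.50$), then

$\kappa = \dfrac{0.70 - 0.50}{1 - 0.50} = 0.40,$

i.e., 40% of the chance-expected disagreements did not occur.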