Inter-Rater Reliability (IRR) Score


An Inter-Rater Reliability (IRR) Score is a Measure of Agreement that quantifies the degree of consensus among two or more independent raters who rate, code, or assess the same phenomenon.
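
As a minimal illustration (a sketch with hypothetical labels, not a definitive implementation), the Python snippet below computes two common IRR scores for a pair of raters: the joint probability of agreement and Cohen's kappa, which corrects observed agreement for agreement expected by chance.

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Joint probability of agreement: share of items both raters label identically."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: probability that both raters independently pick the same category.
    p_e = sum((c1[k] / n) * (c2[k] / n) for k in set(r1) | set(r2))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two independent raters on the same 10 items.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"]

print(percent_agreement(rater_a, rater_b))  # 0.8
print(cohens_kappa(rater_a, rater_b))       # 0.6
```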



References

2021

  • (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Inter-rater_reliability Retrieved:2021-8-1.
    • In statistics, inter-rater reliability (also called by various similar names, such as inter-rater agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of agreement among independent observers who rate, code, or assess the same phenomenon.

      In contrast, intra-rater reliability is a measure of the consistency of ratings given by the same person across multiple instances. For example, a grader should not let factors such as fatigue affect grading toward the end of a session, nor let a strong paper influence the grading of the next one; each paper should be graded against the standard rather than by comparison with the others.

      Inter-rater and intra-rater reliability are aspects of test validity. Assessments of them are useful in refining the tools given to human judges, for example, by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.

      There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are joint-probability of agreement, Cohen's kappa, Scott's pi and the related Fleiss' kappa, inter-rater correlation, concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
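
In practice, several of the statistics listed above are available in common Python libraries. The sketch below assumes scikit-learn and statsmodels are installed and uses hypothetical ratings from three raters.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: 8 items scored 0/1/2 by three raters (one column per rater).
ratings = np.array([
    [2, 2, 2],
    [1, 1, 2],
    [0, 0, 0],
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
    [0, 0, 0],
])

# Cohen's kappa compares exactly two raters at a time.
print(cohen_kappa_score(ratings[:, 0], ratings[:, 1]))

# Fleiss' kappa generalizes to any fixed number of raters:
# aggregate_raters converts per-rater labels into per-item category counts.
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts, method='fleiss'))
```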

2020

Translators                   T1, T5   T2   T3   T4
Test Set 1 (1-500 sent.)      S1       S2   D1   D2
Test Set 2 (501-1000 sent.)   D2       D1   S2   S1
Table 3: Distribution of tasks, where S denotes the sentence-level scenario, D the document-level scenario, and 1 and 2 the order of the tasks.