2008 CheapAndFast-ButIsItGood


Subject Headings: Amazon's Mechanical Turk, Text Annotation Task.

Notes

Cited By

2010

Quotes

Abstract

Human linguistic annotation is crucial for many natural language processing tasks but can be expensive and time-consuming. We explore the use of Amazon's Mechanical Turk system, a significantly cheaper and faster method for collecting annotations from a broad base of paid non-expert contributors over the Web. We investigate five tasks: affect recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation. For all five, we show high agreement between Mechanical Turk non-expert annotations and existing gold standard labels provided by expert labelers. For the task of affect recognition, we also show that using non-expert labels for training machine learning algorithms can be as effective as using gold standard annotations from experts. We propose a technique for bias correction that significantly improves annotation quality on two tasks. We conclude that many large labeling tasks can be effectively designed and carried out in this method at a fraction of the usual expense.

5 Bias correction for non-expert annotators

The reliability of individual workers varies. Some are very accurate, while others are more careless and make mistakes; and a small few give very noisy responses. Furthermore, for most AMT data collection experiments, a relatively small number of workers do a large portion of the task, since workers may do as much or as little as they please. Figure 6 shows accuracy rates for individual workers on one task. Both the overall variability, as well as the prospect of identifying high-volume but low-quality workers, suggest that controlling for individual worker quality could yield higher quality overall judgments.

In general, there are at least three ways to enhance quality in the face of worker error. More workers can be used, as described in previous sections. Another method is to use Amazon’s compensation mechanisms to give monetary bonuses to highly-performing workers and deny payments to unreliable ones; this is useful, but beyond the scope of this paper. In this section we explore a third alternative, to model the reliability and biases of individual workers and correct for them.
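
As a concrete point of reference for the first option, a minimal plurality-vote sketch (in Python) over redundant non-expert labels might look like the following; the function name and the example labels are illustrative and not taken from the paper.

from collections import Counter

def plurality_label(labels):
    # Return the most frequent label among the redundant annotations
    # collected for a single example (ties are broken arbitrarily).
    return Counter(labels).most_common(1)[0][0]

# Example: five workers annotate one headline for affect.
print(plurality_label(["anger", "anger", "fear", "anger", "joy"]))  # -> anger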

A wide number of methods have been explored to correct for the bias of annotators. Dawid and Skene (1979) are the first to consider the case of having multiple annotators per example but unknown true labels. They introduce an EM algorithm to simultaneously estimate annotator biases and latent label classes. Wiebe et al. (1999) analyze linguistic annotator agreement statistics to find bias, and use a similar model to correct labels. A large literature in biostatistics addresses this same problem for medical diagnosis. Albert and Dodd (2004) review several related models, but argue they have various shortcomings and emphasize instead the importance of having a gold standard.
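
For reference, a compact sketch of a Dawid and Skene (1979) style EM estimator for categorical labels is given below; the variable names, smoothing constant, and fixed iteration count are illustrative assumptions, not details from that paper or this one.

import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    # `labels` is a list of (item, worker, observed_label) triples with
    # labels in {0, ..., n_classes-1}.  Returns per-item posterior class
    # probabilities and one confusion matrix per worker.
    items = sorted({i for i, _, _ in labels})
    workers = sorted({w for _, w, _ in labels})
    item_idx = {i: n for n, i in enumerate(items)}
    worker_idx = {w: n for n, w in enumerate(workers)}

    # Initialize item-class posteriors from per-item vote proportions.
    post = np.full((len(items), n_classes), smoothing)
    for i, w, l in labels:
        post[item_idx[i], l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices
        # conf[w, true_class, observed_label].
        priors = post.mean(axis=0)
        conf = np.full((len(workers), n_classes, n_classes), smoothing)
        for i, w, l in labels:
            conf[worker_idx[w], :, l] += post[item_idx[i]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute item-class posteriors under current parameters.
        log_post = np.tile(np.log(priors), (len(items), 1))
        for i, w, l in labels:
            log_post[item_idx[i]] += np.log(conf[worker_idx[w], :, l])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf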

Here we take an approach based on gold standard labels, using a small amount of expert-labeled training data in order to correct for the individual biases of different non-expert annotators. The idea is to recalibrate workers' responses to more closely match expert behavior. We focus on categorical examples, though a similar method can be used with numeric data.
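
A minimal sketch of how such a recalibration could be implemented for categorical labels is given below: per-worker confusion matrices are estimated on the small gold-labeled set with Laplace smoothing, and an unseen example is then labeled by summing each worker's log-likelihood contribution under every candidate label, assuming a uniform prior. The function names, data layout, and uniform-prior assumption are illustrative choices; this is one way to realize the idea, not the paper's exact estimator.

import numpy as np

def estimate_confusions(gold_labels, worker_labels, n_classes, smoothing=1.0):
    # Estimate P(worker says j | true label is k) for each worker from a
    # small gold-labeled calibration set, with Laplace smoothing.
    # `gold_labels` maps item -> true label; `worker_labels` is a list of
    # (item, worker, observed_label) triples.
    conf = {}
    for item, worker, obs in worker_labels:
        if item not in gold_labels:
            continue
        mat = conf.setdefault(worker, np.full((n_classes, n_classes), smoothing))
        mat[gold_labels[item], obs] += 1.0
    return {w: m / m.sum(axis=1, keepdims=True) for w, m in conf.items()}

def recalibrated_label(responses, confusions, n_classes):
    # Combine one item's worker responses (a list of (worker, observed_label)
    # pairs) by summing per-worker log-likelihoods under each candidate label.
    scores = np.zeros(n_classes)
    for worker, obs in responses:
        mat = confusions.get(worker)
        if mat is not None:
            scores += np.log(mat[:, obs])
    return int(np.argmax(scores))

Under this sketch a worker who never appears in the gold set contributes nothing; in practice one might back off to a confusion matrix pooled over all workers.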

References

  • Paul S. Albert and Lori E. Dodd. (2004). A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error without a Gold Standard. Biometrics, Vol. 60 (2004), pp. 427-435.
  • Collin F. Baker, Charles J. Fillmore and John B. Lowe. (1998). The Berkeley FrameNet project. In: Proceedings of COLING-ACL 1998.
  • Michele Banko and Eric Brill. (2001). Scaling to Very Very Large Corpora for Natural Language Disambiguation. In: Proceedings of ACL-2001.
  • Junfu Cai, Wee Sun Lee and Yee Whye Teh. (2007). Improving Word Sense Disambiguation Using Topic Features. In: Proceedings of EMNLP-2007.
  • Timothy Chklovski and Rada Mihalcea. (2002). Building a sense tagged corpus with Open Mind Word Expert. In: Proceedings of the Workshop on "Word Sense Disambiguation: Recent Successes and Future Directions", ACL 2002.
  • Timothy Chklovski and Yolanda Gil. (2005). Towards Managing Knowledge Collection from Volunteer Contributors. Proceedings of AAAI Spring Symposium on Knowledge Collection from Volunteer Contributors (KCVC05).
  • Ido Dagan, Oren Glickman and Bernardo Magnini. (2006). The PASCAL Recognising Textual Entailment Challenge. Machine Learning Challenges. Lecture Notes in Computer Science, Vol. 3944, pp. 177-190, Springer, 2006.
  • Wisam Dakka and Panagiotis G. Ipeirotis. (2008). Automatic Extraction of Useful Facet Terms from Text Documents. In: Proceedings of ICDE-2008.
  • A. P. Dawid and A. M. Skene. (1979). Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Applied Statistics, Vol. 28, No. 1, pp. 20-28.
  • Michael Kaisser and John B. Lowe. (2008). A Research Collection of Question-Answer Sentence Pairs. In: Proceedings of LREC-2008.
  • Michael Kaisser, Marti Hearst, and John B. Lowe. (2008). Evidence for Varying Search Results Summary Lengths. In: Proceedings of ACL-2008.
  • Phil Katz, Matthew Singleton, and Richard Wicentowski. (2007). SWAT-MP: The SemEval-2007 Systems for Task 5 and Task 14. In: Proceedings of SemEval-2007.
  • Aniket Kittur, Ed H. Chi, and Bongwon Suh. (2008). Crowdsourcing user studies with Mechanical Turk. In: Proceedings of CHI-2008.
  • Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19:2, June 1993.
  • George A. Miller and William G. Charles. (1991). Contextual Correlates of Semantic Similarity. Language and Cognitive Processes, vol. 6, no. 1, pp. 1-28, 1991.
  • George A. Miller, Claudia Leacock, Randee Tengi, and Ross T. Bunker. (1993). A semantic concordance. In: Proceedings of HLT-1993.
  • Preslav Nakov. (2008). Paraphrasing Verbs for Noun Compound Interpretation. In: Proceedings of the Workshop on Multiword Expressions, LREC-2008.
  • Martha Palmer, Dan Gildea, and Paul Kingsbury. (2005). The Proposition Bank: A Corpus Annotated with Semantic Roles. Computational Linguistics, 31:1.
  • Sameer Pradhan, Edward Loper, Dmitriy Dligach and Martha Palmer. (2007). SemEval-2007 Task-17: English Lexical Sample, SRL and All Words. In: Proceedings of SemEval-2007.
  • James Pustejovsky, Patrick Hanks, Roser Saurí, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro and Marcia Lazo. (2003). The TIMEBANK Corpus. In: Proceedings of Corpus Linguistics 2003, 647-656.
  • Philip Resnik. (1999). Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. JAIR, Volume 11, pages 95-130.
  • Herbert Rubenstein and John B. Goodenough. (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10):627–633.
  • Victor S. Sheng, Foster Provost, and Panagiotis G. Ipeirotis. (2008). Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers. In: Proceedings of KDD-2008.
  • Push Singh. (2002). The public acquisition of commonsense knowledge. In: Proceedings of AAAI Spring Symposium on Acquiring (and Using) Linguistic (and World) Knowledge for Information Access, 2002.
  • Alexander Sorokin and David Forsyth. (2008). Utility data annotation with Amazon Mechanical Turk. To appear in Proceedings of First IEEE Workshop on Internet Vision at CVPR, 2008. See also: http://vision.cs.uiuc.edu/annotation/
  • David G. Stork. (1999). The Open Mind Initiative. IEEE Expert Systems and Their Applications, pp. 16-20, May/June 1999.
  • Carlo Strapparava and Rada Mihalcea. (2007). SemEval-2007 Task 14: Affective Text. In: Proceedings of SemEval-2007.
  • Qi Su, Dmitry Pavlov, Jyh-Herng Chow, and Wendell C. Baker. (2007). Internet-Scale Collection of Human-Reviewed Data. In: Proceedings of WWW-2007.
  • Luis von Ahn and Laura Dabbish. (2004). Labeling Images with a Computer Game. In ACM Conference on Human Factors in Computing Systems, CHI 2004.
  • Luis von Ahn, Mihir Kedia and Manuel Blum. (2006). Verbosity: A Game for Collecting Common-Sense Knowledge. In ACM Conference on Human Factors in Computing Systems, CHI Notes 2006.
  • Ellen Voorhees and Hoa Trang Dang. (2006). Overview of the TREC 2005 question answering track. In: Proceedings of TREC-2005.
  • Janyce M. Wiebe, Rebecca F. Bruce and Thomas P. O'Hara. (1999). Development and use of a gold-standard data set for subjectivity classifications. In: Proceedings of ACL-1999.
  • Annie Zaenen. (Submitted). Do give a penny for their thoughts. International Journal of Natural Language Engineering.

BibTeX

@inproceedings{2008_CheapAndFast-ButIsItGood,
  author    = {Rion Snow and
               Brendan O'Connor and
               Daniel Jurafsky and
               Andrew Y. Ng},
  title     = {Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations
               for Natural Language Tasks},
  booktitle = {Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing,
               (EMNLP 2008), 25-27 October 2008, Honolulu, Hawaii, USA. A meeting of SIGDAT,
               a Special Interest Group of the  ACL},
  pages     = {254--263},
  publisher = {ACL},
  year      = {2008},
  url       = {https://www.aclweb.org/anthology/D08-1027/},
}


Author(s): Daniel Jurafsky, Rion Snow, Brendan O'Connor, Andrew Y. Ng
Title: Cheap and Fast - But is it Good?: Evaluating non-expert annotations for natural language tasks
Venue: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008)
URL: http://www.robotics.stanford.edu/~ang/papers/emnlp08-MechanicalTurk.pdf
Year: 2008