2014 CorpusAnnotationthroughCrowdsourcingTowardsBestPracticeGuidelines


Subject Headings: Crowdsourced Corpus Annotation

Notes

Cited By

Quotes

Author Keywords

Abstract

Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus acquisition and propose a set of best practice guidelines based on our own experience in this area and an overview of related literature. We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using crowdsourcing in a more principled and efficient manner.

1. Introduction

Over the past ten years, Natural Language Processing (NLP) research has been driven forward by a growing volume of annotated corpora, produced by evaluation initiatives such as ACE (ACE, 2004), TAC,[1] SemEval and Senseval,[2] and large annotation projects such as OntoNotes (Hovy et al., 2006). These corpora have been essential for training and domain adaptation of NLP algorithms and their quantitative evaluation, as well as for enabling algorithm comparison and repeatable experimentation. Thanks to these efforts, there are now well-understood best practices in how to create annotations of consistently high quality, by employing, training, and managing groups of linguistic and/or domain experts. This process is referred to as “the science of annotation” (Hovy, 2010).

More recently, the emergence of crowdsourcing platforms (e.g. paid-for marketplaces such as Amazon Mechanical Turk (AMT) and CrowdFlower (CF); games with a purpose; and volunteer-based platforms such as crowdcrafting), coupled with growth in internet connectivity, motivated NLP researchers to experiment with crowdsourcing as a novel, collaborative approach for obtaining linguistically annotated corpora. The advantages of crowdsourcing over expert-based annotation have already been discussed elsewhere (Fort et al., 2011; Wang et al., 2012), but in a nutshell, crowdsourcing tends to be cheaper and faster.

There are now a large and continuously growing number of papers that have used crowdsourcing to create annotated data for training and testing a wide range of NLP algorithms, as detailed in Section 2. and listed in Table 1. As the practice of using crowdsourcing for corpus annotation has become more widespread, so has the need for a best practice synthesis, spanning all three crowdsourcing genres and generalising from the specific NLP annotation tasks reported in individual papers. The meta-review of (Wang et al., 2012) discusses the trade-offs of the three crowdsourcing genres, alongside dimensions such as contributor motivation, setup effort, and human participants. While this review answers some key questions in using crowdsourcing, it does not provide a summary of best practice in how to set up, execute, and manage a complete crowdsourcing annotation project. In this paper we aim to address this gap by putting forward a set of best practice guidelines for crowdsourced corpus acquisition (Section 3.) and introducing GATE Crowd, an extension of the GATE NLP platform that facilitates the creation of crowdsourced tasks based on best practices and their integration into larger NLP processes (Section 4.).

2. Crowdsourcing Approaches

Crowdsourcing paradigms for corpus creation can be placed into one of three categories: mechanised labour, where workers are rewarded financially; games with a purpose, where the task is presented as a game; and altruistic work, relying on goodwill.
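
The practical differences between these genres lie mainly in how contributors are recruited and rewarded. As a minimal sketch of how such project-level choices could be recorded when planning an annotation effort (the class and field names below are illustrative and not part of any existing tool), the three genres might be modelled as follows:

```python
from dataclasses import dataclass
from enum import Enum


class Genre(Enum):
    """The three crowdsourcing genres discussed in this section."""
    MECHANISED_LABOUR = "mechanised labour"       # paid micro-tasks (e.g. AMT, CrowdFlower)
    GAME_WITH_A_PURPOSE = "game with a purpose"   # the task is presented as a game
    ALTRUISTIC_WORK = "altruistic work"           # volunteers contributing out of goodwill


@dataclass
class CrowdsourcingProject:
    """Hypothetical container for basic project-planning choices."""
    task_name: str
    genre: Genre
    reward: str  # e.g. "$0.05 per judgment", "game points", "community recognition"


# Example: planning a paid sentiment-annotation task.
project = CrowdsourcingProject(
    task_name="tweet sentiment annotation",
    genre=Genre.MECHANISED_LABOUR,
    reward="$0.05 per judgment",
)
print(project)
```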

Mechanised labour has been used to create corpora that support a broad range of NLP problems (Table 1). Highly popular are NLP problems that are inherently subjective and cannot yet be reliably solved automatically, such as sentiment and opinion mining (Mellebeek et al., 2010), word sense disambiguation (Parent and Eskenazi, 2010), textual entailment (Negri et al., 2011), and question answering (Heilman and Smith, 2010). Others create corpora of special resource types such as emails (Lawson et al., 2010), Twitter feeds (Finin et al., 2010), and augmented and alternative communication texts (Vertanen and Kristensson, 2011).
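
In the paid-for genre, each annotation task is typically published on a marketplace as a batch of small units of work (HITs on AMT). The sketch below is a hedged illustration using Amazon's MTurk requester API through the boto3 client: it shows roughly how a single sentiment micro-task could be posted to the requester sandbox. The question form, reward, and redundancy settings are placeholder values for illustration, not recommendations drawn from the paper.

```python
import boto3

# Use the MTurk requester sandbox so that test HITs cost nothing.
SANDBOX_URL = "https://mturk-requester-sandbox.us-east-1.amazonaws.com"
mturk = boto3.client("mturk", region_name="us-east-1", endpoint_url=SANDBOX_URL)

# A minimal HTMLQuestion: one sentence, three sentiment options.
question_xml = """
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <!DOCTYPE html>
    <html><body>
      <form name="mturk_form" method="post" action="https://workersandbox.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="ASSIGNMENT_ID_NOT_AVAILABLE"/>
        <p>What is the sentiment of: "The service was surprisingly good"?</p>
        <select name="sentiment">
          <option>positive</option><option>neutral</option><option>negative</option>
        </select>
        <input type="submit"/>
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>300</FrameHeight>
</HTMLQuestion>
"""

# Post the micro-task; several assignments per item allow later adjudication.
hit = mturk.create_hit(
    Title="Label the sentiment of a short sentence",
    Description="Read one sentence and choose positive, neutral or negative.",
    Keywords="nlp, sentiment, annotation",
    Reward="0.05",                      # USD per assignment (illustrative)
    MaxAssignments=5,                   # redundant judgments per sentence
    LifetimeInSeconds=3 * 24 * 3600,    # how long the HIT stays available
    AssignmentDurationInSeconds=300,    # time allowed per worker
    Question=question_xml.strip(),
)
print("Created HIT:", hit["HIT"]["HITId"])
```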

One advantage of crowdsourcing is “access to foreign markets with native speakers of many rare languages” (Zaidan and Callison-Burch, 2011). This feature is particularly useful for those who work on less-resourced languages such as Arabic (El-Haj et al., 2010) and Urdu (Zaidan and Callison-Burch, 2011). Irvine and Klementiev (2010) demonstrated that it is possible to create lexicons between English and 37 out of the 42 low-resource languages they examined.

Games with a purpose (GWAPs) for annotation include Phratris (Attardi, 2010), which annotates sentences with syntactic dependencies; PhraseDetectives (Poesio et al., 2012), for anaphora annotations; and Sentiment Quiz (Scharl et al., 2012), for sentiment. GWAP-based approaches for collecting speech data include VoiceRace (McGraw et al., 2009), a GWAP+MTurk approach in which participants see a definition on a flashcard and must guess and speak the corresponding word, which is then transcribed automatically by a speech recognizer; VoiceScatter (Gruenstein et al., 2009), where players must connect word sets with their definitions; Freitas et al.’s GWAP (Freitas et al., 2010), where players speak answers to graded questions in different knowledge domains; and MarsEscape (Chernova et al., 2010), a two-player game for collecting large-scale data for human-robot interaction.

An early example of leveraging volunteer contributions is Open Mind Word Expert, a Web interface that allows volunteers to tag words with their appropriate sense from WordNet in order to collect training data for the Senseval campaigns (Chklovski and Mihalcea, 2002). Also, the MNH (“Translation for all”) platform tries to foster the formation of a community through functionalities such as social networking and group definition support (Abekawa et al., 2010). Lastly, crowdcrafting.org is a community platform where NLP-based applications can be deployed.

Notably, some volunteer projects that were not conceived with a primary NLP interest have nonetheless delivered results useful for solving NLP problems: (i) Wikipedia; (ii) the Open Mind Common Sense project, which collects general world knowledge from volunteers in multiple languages and is a key source for the ConceptNet semantic network that can enable various text understanding tasks; and (iii) Freebase, a structured, graph-based knowledge repository offering information about almost 22 million entities, constructed both by automatic means and through contributions from thousands of volunteers.

3. Best Practice Guidelines

Conceptually, the process of crowdsourcing language resources can be broken down into four main stages, outlined in Figure 3. and discussed in the following subsections. These stages were identified by generalising our experience with crowdsourced corpus acquisition (Rafelsberger and Scharl, 2009; Scharl et al., 2012; Sabou et al., 2013a; Sabou et al., 2013b) and through a meta-analysis of other crowdsourcing projects summarized in Table 1.
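
A recurring step when carrying such a project through to delivery is consolidating the redundant judgments collected for each unit into a single label. As one illustrative sketch of this consolidation step (not a method prescribed by this paper; more robust models such as MACE (Hovy et al., 2013) also exist), a simple majority vote with an agreement threshold could be implemented as follows:

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# Each judgment pairs a unit (e.g. a sentence id) with one worker's label.
Judgment = Tuple[str, str]  # (unit_id, label)


def majority_vote(
    judgments: List[Judgment], min_agreement: float = 0.6
) -> Dict[str, Optional[str]]:
    """Aggregate redundant crowd judgments per unit.

    Returns the majority label for each unit, or None when agreement
    falls below `min_agreement`.
    """
    per_unit: Dict[str, List[str]] = defaultdict(list)
    for unit_id, label in judgments:
        per_unit[unit_id].append(label)

    consensus: Dict[str, Optional[str]] = {}
    for unit_id, labels in per_unit.items():
        label, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        consensus[unit_id] = label if agreement >= min_agreement else None
    return consensus


# Example: five workers labelled sentence "s1", three labelled "s2".
raw = [("s1", "positive"), ("s1", "positive"), ("s1", "negative"),
       ("s1", "positive"), ("s1", "positive"),
       ("s2", "neutral"), ("s2", "positive"), ("s2", "negative")]
print(majority_vote(raw))   # {'s1': 'positive', 's2': None}
```

Units that fall below the agreement threshold (None in the sketch above) can then be sent back to the crowd for additional judgments or routed to an expert adjudicator.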

References

  1. Abekawa, T., Utiyama, M., Sumita, E., and Kageura, K. (2010). Community-based Construction of Draft and Final Translation Corpus through a Translation Hosting Site Minna no Hon’yaku (MNH). In: Proceedings LREC
  2. ACE. (2004). Annotation Guidelines for Event Detection and Characterization (EDC), Feb. Available at http://www.ldc.upenn.edu/Projects/ACE/.
     Aker, A., El-Haj, M., Albakour, M.-D., and Kruschwitz, U. (2012). Assessing crowdsourcing quality through objective tasks. In: Proceedings LREC, pages 1456–1461.
  3. Attardi, G. (2010). Phratris – A Phrase Annotation Game. In INSEMTIVES Game Idea Challenge
  4. Behrend, T., Sharek, D., Meade, A., and Wiebe, E. (2011). The viability of crowdsourcing for survey research. Behav. Res., 43(3).
  5. Biewald, L. (2012). Massive multiplayer human computation for fun, money, and survival. In Current Trends in Web Engineering, pages 171–176. Springer.
  6. Bontcheva, K., Derczynski, L., and Roberts, I. (2014a). Crowdsourcing named entity recognition and entity linking corpora. In Handbook of Linguistic Annotation
  7. Bontcheva, K., Roberts, I., and Derczynski, L. (2014b). The GATE Crowdsourcing Plugin: Crowdsourcing Annotated Corpora Made Easy. In: Proceedings EACL
  8. Callison-Burch, C. and Dredze, M., editors. (2010). Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk
  9. Chamberlain, J., Poesio, M., and Kruschwitz, U. (2009). A new life for a dead parrot: Incentive structures in the Phrase Detectives game. In: Proceedings of the Webcentives Workshop
  10. Chernova, S., Orkin, J., and Breazeal, C. (2010). Crowdsourcing HRI through Online Multiplayer Games. In Dialog with Robots: Papers from the AAAI Fall Symposium (FS-10-05)
  11. Chklovski, T. and Mihalcea, R. (2002). Building a Sense Tagged Corpus with Open Mind Word Expert. In: Proceedings of the ACL Workshop on Word Sense Disambiguation: Recent Successes and Future Directions
  12. Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popovic, Z., and players, F. (2010). Predicting protein structures with a multiplayer online game. Nature, 466(7307).
  13. Cunningham, H., Tablan, V., Roberts, A., and Bontcheva, K. (2013). Getting more out of biomedical documents with GATE’s full lifecycle open source text analytics. PLoS Computational Biology, 9(2):e1002854.
  14. Doan, A., Ramakrishnan, R., and Halevy, A. Y. (2011). Crowdsourcing Systems on the World-Wide Web. Commun. ACM, 54(4), April.
      El-Haj, M., Kruschwitz, U., and Fox, C. (2010). Using Mechanical Turk to Create a Corpus of Arabic Summaries. In: Proceedings LREC
  15. Feng, D., Besana, S., and Zajac, R. (2009). Acquiring High Quality Non-Expert Knowledge from On-Demand Workforce. In: Proceedings of The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources
  16. Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., and Dredze, M. (2010). Annotating Named Entities in Twitter Data with Crowdsourcing. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  17. Fort, K. and Sagot, B. (2010). Influence of Pre-annotation on POS-tagged Corpus Development. In: Proceedings of the Fourth Linguistic Annotation Workshop
  18. Fort, K., Adda, G., and Cohen, K. (2011). Amazon Mechanical Turk: Gold Mine or Coal Mine? Computational Linguistics, 37(2):413–420.
  19. Freitas, J., Calado, A., Braga, D., Silva, P., and Dias, M. (2010). Crowdsourcing platform for large-scale speech data collection. Proceedings of FALA, Vigo
  20. Gruenstein, E., Mcgraw, I., and Sutherl, A. (2009). A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game. In: Proceedings of The Speech and Language Technology in Education (SLaTE) Workshop
  21. Heilman, M. and Smith, N. A. (2010). Rating Computer-Generated Questions with Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  22. Hong, J. and Baker, C. F. (2011). How Good is the Crowd at ”real” WSD? In: Proceedings of the 5th Linguistic Annotation Workshop
  23. Hovy, E., Marcus, M. P., Palmer, M., Ramshaw, L. A., and Weischedel, R. M. (2006). OntoNotes: The 90% Solution. In: Proceedings NAACL
  24. Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., and Hovy, E. (2013). Learning Whom to Trust with MACE. In: Proceedings of NAACL-HLT, pages 1120–1130.
  25. Hovy, E. (2010). Annotation. In Tutorial Abstracts of ACL
  26. Irvine, A. and Klementiev, A. (2010). Using Mechanical Turk to Annotate Lexicons for Less Commonly Used Languages. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  27. Jha, M., Andreas, J., Thadani, K., Rosenthal, S., and McKeown, K. (2010). Corpus Creation for New Genres: A Crowdsourced Approach to PP Attachment. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  28. Kawrykow, A., Roumanis, G., Kam, A., Kwak, D., Leung, C., Wu, C., Zarour, E., and players, P. (2012). Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS ONE, 7(3):e31362.
  29. Khanna, S., Ratan, A., Davis, J., and Thies, W. (2010). Evaluating and improving the usability of Mechanical Turk for low-income workers in India. In: Proceedings of the first ACM symposium on Computing for Development. ACM.
      Kittur, A., Chi, E. H., and Suh, B. (2008). Crowdsourcing User Studies with Mechanical Turk. In: Proceedings of the 26th Conference on Human Factors in Computing Systems
  30. Laws, F., Scheible, C., and Schütze, H. (2011). Active Learning with Amazon Mechanical Turk. In: Proceedings EMNLP
  31. Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen- Yildiz, M. (2010). Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  32. Mason, W. and Watts, D. J. (2010). Financial incentives and the performance of crowds. ACM SigKDD Explorations Newsletter, 11(2):100–108.
  33. McCreadie, R., Macdonald, C., and Ounis, I. (2012). Identifying Top News Using Crowdsourcing. Information Retrieval. doi:10.1007/s10791-012-9186-z.
  34. McGraw, I., Gruenstein, A., and Sutherland, A. (2009). A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game. In: Proceedings of INTERSPEECH
  35. Mellebeek, B., Benavent, F., Grivolla, J., Codina, J., Costa-jussà, M. R., and Banchs, R. (2010). Opinion Mining of Spanish Customer Comments with Non-Expert Annotations on Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  36. Munro, R., Bethard, S., Kuperman, V., Lai, V. T., Melnick, R., Potts, C., Schnoebelen, T., and Tily, H. (2010). Crowdsourcing and Language Studies: The New Generation of Linguistic Data. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  37. Negri, M. and Mehdad, Y. (2010). Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  38. Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D., and Marchetti, A. (2011). Divide and Conquer: Crowdsourcing the Creation of Cross-Lingual Textual Entailment Corpora. In: Proceedings EMNLP
  39. Parent, G. and Eskenazi, M. (2010). Clustering Dictionary Definitions Using Amazon Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  40. Poesio, M., Kruschwitz, U., Chamberlain, J., Robaldo, L., and Ducceschi, L. (2012). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. Transactions on Interactive Intelligent Systems
  41. Poesio, M., Chamberlain, J., and Kruschwitz, U. (2014). Crowdsourcing. In Handbook of Linguistic Annotation. Springer.
  42. Rafelsberger, W. and Scharl, A. (2009). Games with a Purpose for Social Networking Platforms. In: Proceedings ACM conference on Hypertext and Hypermedia
  43. Rosenthal, S., Lipovsky, W., McKeown, K., Thadani, K., and Andreas, J. (2010). Towards Semi-Automated Annotation for Prepositional Phrase Attachment. In: Proceedings LREC
  44. Sabou, M., Bontcheva, K., Scharl, A., and Föls, M. (2013a). Games with a Purpose or Mechanised Labour? A Comparative Study. In: Proceedings International Conference on Knowledge Management and Knowledge Technologies
  45. Sabou, M., Scharl, A., and Föls, M. (2013b). Crowdsourced Knowledge Acquisition: Towards Hybrid-Genre Workflows. International Journal on Semantic Web and Information Systems, 9(3).
  46. Sayeed, A. B., Rusk, B., Petrov, M., Nguyen, H. C., Meyer, T. J., and Weinberg, A. (2011). Crowdsourcing syntactic relatedness judgements for opinion mining in the study of information technology adoption. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH ’11)
  47. Scharl, A., Sabou, M., Gindl, S., Rafelsberger, W., and Weichselbraun, A. (2012). Leveraging the wisdom of the crowds for the acquisition of multilingual language resources. In: Proceedings LREC
  48. Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y. (2008). Cheap and Fast — but is it Good?: Evaluating Non-Expert Annotations for Natural Language Tasks. In: Proceedings EMNLP
  49. Vertanen, K. and Kristensson, P. O. (2011). The Imagination of Crowds: Conversational AAC Language Modeling using Crowdsourcing and Large Data Sources. In: Proceedings EMNLP
  50. Voyer, R., Nygaard, V., Fitzgerald, W., and Copperman, H. (2010). A Hybrid Model for Annotating Named Entity Training Corpora. In: Proceedings of the Fourth Linguistic Annotation Workshop (LAW IV ’10)
  51. Wang, A., Hoang, C., and Kan, M. Y. (2012). Perspectives on Crowdsourcing Annotations for Natural Language Processing. Language Resources and Evaluation
  52. Yano, T., Resnik, P., and Smith, N. A. (2010). Shedding (a Thousand Points of) Light on Biased Language. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  53. Yetisgen-Yildiz, M., Solti, I., Xia, F., and Halgrim, S. R. (2010). Preliminary Experience with Amazon’s Mechanical Turk for Annotating Medical Named Entities. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
  54. Zaidan, O. F. and Callison-Burch, C. (2011). Crowdsourcing Translation: Professional Quality from Non-Professionals. In: Proceedings ACL


Author: Kalina Bontcheva, Marta Sabou, Leon Derczynski, and Arno Scharl
Title: Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Year: 2014
  1. www.nist.gov/tac
  2. www.senseval.org