2009 AWebSurveyOfTheUseOfActiveLearning

Subject Headings: Semi-Automated Annotation Task.



As supervised machine learning methods for addressing tasks in natural language processing (NLP) prove increasingly viable, the focus of attention naturally shifts towards the creation of training data. The manual annotation of corpora is a tedious and time-consuming process, and obtaining high-quality annotated data constitutes a bottleneck in machine learning for NLP today. Active learning is one way of easing the burden of annotation. This paper presents a first probe into the NLP research community concerning the nature of the annotation projects undertaken in general, and the use of active learning as annotation support in particular.

1 Introduction

Supervised machine learning methods have been successfully applied to many NLP tasks in the last few decades. While these techniques have been shown to work well, they require large amounts of labeled training data in order to achieve high performance. Creating such training data is a tedious, time-consuming and error-prone process. Active learning (AL) is a supervised learning technique that can be used to reduce the annotation effort. The main idea in AL is to put the machine learner in control of the data from which it learns: the learner can ask an oracle (typically a human) about the labels of the examples for which the model learned so far makes unreliable predictions. The active learning process takes as input a set of labeled examples, as well as a larger set of unlabeled examples, and produces a classifier and a relatively small set of newly labeled data. The overall goal is to create as good a classifier as possible without having to mark up and supply the learner with more data than necessary. AL aims at keeping the human annotation effort to a minimum, asking the oracle for advice only where the training utility of the result of such a query is high. Settles (2009) gives a detailed overview of the literature on AL.
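
The process described above can be illustrated with a minimal sketch of pool-based active learning. It is not taken from the paper: the nearest-centroid learner, the margin-based uncertainty score, the one-dimensional data, and all function names (train, predict, uncertainty, active_learn) are assumptions chosen only to keep the example short. The sketch also assumes that the seed set contains at least one example of each class.

```python
def train(labeled):
    """Fit a toy nearest-centroid model: one mean value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(model, x):
    """Return the class whose centroid lies closest to x."""
    return min(model, key=lambda y: abs(x - model[y]))


def uncertainty(model, x):
    """Score x higher the closer it is to being equidistant from
    its two nearest centroids (a small margin means an unreliable
    prediction)."""
    distances = sorted(abs(x - c) for c in model.values())
    if len(distances) < 2:
        return 0.0  # only one class seen so far: no margin to compute
    return -(distances[1] - distances[0])


def active_learn(seed, pool, oracle, budget):
    """Pool-based active learning loop (illustrative only).

    seed   -- initial labeled examples as (x, label) pairs
    pool   -- unlabeled examples
    oracle -- callable returning the label of an example (the annotator)
    budget -- maximum number of queries put to the oracle
    Returns the final model and the newly labeled examples.
    """
    labeled = list(seed)
    pool = list(pool)
    newly_labeled = []
    for _ in range(budget):
        if not pool:
            break
        model = train(labeled)
        # Query the example the current model is least sure about.
        query = max(pool, key=lambda x: uncertainty(model, x))
        pool.remove(query)
        label = oracle(query)
        labeled.append((query, label))
        newly_labeled.append((query, label))
    return train(labeled), newly_labeled


# Hypothetical usage: a simulated oracle labels points by a threshold.
oracle = lambda x: "pos" if x >= 5.0 else "neg"
seed = [(0.0, "neg"), (10.0, "pos")]
pool = [i * 0.5 for i in range(21)]
model, queried = active_learn(seed, pool, oracle, budget=5)
```

In this toy setting the learner spends its query budget on points near the current decision boundary rather than on points it already classifies confidently, which is the intuition behind asking the oracle only where the training utility of a query is high.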

It has been experimentally shown that AL can indeed be successfully applied to a range of NLP tasks including, e.g., text categorization (Lewis and Gale, 1994), part-of-speech tagging (Dagan and Engelson, 1995; Ringger et al., 2007), parsing (Becker and Osborne, 2005), and named entity recognition (Shen et al., 2004; Tomanek et al., 2007). Although such studies have achieved somewhat impressive results in terms of reduced annotation effort, it seems that AL is rarely applied in real-life annotation endeavors.

This paper presents the results of a web survey we arranged to analyze the extent to which AL has been used to support the annotation of textual data in the context of NLP, and to address the reasons why AL has or has not been found applicable to a specific task. Section 2 describes the survey in general, and Section 3 introduces the questions and presents the answers received. Finally, the answers received are discussed in Section 4.


  • 1. Markus Becker and Miles Osborne. (2005). A two-stage method for active learning of statistical grammars. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pages 991--996.
  • 2. Ido Dagan and Sean P. Engelson. (1995). Committee-based sampling for training probabilistic classifiers. In: Proceedings of the 12th International Conference on Machine Learning, pages 150--157.
  • 3. David D. Lewis and William A. Gale. (1994). A sequential algorithm for training text classifiers. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3--12, Dublin, Ireland.
  • 4. Eric Ringger, Peter McClanahan, Robbie Haertel, George Busby, Marc Carmen, James Carroll, Kevin Seppi, and Deryle Lonsdale. (2007). Active learning for part-of-speech tagging: Accelerating corpus annotation. In: Proceedings of the Linguistic Annotation Workshop, pages 101--108.
  • 5. Burr Settles. (2009). Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison.
  • 6. Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. (2004). Multi-criteria-based active learning for named entity recognition. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pages 589-es, Barcelona, Spain. doi:10.3115/1218955.1219030
  • 7. Katrin Tomanek, Joachim Wermter, and Udo Hahn. (2007). An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 486--495.

Author: Katrin Tomanek, Fredrik Olsson
Title: A Web Survey on the Use of Active Learning to Support Annotation of Text Data
Published in: Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing
URL: http://www.aclweb.org/anthology/W/W09/W09-1906.pdf
Year: 2009