2001 ScalingToAVVLargeCorpForNLDisambig

(Banko & Brill, 2001) ⇒ Michele Banko, Eric D. Brill. (2001). “Scaling to Very Very Large Corpora for Natural Language Disambiguation.” In: Meeting of the Association for Computational Linguistics (ACL 2001).

Subject Headings: There is no Data Like More Data Heuristic.

Notes

It provides evidence that co-occurrence statistics are informative when computed over Very Large Corpora.

Cited By

~239 http://scholar.google.com/scholar?cites=7425166327993896057

Quotes

Abstract

The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
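Illustration (not part of the quoted abstract): the claim that correctly labeled training data is free for confusion set disambiguation can be sketched as follows. Any well-edited text that contains a member of a confusion set already supplies the correct label, namely the word the writer actually used. The confusion set {then, than}, the function name `extract_examples`, and the two-token context window are assumptions made for this sketch and are not taken from the paper.

```python
import re

# Hypothetical confusion set chosen for illustration only.
CONFUSION_SET = {"then", "than"}


def extract_examples(text, confusion_set=CONFUSION_SET, window=2):
    """Turn raw text into (left_context, right_context, label) triples.

    The label is simply the confusable word that appears in the text,
    so no manual annotation is needed. This assumes the source text
    is written correctly.
    """
    # Lowercase and keep alphabetic tokens (apostrophes allowed).
    tokens = re.findall(r"[a-z']+", text.lower())
    examples = []
    for i, tok in enumerate(tokens):
        if tok in confusion_set:
            left = tuple(tokens[max(0, i - window):i])
            right = tuple(tokens[i + 1:i + 1 + window])
            examples.append((left, right, tok))
    return examples


if __name__ == "__main__":
    for ex in extract_examples("She is taller than him, and then she left."):
        print(ex)
```

Running the same function over a larger body of text yields proportionally more labeled examples at no annotation cost, which is the property the abstract relies on.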

1 Introduction

Machine learning techniques, which automatically learn linguistic information from online text corpora, have been applied to a number of natural language problems throughout the last decade. A large percentage of papers published in this area involve comparisons of different learning approaches trained and tested with commonly used corpora. While the amount of available online text has been increasing at a dramatic rate, the size of training corpora typically used for learning has not. In part, this is due to the standardization of data sets used within the field, as well as the potentially large cost of annotating data for those learning methods that rely on labeled text.
The empirical NLP community has put substantial effort into evaluating performance of a large number of machine learning methods over fixed, and relatively small, data sets. Yet since we now have access to significantly more data, one has to wonder what conclusions that have been drawn on small data sets may carry over when these learning methods are trained using much larger corpora.
In this paper, we present a study of the effects of data size on machine learning for natural language disambiguation. In particular, we study the problem of selection among confusable words, using orders of magnitude more training data than has ever been applied to this problem. First we show learning curves for four different machine learning algorithms. Next, we consider the efficacy of voting, sample selection and partially unsupervised learning with large training corpora, in hopes of being able to obtain the benefits that come from significantly larger training corpora without incurring too large a cost.
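Illustration (not part of the quoted introduction): the excerpt mentions voting among learners without describing the scheme. One simple form, unweighted majority voting, might look like the sketch below. The three toy classifiers, their rules, and the (left_context, right_context) example format are hypothetical stand-ins and not the learners evaluated in the paper.

```python
from collections import Counter


def majority_vote(classifiers, example):
    """Return the label chosen by the most classifiers.

    Ties go to the label that was voted for first, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    votes = [clf(example) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical toy classifiers over (left_context, right_context) pairs.
def comparative_rule(example):
    left, _right = example
    return "than" if left and left[-1].endswith("er") else "then"


def always_then(example):
    return "then"


def pronoun_rule(example):
    _left, right = example
    pronouns = {"him", "her", "me", "us", "them"}
    return "than" if right and right[0] in pronouns else "then"


if __name__ == "__main__":
    ensemble = [comparative_rule, always_then, pronoun_rule]
    print(majority_vote(ensemble, (("is", "taller"), ("him", "and"))))
    print(majority_vote(ensemble, (("him", "and"), ("she", "left"))))
```

Voting of this kind only helps when the individual classifiers make different errors, so its benefit depends on how much the learners still disagree.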

References

Michele Banko, Eric D. Brill. (2001). “Scaling to Very Very Large Corpora for Natural Language Disambiguation.” In: Meeting of the Association for Computational Linguistics. http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf
