- (Banko & Brill, 2001) ⇒ Michele Banko, Eric D. Brill. (2001). “Scaling to Very Very Large Corpora for Natural Language Disambiguation.” In: Meeting of the Association for Computational Linguistics (ACL 2001).
Subject Headings: There is no Data Like More Data Heuristic.
- It provides evidence that co-occurrence statistics are informative when computed over Very Large Corpora.
- (Jones, 2010) ⇒ Dean Jones. (2010). Blog Entry.
- (Snow et al., 2008) ⇒ Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. (2008). “Cheap and Fast - But is it Good?: Evaluating non-expert annotations for natural language tasks.” In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2008).
- (Kilgarriff & Grefenstette, 2003) ⇒ Adam Kilgarriff, and Gregory Grefenstette. (2003). “Introduction to the Special Issue on the Web as Corpus.” In: Computational Linguistics, 29(3). doi:10.1162/089120103322711569
- Another argument is made vividly by Banko and Brill (2001). They explore the performance of a number of machine learning algorithms (on a representative disambiguation task) as the size of the training corpus grows from a million to a billion words. All the algorithms steadily improve in performance, though the question “Which is best?” gets different answers for different data sizes. The moral: Performance improves with data size, and getting more data will make more difference than fine-tuning algorithms.
- The amount of readily available on-line text has reached hundreds of billions of words and continues to grow. Yet for most core natural language tasks, algorithms continue to be optimized, tested and compared after training on corpora consisting of only one million words or less. In this paper, we evaluate the performance of different learning methods on a prototypical natural language disambiguation task, confusion set disambiguation, when trained on orders of magnitude more labeled data than has previously been used. We are fortunate that for this particular application, correctly labeled training data is free. Since this will often not be the case, we examine methods for effectively exploiting very large corpora when labeled data comes at a cost.
- Machine learning techniques, which automatically learn linguistic information from online text corpora, have been applied to a number of natural language problems throughout the last decade. A large percentage of papers published in this area involve comparisons of different learning approaches trained and tested with commonly used corpora. While the amount of available online text has been increasing at a dramatic rate, the size of training corpora typically used for learning has not. In part, this is due to the standardization of data sets used within the field, as well as the potentially large cost of annotating data for those learning methods that rely on labeled text.
- The empirical NLP community has put substantial effort into evaluating performance of a large number of machine learning methods over fixed, and relatively small, data sets. Yet since we now have access to significantly more data, one has to wonder what conclusions that have been drawn on small data sets may carry over when these learning methods are trained using much larger corpora.
- In this paper, we present a study of the effects of data size on machine learning for natural language disambiguation. In particular, we study the problem of selection among confusable words, using orders of magnitude more training data than has ever been applied to this problem. First we show learning curves for four different machine learning algorithms. Next, we consider the efficacy of voting, sample selection and partially unsupervised learning with large training corpora, in hopes of being able to obtain the benefits that come from significantly larger training corpora without incurring too large a cost.
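The confusion set disambiguation task and the learning-curve experiment described in the quoted passages can be illustrated with a small sketch. It is not the authors' implementation. The confusion set {then, than}, the two-word context window, the naive Bayes learner with add-one smoothing, the function names, and the toy sentences are all assumptions made for illustration. The sketch shows why labels are free for this task (the word actually written serves as the label) and how accuracy can be measured as the training set grows.

```python
import math
import random
from collections import Counter

CONFUSION_SET = ("then", "than")  # illustrative confusion set (assumption)


def extract_examples(sentences, confusion_set=CONFUSION_SET, window=2):
    """Turn tokenized sentences into (features, label) pairs.

    The label is the confusable word that actually occurs in the text,
    so no manual annotation is needed.
    """
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok in confusion_set:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                feats = [f"L:{w}" for w in left] + [f"R:{w}" for w in right]
                examples.append((feats, tok))
    return examples


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative learner)."""

    def __init__(self, labels):
        self.labels = tuple(labels)
        self.label_counts = Counter()
        self.feat_counts = {lab: Counter() for lab in self.labels}
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        v = len(self.vocab) + 1  # +1 reserves mass for unseen features
        best, best_score = None, -math.inf
        for lab in self.labels:
            score = math.log(
                (self.label_counts[lab] + 1) / (total + len(self.labels))
            )
            denom = sum(self.feat_counts[lab].values()) + v
            for f in feats:
                score += math.log((self.feat_counts[lab][f] + 1) / denom)
            if score > best_score:
                best, best_score = lab, score
        return best


def learning_curve(train, test, sizes, labels=CONFUSION_SET):
    """Train on the first n examples for each n in sizes; return (n, accuracy)."""
    curve = []
    for n in sizes:
        model = NaiveBayes(labels).fit(train[:n])
        correct = sum(model.predict(feats) == y for feats, y in test)
        curve.append((n, correct / len(test)))
    return curve


if __name__ == "__main__":
    # Made-up toy sentences, for illustration only (not data from the paper).
    toy = [
        "more data is better than a clever algorithm",
        "this corpus is larger than that one",
        "the error is lower than before",
        "we train the model then we test it",
        "first tokenize then count the words",
        "read the text then label it",
    ]
    rng = random.Random(0)
    corpus = [rng.choice(toy).split() for _ in range(200)]
    examples = extract_examples(corpus)
    train, test = examples[:150], examples[150:]
    for n, acc in learning_curve(train, test, [1, 10, 50, 150]):
        print(f"{n:>4} training examples -> accuracy {acc:.2f}")
```

On a toy corpus this small the curve saturates almost immediately, so the numbers it prints say nothing about real text. To compare learners in the spirit of the study, `NaiveBayes` could be swapped for any other class exposing the same `fit` and `predict` methods, and `sizes` extended over a much larger tokenized corpus.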