2001 ScalingToAVVLargeCorpForNLDisambig

From GM-RKB

Subject Headings: There is no Data Like More Data Heuristic.

Notes

Cited By

2010

2008

2003

Quotes

Abstract

1 Introduction

  • Machine learning techniques, which automatically learn linguistic information from online text corpora, have been applied to a number of natural language problems throughout the last decade. A large percentage of papers published in this area involve comparisons of different learning approaches trained and tested with commonly used corpora. While the amount of available online text has been increasing at a dramatic rate, the size of training corpora typically used for learning has not. In part, this is due to the standardization of data sets used within the field, as well as the potentially large cost of annotating data for those learning methods that rely on labeled text.
  • The empirical NLP community has put substantial effort into evaluating performance of a large number of machine learning methods over fixed, and relatively small, data sets. Yet since we now have access to significantly more data, one has to wonder what conclusions that have been drawn on small data sets may carry over when these learning methods are trained using much larger corpora.
  • In this paper, we present a study of the effects of data size on machine learning for natural language disambiguation. In particular, we study the problem of selection among confusable words, using orders of magnitude more training data than has ever been applied to this problem. First we show learning curves for four different machine learning algorithms. Next, we consider the efficacy of voting, sample selection and partially unsupervised learning with large training corpora, in hopes of being able to obtain the benefits that come from significantly larger training corpora without incurring too large a cost.
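
The learning-curve experiment described in the last quote can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the confusion set `("then", "than")`, the toy sentences, the positional context-word features, and the small add-one-smoothed naive Bayes classifier. The sketch only shows the shape of the procedure: train on progressively larger prefixes of the training data and measure held-out accuracy at each size. With a handful of toy sentences the resulting numbers carry no empirical meaning.

```python
import math
from collections import Counter, defaultdict

# Hypothetical confusion set, chosen only for illustration.
CONFUSION_SET = ("then", "than")


def context_features(tokens, i, window=2):
    """Words within `window` positions of the target, tagged by relative offset."""
    feats = []
    for off in range(-window, window + 1):
        j = i + off
        if off != 0 and 0 <= j < len(tokens):
            feats.append(f"{off}:{tokens[j]}")
    return feats


def extract_examples(sentences):
    """Build (features, label) pairs; the observed word itself serves as the label."""
    examples = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in CONFUSION_SET:
                examples.append((context_features(tokens, i), tok))
    return examples


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        vocab_size = len(self.vocab) + 1  # +1 reserves mass for unseen features
        best, best_score = None, -math.inf
        for label in CONFUSION_SET:
            prior = (self.label_counts[label] + 1) / (total + len(CONFUSION_SET))
            score = math.log(prior)
            denom = sum(self.feat_counts[label].values()) + vocab_size
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


def learning_curve(train, test, sizes):
    """Held-out accuracy after training on the first n examples, for each n."""
    curve = []
    for n in sizes:
        model = NaiveBayes().fit(train[:n])
        correct = sum(model.predict(feats) == label for feats, label in test)
        curve.append((n, correct / len(test)))
    return curve


# Toy data, invented for this sketch.
TRAIN_SENTENCES = [
    "she is taller than her brother",
    "we ate dinner and then went home",
    "this is better than that",
    "first mix the flour then add water",
    "he runs faster than me",
    "back then we had no phones",
    "more than enough",
    "and then it rained",
]
TEST_SENTENCES = [
    "it is bigger than a house",
    "we talked and then left",
]

train = extract_examples(TRAIN_SENTENCES)
test = extract_examples(TEST_SENTENCES)
curve = learning_curve(train, test, sizes=[2, 4, 8])
for n, acc in curve:
    print(f"{n:>3} training examples: accuracy {acc:.2f}")
```

Because the label is simply the word that appears in the text, training examples for this kind of task can be harvested from unannotated text, provided the text is assumed to be mostly correct. Swapping in a different classifier only requires an object with the same `fit` and `predict` methods.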

References



2001 ScalingToAVVLargeCorpForNLDisambig
  • Author: Michele Banko, Eric D. Brill
  • Title: Scaling to Very Very Large Corpora for Natural Language Disambiguation
  • Journal: Meeting of the Association for Computational Linguistics
  • URL: http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf
  • Year: 2001