2000 LessIsMore

From GM-RKB

Subject Headings: Active Learning Task, Support Vector Machines, Document Classification Task.

Notes

Cited By

Quotes

  • We describe a simple active learning heuristic which greatly enhances the generalization behavior of support vector machines (SVMs) on several practical document classification tasks. We observe a number of benefits, the most surprising of which is that a SVM trained on a well-chosen subset of the available corpus frequently performs better than one trained on all available data. The heuristic for choosing this subset is simple to compute, and makes no use of information about the test set. Given that the training time of SVMs depends heavily on the training set size, our heuristic not only offers better performance with fewer data, it frequently does so in less time than the naive approach of training on all available data.

1. Introduction

  • There are many uses for a good document classifier — sorting mail into mailboxes, filtering spam or routing news articles. The problem is that learning to classify documents requires manually labelling more documents than a typical user can tolerate. This makes it an obvious target for active learning, where we can let the system ask for labels only on the documents which will most help the classifier learn. (See Tong and Koller (2000) in this volume for parallel research on this topic.)
  • In this paper, we describe the application of active learning to a support vector machine (SVM) document classifier. Although one can define an “optimal” (but greedy) active learner for SVMs, it is computationally impractical to implement. Instead, we use the simple, computationally efficient heuristic of labeling examples that lie closest to the SVM’s dividing hyperplane. Testing this heuristic on several domains, we observe a number of results, some of which are quite surprising. Compared with a SVM trained on randomly selected examples, the active learning heuristic provides significantly better generalization performance for a given number of training examples.
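The selection rule quoted above (label the examples that lie closest to the SVM's dividing hyperplane) can be illustrated with a minimal sketch. It assumes a linear decision function w·x + b whose weights come from an SVM trained elsewhere. The names `margin_distance` and `select_queries`, the toy weights, and the toy pool are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def margin_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def select_queries(w: Sequence[float], b: float,
                   pool: Sequence[Sequence[float]], k: int = 1) -> List[int]:
    """Return indices of the k unlabeled pool examples nearest the hyperplane."""
    ranked = sorted(range(len(pool)),
                    key=lambda i: margin_distance(w, b, pool[i]))
    return ranked[:k]


# Toy usage: assumed weights of an already trained linear classifier.
w, b = [1.0, -1.0], 0.0
pool = [[3.0, 0.0], [1.0, 0.9], [0.0, 2.0], [2.0, 2.5]]
queries = select_queries(w, b, pool, k=2)
print(queries)  # indices of the examples to send for labeling next
```

In an active learning loop, the selected examples would be labeled, added to the training set, and the classifier retrained before the next round of selection. That loop is omitted here because the quoted text does not specify its details.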

 Key: 2000 LessIsMore
 Author: Greg Schohn, David Cohn
 Title: Less is More: Active Learning with Support Vector Machines
 Url: http://www.cs.cmu.edu/~cohn/papers/alsvm.ps.gz