2004 UnderstandingTheYarowskyAlgorithm

From GM-RKB

Subject Headings: Yarowsky Algorithm, Bootstrapped Weakly-Supervised Learning Algorithm.

Notes

Cited By

Quotes

Abstract

Many problems in computational linguistics are well suited for bootstrapping (semi-supervised learning) techniques. The Yarowsky algorithm is a well-known bootstrapping algorithm, but it is not mathematically well understood. This paper analyzes it as optimizing an objective function. More specifically, a number of variants of the Yarowsky algorithm (though not the original algorithm itself) are shown to optimize either likelihood or a closely related objective function K.

Introduction

Bootstrapping, or semi-supervised learning, has become an important topic in computational linguistics. For many language-processing tasks, there is an abundance of unlabeled data, but labeled data is lacking and too expensive to create in large quantities, making bootstrapping techniques desirable.

The Yarowsky algorithm (Yarowsky, 1995) was one of the first bootstrapping algorithms to become widely known in computational linguistics. The Yarowsky algorithm, in brief, consists of two loops. The “inner loop” or base learner is a supervised learning algorithm. Specifically, Yarowsky uses a simple decision list learner that considers rules of the form, “If instance x contains feature f, then predict label j,” and selects those rules whose precision on the training data is highest.
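
A minimal sketch of such a precision-based decision-list base learner may help make this concrete. It is illustrative only: the function name learn_decision_list and the data layout are assumptions, and the smoothing used in Yarowsky's original decision lists is omitted.

    from collections import defaultdict

    def learn_decision_list(labeled):
        """Hypothetical precision-ranked decision-list learner.
        `labeled` is a list of (features, label) pairs, where `features`
        is an iterable of feature identifiers."""
        label_counts = defaultdict(lambda: defaultdict(int))  # f -> j -> count
        feature_counts = defaultdict(int)                     # f -> count
        for features, label in labeled:
            for f in set(features):
                label_counts[f][label] += 1
                feature_counts[f] += 1

        # One candidate rule per (feature, label) pair: "if x contains f,
        # predict j", scored by its precision on the training data.
        rules = []
        for f, counts in label_counts.items():
            for j, n in counts.items():
                rules.append((n / feature_counts[f], f, j))
        rules.sort(key=lambda rule: rule[0], reverse=True)
        return rules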

The “outer loop” is given a seed set of rules to start with. In each iteration, it uses the current set of rules to assign labels to unlabeled data. It selects those instances on which the base learner’s predictions are most confident, and constructs a labeled training set from them. It then calls the inner loop to construct a new classifier (that is, a new set of rules), and the cycle repeats.
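
The outer loop can be sketched in the same spirit. Again, this is only an illustration of the structure described above: the confidence threshold, the fixed number of rounds, and the classify helper are assumptions, not details fixed by the paper, and base_learner stands for something like the decision-list learner sketched earlier.

    def classify(features, rules):
        """Apply the highest-precision rule whose feature occurs in the
        instance; return (label, confidence), or (None, 0.0) if no rule fires."""
        feats = set(features)
        for precision, f, j in rules:
            if f in feats:
                return j, precision
        return None, 0.0

    def yarowsky_bootstrap(seed_rules, unlabeled, base_learner,
                           rounds=10, threshold=0.95):
        """Generic bootstrapping outer loop: label the unlabeled data with
        the current rules, keep only confident predictions as a training set,
        and retrain the base learner on it."""
        rules = seed_rules
        for _ in range(rounds):
            training_set = []
            for features in unlabeled:
                label, confidence = classify(features, rules)
                if label is not None and confidence >= threshold:
                    training_set.append((features, label))
            if not training_set:
                break
            rules = base_learner(training_set)
        return rules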

An alternative algorithm, co-training (Blum and Mitchell, 1998), has subsequently become more popular, perhaps in part because it has proven amenable to theoretical analysis (Dasgupta, Littman, and McAllester, 2001), in contrast to the Yarowsky algorithm, which is as yet mathematically poorly understood. The current paper aims to rectify that lack, increasing the attractiveness of the Yarowsky algorithm as an alternative to co-training. The Yarowsky algorithm does have the advantage of placing less restriction on the data sets it can be applied to. Co-training requires data attributes to be separable into two views that are conditionally independent given the target label; the Yarowsky algorithm makes no such assumption about its data.

Conclusion

In this paper, we have presented a number of variants of the Yarowsky algorithm, and we have shown that they optimize natural objective functions. We considered first the modified generic Yarowsky algorithm Y-1 and showed that it minimizes the objective function H (which is equivalent to maximizing likelihood), provided that its base learner reduces H.
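
One way to read the stated equivalence, under the assumption that H is the negative log conditional likelihood of the currently labeled instances (the notation Λ for the labeled set, y_x for the current label of x, and π_θ for the model's prediction distribution is introduced here for illustration, not taken from the paper):

    H(\theta) \;=\; -\sum_{x \in \Lambda} \log \pi_\theta(y_x \mid x)
    \quad\Longrightarrow\quad
    \arg\min_\theta H(\theta) \;=\; \arg\max_\theta \prod_{x \in \Lambda} \pi_\theta(y_x \mid x)

so any base-learner step that reduces H increases the likelihood of the current labeling.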

We then considered three families of specific Yarowsky-like algorithms. The Y-1/DL-EM algorithms (Y-1/DL-EM-Λ and Y-1/DL-EM-X) minimize H, but have the disadvantage that the DL-EM base learner bears no resemblance to Yarowsky’s original base learner. A much closer approximation to Yarowsky’s original base learner is provided by DL-1, and the Y-1/DL-1 algorithms (Y-1/DL-1-R and Y-1/DL-1-VS) were shown to minimize the objective function K, an upper bound for H. Finally, the YS algorithms (YS-P, YS-R, and YS-FS) are sequential variants, reminiscent of the Yarowsky-Cautious algorithm of Collins and Singer (1999); we showed that they also minimize K.
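
The role of the upper bound can be stated in one line (a generic observation about bound minimization, not a result specific to the paper): since H(θ) ≤ K(θ) for every parameter setting θ, each step that reduces K also tightens a guarantee on H,

    H(\theta_{t+1}) \;\le\; K(\theta_{t+1}) \;\le\; K(\theta_t)

so driving K down keeps H under an ever-smaller ceiling even when H itself is not optimized directly.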

To the extent that these algorithms capture the essence of the original Yarowsky algorithm, they provide a formal understanding of Yarowsky’s approach. Even if they are deemed to diverge too much from the original to cast light on its workings, they at least represent a new family of bootstrapping algorithms with solid mathematical foundations.

References

  • Abney, Steven. (2002). Bootstrapping. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 360–367.
  • Blum, Avrim, and Tom M. Mitchell. (1998). Combining Labeled and Unlabeled Data with Co-training. In: Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT), pages 92–100. Morgan Kaufmann Publishers.
  • Collins, Michael, and Yoram Singer. (1999). Unsupervised Models for Named Entity Classification. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 100–110.
  • Dasgupta, Sanjoy, Michael Littman, and David McAllester. (2001). PAC Generalization Bounds for Co-training. In: Advances in Neural Information Processing Systems 14 (NIPS).
  • Yarowsky, David. (1995). Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 189–196.


Steven P. Abney (2004). Understanding the Yarowsky Algorithm. http://www.vinartus.net/spa/03c-v7.pdf doi:10.1162/0891201041850876