2010 WordRepresentationsASimpleandGe


Subject Headings: Distributional Word Representation.

Notes

Cited By

Quotes

Abstract

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/
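
The core recipe is feature augmentation: each token's existing supervised feature vector is extended with that token's unsupervised word representation. A minimal sketch of this idea (not the authors' released code; `word_reprs`, `baseline_features`, and the 50-dimensional fallback are assumed placeholders):

<pre>
# Append an unsupervised word representation to a token's existing
# supervised features. `word_reprs` maps lowercased words to fixed-size
# vectors (Brown prefix indicators, C&W or HLBL embeddings, etc.);
# out-of-vocabulary words fall back to a zero vector.
import numpy as np

def augment_token_features(token, baseline_features, word_reprs, dim=50):
    embedding = word_reprs.get(token.lower(), np.zeros(dim))
    return np.concatenate([baseline_features, embedding])
</pre>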

3 Clustering-based word representations

Another type of word representation is to induce a clustering over words. Clustering methods and distributional methods can overlap. For example, Pereira et al. (1993) begin with a cooccurrence matrix and transform this matrix into a clustering.
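
Pereira et al. (1993) actually induce a soft distributional clustering; the sketch below illustrates only the simpler idea in this passage, turning a word-by-context cooccurrence matrix into a hard clustering (here with a k-means stand-in, which is an assumption, not their method):

<pre>
# Illustrative only: cluster words by the similarity of their context
# distributions. cooc is a (V, C) array of word-context counts.
import numpy as np
from sklearn.cluster import KMeans

def cluster_from_cooccurrence(cooc, n_clusters=10, seed=0):
    # Row-normalize counts into context distributions, then cluster words.
    rows = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)
    return KMeans(n_clusters=n_clusters, random_state=seed,
                  n_init=10).fit_predict(rows)
</pre>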

3.1 Brown clustering

The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992). So it is a class-based bigram language model. It runs in time [math]\displaystyle{ O(V·K^2) }[/math], where [math]\displaystyle{ V }[/math] is the size of the vocabulary and [math]\displaystyle{ K }[/math] is the number of clusters. The hierarchical nature of the clustering means that we can choose the word class at several levels in the hierarchy, which can compensate for poor clusters of a small number of words. One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context.
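
Because the clustering is hierarchical, each word's cluster is a bit-string path in a binary merge tree, and prefixes of that path give progressively coarser clusters. A hedged sketch of prefix features (the tab-separated file format follows the widely used wcluster output; the prefix lengths are typical choices, not prescribed here):

<pre>
# Load Brown cluster paths ("bits<TAB>word<TAB>count" lines) and emit
# one categorical feature per path-prefix length, e.g. "brown6=101100".
def load_brown_paths(path):
    paths = {}
    with open(path) as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            paths[word] = bits
    return paths

def brown_prefix_features(word, paths, prefix_lengths=(4, 6, 10, 20)):
    bits = paths.get(word)
    if bits is None:
        return []  # out-of-vocabulary word: no cluster features
    return ["brown%d=%s" % (k, bits[:k]) for k in prefix_lengths]
</pre>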

Brown clusters have been used successfully in a variety of NLP applications: NER (Miller et al., 2004; Liang, 2005; Ratinov & Roth, 2009), PCFG parsing (Candito & Crabbé, 2009), dependency parsing (Koo et al., 2008; Suzuki et al., 2009), and semantic dependency parsing (Zhao et al., 2009).

Martin et al. (1998) present algorithms for inducing hierarchical clusterings based upon word bigram and trigram statistics. Ushioda (1996) presents an extension to the Brown clustering algorithm that learns hierarchical clusterings of words as well as phrases, and applies it to POS tagging.

3.2 Other work on cluster-based word representations

Lin and Wu (2009) present a K-means-like non-hierarchical clustering algorithm for phrases, which uses MapReduce.

HMMs can be used to induce a soft clustering, specifically a multinomial distribution over possible clusters (hidden states). Li and McCallum (2005) use an HMM-LDA model to improve POS tagging and Chinese word segmentation. Huang and Yates (2009) induce a fully-connected HMM, which emits a multinomial distribution over possible vocabulary words. They perform hard clustering using the Viterbi algorithm. (Alternately, they could have kept the soft clustering, with the representation for a particular word token being the posterior probability distribution over the states; see the sketch below.) However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves an F1 lower than that of a baseline CRF chunker (Sha & Pereira, 2003). Goldberg et al. (2009) use an HMM to assign POS tags to words, which in turn improves the accuracy of a PCFG-based Hebrew parser. Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling.
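
A minimal numpy sketch of the hard-vs-soft distinction, on a toy HMM whose parameters are assumed known (in practice they would be learned with EM; no probability scaling is applied, so this is only suitable for short sequences):

<pre>
# Hard vs. soft HMM word clustering on a toy model.
# pi: (K,) initial state probs; A: (K, K) transitions; B: (K, V) emissions;
# obs: list of word ids. Viterbi yields one cluster id per token (hard);
# forward-backward yields a posterior over clusters per token (soft).
import numpy as np

def viterbi(pi, A, B, obs):
    T, K = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)   # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]

def state_posteriors(pi, A, B, obs):
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # forward pass
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # backward pass
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)     # per-token soft clustering
</pre>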

References

  • Bengio, Y. (2008). Neural Net Language Models. Scholarpedia, 3, 3881.
  • 3. Bengio, Y., Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. NIPS.
  • 4. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, A Neural Probabilistic Language Model, The Journal of Machine Learning Research, 3, 3/1/2003
  • 5. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston, Curriculum Learning, Proceedings of the 26th Annual International Conference on Machine Learning, p.41-48, June 14-18, 2009, Montreal, Quebec, Canada doi:10.1145/1553374.1553380
  • 6. Bengio, Y., & Senécal, J.-S. (2003). Quick Training of Probabilistic Neural Nets by Importance Sampling. AISTATS.
  • 7. David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, The Journal of Machine Learning Research, 3, p.993-1022, 3/1/2003
  • 8. Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai, Class-based n-gram Models of Natural Language, Computational Linguistics, v.18 n.4, p.467-479, December 1992
  • 9. Marie Candito, Benoît Crabbé, Improving Generative Statistical Parsing with Semi-supervised Word Clustering, Proceedings of the 11th International Conference on Parsing Technologies, October 07-09, 2009, Paris, France
  • 10. Ronan Collobert, Jason Weston, A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Proceedings of the 25th International Conference on Machine Learning, p.160-167, July 05-09, 2008, Helsinki, Finland doi:10.1145/1390156.1390177
  • 11. Koen Deschacht, Marie-Francine Moens, Semi-supervised Semantic Role Labeling Using the Latent Words Language Model, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, August 06-07, 2009, Singapore
  • 12. S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, R. Harshman, Using Latent Semantic Analysis to Improve Access to Textual Information, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, p.281-285, May 15-19, 1988, Washington, D.C., USA doi:10.1145/57167.57214
  • 13. Elman, J. L. (1993). Learning and Development in Neural Networks: The Importance of Starting Small. Cognition, 48, 71--99.
  • 14. Yoav Goldberg, Reut Tsarfaty, Meni Adler, Michael Elhadad, Enhancing Unlexicalized Parsing Performance Using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, p.327-335, March 30-April 03, 2009, Athens, Greece
  • 15. Honkela, T. (1997). Self-organizing Maps of Words for Natural Language Processing Applications. Proceedings of the International ICSC Symposium on Soft Computing.
  • 16. Honkela, T., Pulkki, V., & Kohonen, T. (1995). Contextual Relations of Words in Grimm Tales, Analyzed by Self-organizing Map. ICANN.
  • 17. Fei Huang, Alexander Yates, Distributional Representations for Handling Sparsity in Supervised Sequence-labeling, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, August 02-07, 2009, Suntec, Singapore
  • 18. Kaski, S. (1998). Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. IJCNN (pp. 413--418).
  • 19. Koo, T., Carreras, X., & Collins, M. (2008). Simple Semi-supervised Dependency Parsing. ACL (pp. 595--603).
  • 20. Vijay Krishnan, Christopher D. Manning, An Effective Two-stage Model for Exploiting Non-local Dependencies in Named Entity Recognition, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, p.1121-1128, July 17-18, 2006, Sydney, Australia doi:10.3115/1220175.1220316
  • 21. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259--284.
  • 22. Wei Li, Andrew McCallum, Semi-supervised Sequence Modeling with Syntactic Topic Models, Proceedings of the 20th National Conference on Artificial Intelligence, p.813-818, July 09-13, 2005, Pittsburgh, Pennsylvania
  • 23. Liang, P. (2005). Semi-supervised Learning for Natural Language. Master's Thesis, Massachusetts Institute of Technology.
  • 24. Dekang Lin, Xiaoyun Wu, Phrase Clustering for Discriminative Learning, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, August 02-07, 2009, Suntec, Singapore
  • 25. Lund, K., & Burgess, C. (1996). Producing High-dimensional Semantic Spaces from Lexical Co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203--208.
  • 26. Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and Associative Priming in High-dimensional Semantic Space. Cognitive Science Proceedings, LEA (pp. 660--665).
  • 27. Sven Martin, Jörg Liermann, Hermann Ney, Algorithms for Bigram and Trigram Word Clustering, Speech Communication, v.24 n.1, p.19-37, April 1, 1998 doi:10.1016/S0167-6393(97)00062-9
  • 28. Miller, S., Guinness, J., & Zamanian, A. (2004). Name Tagging with Word Clusters and Discriminative Training. HLT-NAACL (pp. 337--342).
  • 29. Andriy Mnih, Geoffrey Hinton, Three New Graphical Models for Statistical Language Modelling, Proceedings of the 24th International Conference on Machine Learning, p.641-648, June 20-24, 2007, Corvalis, Oregon doi:10.1145/1273496.1273577
  • 30. Mnih, A., & Hinton, G. E. (2009). A Scalable Hierarchical Distributed Language Model. NIPS (pp. 1081--1088).
  • 31. Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. AISTATS.
  • 32. Fernando Pereira, Naftali Tishby, Lillian Lee, Distributional Clustering of English Words, Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, p.183-190, June 22-26, 1993, Columbus, Ohio doi:10.3115/981574.981598
  • 33. Lev Ratinov, Dan Roth, Design Challenges and Misconceptions in Named Entity Recognition, Proceedings of the Thirteenth Conference on Computational Natural Language Learning, June 04-05, 2009, Boulder, Colorado
  • 34. Ritter, H., & Kohonen, T. (1989). Self-organizing Semantic Maps. Biological Cybernetics, 61, 241--254.
  • 35. Sahlgren, M. (2001). Vector-based Semantic Analysis: Representing Word Meanings based on Random Labels. Proceedings of the Semantic Knowledge Acquisition and Categorisation Workshop, ESSLLI.
  • 36. Sahlgren, M. (2005). An Introduction to Random Indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).
  • 37. Sahlgren, M. (2006). The Word-space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations Between Words in High-dimensional Vector Spaces. Doctoral Dissertation, Stockholm University.
  • 38. Erik F. Tjong Kim Sang, Sabine Buchholz, Introduction to the CoNLL-2000 Shared Task: Chunking, Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, September 13-14, 2000, Lisbon, Portugal doi:10.3115/1117601.1117631
  • 39. Schwenk, H., & Gauvain, J.-L. (2002). Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 765--768). Orlando, Florida.
  • 40. Fei Sha, Fernando Pereira, Shallow Parsing with Conditional Random Fields, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, p.134-141, May 27-June 01, 2003, Edmonton, Canada doi:10.3115/1073445.1073473
  • 41. Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, From Baby Steps to Leapfrog: How "Less is More" in Unsupervised Dependency Parsing, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, p.751-759, June 02-04, 2010, Los Angeles, California
  • 42. Suzuki, J., & Isozaki, H. (2008). Semi-supervised Sequential Labeling and Segmentation Using Giga-word Scale Unlabeled Data. ACL-08: HLT (pp. 665--673).
  • 43. Jun Suzuki, Hideki Isozaki, Xavier Carreras, Michael Collins, An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, August 06-07, 2009, Singapore
  • 44. Turian, J., Ratinov, L., Bengio, Y., & Roth, D. (2009). A Preliminary Evaluation of Word Representations for Named-entity Recognition. NIPS Workshop on Grammar Induction, Representation of Language and Language Learning.
  • 45. Peter D. Turney, Patrick Pantel, From Frequency to Meaning: Vector Space Models of Semantics, Journal of Artificial Intelligence Research, v.37 n.1, p.141-188, January 2010
  • 46. Akira Ushioda, Hierarchical Clustering of Words, Proceedings of the 16th Conference on Computational Linguistics, August 05-09, 1996, Copenhagen, Denmark doi:10.3115/993268.993390
  • 47. Väyrynen, J., & Honkela, T. (2005). Comparison of Independent Component Analysis and Singular Value Decomposition in Word Context Analysis. AKRR'05, International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
  • 48. Väyrynen, J. J., & Honkela, T. (2004). Word Category Maps based on Emergent Features Created by ICA. Proceedings of the STeP'2004 Cognition + Cybernetics Symposium (pp. 173--185). Finnish Artificial Intelligence Society.
  • 49. Väyrynen, J. J., Honkela, T., & Lindqvist, L. (2007). Towards Explicit Semantic Features Using Independent Component Analysis. Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR). Stockholm, Sweden: Swedish Institute of Computer Science.
  • 50. Rehůrek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. LREC.
  • 51. Tong Zhang, David Johnson, A Robust Risk Minimization based Named Entity Recognition System, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, p.204-207, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119210
  • 52. Hai Zhao, Wenliang Chen, Chunyu Kit, Guodong Zhou, Multilingual Dependency Learning: A Huge Feature Engineering Method to Semantic Dependency Parsing, Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, June 04-04, 2009, Boulder, Colorado



 Author: Lev Ratinov, Joseph Turian, Yoshua Bengio
 Title: Word Representations: A Simple and General Method for Semi-supervised Learning
 Year: 2010