2010 WordRepresentationsASimpleandGe


Subject Headings: Distributional Word Representation.

Notes

Cited By

Quotes

Abstract

If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word representations improves the accuracy of these baselines. We find further improvements by combining different word representations. You can download our word features, for off-the-shelf use in existing NLP systems, as well as our code, here: http://metaoptimize.com/projects/wordreprs/
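
The core recipe is feature augmentation: each token's existing supervised feature vector is extended with that token's unsupervised word representation. A minimal sketch of this idea (not the authors' released code; `word_reprs`, `baseline_features`, and the 50-dimensional fallback are assumed placeholders):

<pre>
# Append an unsupervised word representation to a token's existing
# supervised features. `word_reprs` maps lowercased words to fixed-size
# vectors (Brown prefix indicators, C&W or HLBL embeddings, etc.);
# out-of-vocabulary words fall back to a zero vector.
import numpy as np

def augment_token_features(token, baseline_features, word_reprs, dim=50):
    embedding = word_reprs.get(token.lower(), np.zeros(dim))
    return np.concatenate([baseline_features, embedding])
</pre>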

3 Clustering-based word representations

Another type of word representation is to induce a clustering over words. Clustering methods and distributional methods can overlap. For example, Pereira et al. (1993) begin with a cooccurrence matrix and transform this matrix into a clustering.
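
Pereira et al. (1993) actually induce a soft distributional clustering; the sketch below illustrates only the simpler idea in this passage, turning a word-by-context cooccurrence matrix into a hard clustering (here with a k-means stand-in, which is an assumption, not their method):

<pre>
# Illustrative only: cluster words by the similarity of their context
# distributions. cooc is a (V, C) array of word-context counts.
import numpy as np
from sklearn.cluster import KMeans

def cluster_from_cooccurrence(cooc, n_clusters=10, seed=0):
    # Row-normalize counts into context distributions, then cluster words.
    rows = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)
    return KMeans(n_clusters=n_clusters, random_state=seed,
                  n_init=10).fit_predict(rows)
</pre>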

3.1 Brown clustering

The Brown algorithm is a hierarchical clustering algorithm which clusters words to maximize the mutual information of bigrams (Brown et al., 1992). So it is a class-based bigram language model. It runs in time [math]\displaystyle{ O(V·K^2) }[/math], where [math]\displaystyle{ V }[/math] is the size of the vocabulary and [math]\displaystyle{ K }[/math] is the number of clusters. The hierarchical nature of the clustering means that we can choose the word class at several levels in the hierarchy, which can compensate for poor clusters of a small number of words. One downside of Brown clustering is that it is based solely on bigram statistics, and does not consider word usage in a wider context.
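
Because the clustering is hierarchical, each word's cluster is a bit-string path in a binary merge tree, and prefixes of that path give progressively coarser clusters. A hedged sketch of prefix features (the tab-separated file format follows the widely used wcluster output; the prefix lengths are typical choices, not prescribed here):

<pre>
# Load Brown cluster paths ("bits<TAB>word<TAB>count" lines) and emit
# one categorical feature per path-prefix length, e.g. "brown6=101100".
def load_brown_paths(path):
    paths = {}
    with open(path) as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            paths[word] = bits
    return paths

def brown_prefix_features(word, paths, prefix_lengths=(4, 6, 10, 20)):
    bits = paths.get(word)
    if bits is None:
        return []  # out-of-vocabulary word: no cluster features
    return ["brown%d=%s" % (k, bits[:k]) for k in prefix_lengths]
</pre>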

Brown clusters have been used successfully in a variety of NLP applications: NER (Miller et al., 2004; Liang, 2005; Ratinov & Roth, 2009), PCFG parsing (Candito & Crabbé, 2009), dependency parsing (Koo et al., 2008; Suzuki et al., 2009), and semantic dependency parsing (Zhao et al., 2009).

Martin et al. (1998) present algorithms for inducing hierarchical clusterings based upon word bigram and trigram statistics. Ushioda (1996) presents an extension to the Brown clustering algorithm that learns hierarchical clusterings of words as well as phrases, and applies it to POS tagging.

3.2 Other work on cluster-based word representations

Lin and Wu (2009) present a K-means-like non-hierarchical clustering algorithm for phrases, which uses MapReduce.

HMMs can be used to induce a soft clustering, specifically a multinomial distribution over possible clusters (hidden states). Li and McCallum (2005) use an HMM-LDA model to improve POS tagging and Chinese word segmentation. Huang and Yates (2009) induce a fully-connected HMM, which emits a multinomial distribution over possible vocabulary words. They perform hard clustering using the Viterbi algorithm. (Alternately, they could have kept the soft clustering, with the representation for a particular word token being the posterior probability distribution over the states; see the sketch below.) However, the CRF chunker in Huang and Yates (2009), which uses their HMM word clusters as extra features, achieves an F1 lower than that of a baseline CRF chunker (Sha & Pereira, 2003). Goldberg et al. (2009) use an HMM to assign POS tags to words, which in turn improves the accuracy of a PCFG-based Hebrew parser. Deschacht and Moens (2009) use a latent-variable language model to improve semantic role labeling.
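
A minimal numpy sketch of the hard-vs-soft distinction, on a toy HMM whose parameters are assumed known (in practice they would be learned with EM; no probability scaling is applied, so this is only suitable for short sequences):

<pre>
# Hard vs. soft HMM word clustering on a toy model.
# pi: (K,) initial state probs; A: (K, K) transitions; B: (K, V) emissions;
# obs: list of word ids. Viterbi yields one cluster id per token (hard);
# forward-backward yields a posterior over clusters per token (soft).
import numpy as np

def viterbi(pi, A, B, obs):
    T, K = len(obs), len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)   # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    return states[::-1]

def state_posteriors(pi, A, B, obs):
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]    # forward pass
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])  # backward pass
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)     # per-token soft clustering
</pre>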

References

  • Bengio, Y. (2008). Neural Net Language Models. Scholarpedia, 3, 3881.
  • 3. Bengio, Y., Ducharme, R., & Vincent, P. (2001). A Neural Probabilistic Language Model. NIPS.
  • 4. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, Christian Janvin, A Neural Probabilistic Language Model, The Journal of Machine Learning Research, 3, 3/1/2003
  • 5. Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston, Curriculum Learning, Proceedings of the 26th Annual International Conference on Machine Learning, p.41-48, June 14-18, 2009, Montreal, Quebec, Canada doi:10.1145/1553374.1553380
  • 6. Bengio, Y., & Senécal, J.-S. (2003). Quick Training of Probabilistic Neural Nets by Importance Sampling. AISTATS.
  • 7. David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, The Journal of Machine Learning Research, 3, p.993-1022, 3/1/2003
  • 8. Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, Jenifer C. Lai, Class-based n-gram Models of Natural Language, Computational Linguistics, v.18 n.4, p.467-479, December 1992
  • 9. Marie Candito, Benoît Crabbé, Improving Generative Statistical Parsing with Semi-supervised Word Clustering, Proceedings of the 11th International Conference on Parsing Technologies, October 07-09, 2009, Paris, France
  • 10. Ronan Collobert, Jason Weston, A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning, Proceedings of the 25th International Conference on Machine Learning, p.160-167, July 05-09, 2008, Helsinki, Finland doi:10.1145/1390156.1390177
  • 11. Koen Deschacht, Marie-Francine Moens, Semi-supervised Semantic Role Labeling Using the Latent Words Language Model, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, August 06-07, 2009, Singapore
  • 12. S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, R. Harshman, Using Latent Semantic Analysis to Improve Access to Textual Information, Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, p.281-285, May 15-19, 1988, Washington, D.C., USA doi:10.1145/57167.57214
  • 13. Elman, J. L. (1993). Learning and Development in Neural Networks: The Importance of Starting Small. Cognition, 48, 71--99.
  • 14. Yoav Goldberg, Reut Tsarfaty, Meni Adler, Michael Elhadad, Enhancing Unlexicalized Parsing Performance Using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities, Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, p.327-335, March 30-April 03, 2009, Athens, Greece
  • 15. Honkela, T. (1997). Self-organizing Maps of Words for Natural Language Processing Applications. Proceedings of the International ICSC Symposium on Soft Computing.
  • 16. Honkela, T., Pulkki, V., & Kohonen, T. (1995). Contextual Relations of Words in Grimm Tales, Analyzed by Self-organizing Map. ICANN.
  • 17. Fei Huang, Alexander Yates, Distributional Representations for Handling Sparsity in Supervised Sequence-labeling, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, August 02-07, 2009, Suntec, Singapore
  • 18. Kaski, S. (1998). Dimensionality Reduction by Random Mapping: Fast Similarity Computation for Clustering. IJCNN (pp. 413--418).
  • 19. Koo, T., Carreras, X., & Collins, M. (2008). Simple Semi-supervised Dependency Parsing. ACL (pp. 595--603).
  • 20. Vijay Krishnan, Christopher D. Manning, An Effective Two-stage Model for Exploiting Non-local Dependencies in Named Entity Recognition, Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, p.1121-1128, July 17-18, 2006, Sydney, Australia doi:10.3115/1220175.1220316
  • 21. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259--284.
  • 22. Wei Li, Andrew McCallum, Semi-supervised Sequence Modeling with Syntactic Topic Models, Proceedings of the 20th National Conference on Artificial Intelligence, p.813-818, July 09-13, 2005, Pittsburgh, Pennsylvania
  • 23. Liang, P. (2005). Semi-supervised Learning for Natural Language. Master's Thesis, Massachusetts Institute of Technology.
  • 24. Dekang Lin, Xiaoyun Wu, Phrase Clustering for Discriminative Learning, Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, August 02-07, 2009, Suntec, Singapore
  • 25. Lund, K., & Burgess, C. (1996). Producing High-dimensional Semantic Spaces from Lexical Co-occurrence. Behavior Research Methods, Instruments, & Computers, 28, 203--208.
  • 26. Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and Associative Priming in High-dimensional Semantic Space. Cognitive Science Proceedings, LEA (pp. 660--665).
  • 27. Sven Martin, Jörg Liermann, Hermann Ney, Algorithms for Bigram and Trigram Word Clustering, Speech Communication, v.24 n.1, p.19-37, April 1, 1998 doi:10.1016/S0167-6393(97)00062-9
  • 28. Miller, S., Guinness, J., & Zamanian, A. (2004). Name Tagging with Word Clusters and Discriminative Training. HLT-NAACL (pp. 337--342).
  • 29. Andriy Mnih, Geoffrey Hinton, Three New Graphical Models for Statistical Language Modelling, Proceedings of the 24th International Conference on Machine Learning, p.641-648, June 20-24, 2007, Corvalis, Oregon doi:10.1145/1273496.1273577
  • 30. Mnih, A., & Hinton, G. E. (2009). A Scalable Hierarchical Distributed Language Model. NIPS (pp. 1081--1088).
  • 31. Morin, F., & Bengio, Y. (2005). Hierarchical Probabilistic Neural Network Language Model. AISTATS.
  • 32. Fernando Pereira, Naftali Tishby, Lillian Lee, Distributional Clustering of English Words, Proceedings of the 31st Annual Meeting on Association for Computational Linguistics, p.183-190, June 22-26, 1993, Columbus, Ohio doi:10.3115/981574.981598
  • 33. Lev Ratinov, Dan Roth, Design Challenges and Misconceptions in Named Entity Recognition, Proceedings of the Thirteenth Conference on Computational Natural Language Learning, June 04-05, 2009, Boulder, Colorado
  • 34. Ritter, H., & Kohonen, T. (1989). Self-organizing Semantic Maps. Biological Cybernetics, 61, 241--254.
  • 35. Sahlgren, M. (2001). Vector-based Semantic Analysis: Representing Word Meanings based on Random Labels. Proceedings of the Semantic Knowledge Acquisition and Categorisation Workshop, ESSLLI.
  • 36. Sahlgren, M. (2005). An Introduction to Random Indexing. Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE).
  • 37. Sahlgren, M. (2006). The Word-space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations Between Words in High-dimensional Vector Spaces. Doctoral Dissertation, Stockholm University.
  • 38. Erik F. Tjong Kim Sang, Sabine Buchholz, Introduction to the CoNLL-2000 Shared Task: Chunking, Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning, September 13-14, 2000, Lisbon, Portugal doi:10.3115/1117601.1117631
  • 39. Schwenk, H., & Gauvain, J.-L. (2002). Connectionist Language Modeling for Large Vocabulary Continuous Speech Recognition. International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 765--768). Orlando, Florida.
  • 40. Fei Sha, Fernando Pereira, Shallow Parsing with Conditional Random Fields, Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, p.134-141, May 27-June 01, 2003, Edmonton, Canada doi:10.3115/1073445.1073473
  • 41. Valentin I. Spitkovsky, Hiyan Alshawi, Daniel Jurafsky, From Baby Steps to Leapfrog: How "Less is More" in Unsupervised Dependency Parsing, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, p.751-759, June 02-04, 2010, Los Angeles, California
  • 42. Suzuki, J., & Isozaki, H. (2008). Semi-supervised Sequential Labeling and Segmentation Using Giga-word Scale Unlabeled Data. ACL-08: HLT (pp. 665--673).
  • 43. Jun Suzuki, Hideki Isozaki, Xavier Carreras, Michael Collins, An Empirical Study of Semi-supervised Structured Conditional Models for Dependency Parsing, Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2, August 06-07, 2009, Singapore
  • 44. Turian, J., Ratinov, L., Bengio, Y., & Roth, D. (2009). A Preliminary Evaluation of Word Representations for Named-entity Recognition. NIPS Workshop on Grammar Induction, Representation of Language and Language Learning.
  • 45. Peter D. Turney, Patrick Pantel, From Frequency to Meaning: Vector Space Models of Semantics, Journal of Artificial Intelligence Research, v.37 n.1, p.141-188, January 2010
  • 46. Akira Ushioda, Hierarchical Clustering of Words, Proceedings of the 16th Conference on Computational Linguistics, August 05-09, 1996, Copenhagen, Denmark doi:10.3115/993268.993390
  • 47. Väyrynen, J., & Honkela, T. (2005). Comparison of Independent Component Analysis and Singular Value Decomposition in Word Context Analysis. AKRR'05, International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning.
  • 48. Väyrynen, J. J., & Honkela, T. (2004). Word Category Maps based on Emergent Features Created by ICA. Proceedings of the STeP'2004 Cognition + Cybernetics Symposium (pp. 173--185). Finnish Artificial Intelligence Society.
  • 49. Väyrynen, J. J., Honkela, T., & Lindqvist, L. (2007). Towards Explicit Semantic Features Using Independent Component Analysis. Proceedings of the Workshop Semantic Content Acquisition and Representation (SCAR). Stockholm, Sweden: Swedish Institute of Computer Science.
  • 50. Rehůrek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. LREC.
  • 51. Tong Zhang, David Johnson, A Robust Risk Minimization based Named Entity Recognition System, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, p.204-207, May 31, 2003, Edmonton, Canada doi:10.3115/1119176.1119210
  • 52. Hai Zhao, Wenliang Chen, Chunyu Kit, Guodong Zhou, Multilingual Dependency Learning: A Huge Feature Engineering Method to Semantic Dependency Parsing, Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, June 04-04, 2009, Boulder, Colorado



 Author: Lev Ratinov, Joseph Turian, Yoshua Bengio
 Title: Word Representations: A Simple and General Method for Semi-supervised Learning
 Year: 2010