2001 MiningTheWebForSynonyms

(Turney, 2001) ⇒ Peter D. Turney. (2001). “Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL.” In: Proceedings of the 12th European Conference on Machine Learning (ECML 2001). doi:10.1007/3-540-44795-4_42

Subject Headings: Pointwise Mutual Information and Information Retrieval, Synonym Extraction Algorithm, Lexical Semantic Similarity Function.

Notes

Cited By

2002

(Turney, 2002) ⇒ Peter D. Turney. (2002). “Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews.” In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL 2002). doi:10.3115/1073083.1073153
- QUOTE: The PMI-IR algorithm is employed to estimate the semantic orientation of a phrase (Turney, 2001). PMI-IR uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words or phrases.

Quotes

Abstract

This paper presents a simple unsupervised learning algorithm for recognizing synonyms, based on statistical data acquired by querying a Web search engine. The algorithm, called PMI-IR, uses Pointwise Mutual Information (PMI) and Information Retrieval (IR) to measure the similarity of pairs of words. PMI-IR is empirically evaluated using 80 synonym test questions from the Test of English as a Foreign Language (TOEFL) and 50 synonym test questions from a collection of tests for students of English as a Second Language (ESL). On both tests, the algorithm obtains a score of 74%. PMI-IR is contrasted with Latent Semantic Analysis (LSA), which achieves a score of 64% on the same 80 TOEFL questions. The paper discusses potential applications of the new unsupervised learning algorithm and some implications of the results for LSA and LSI (Latent Semantic Indexing).
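
The abstract names the two ingredients of PMI-IR (pointwise mutual information and search-engine statistics) but does not spell out the scoring formula. The sketch below shows one way such a score could be computed, assuming PMI is estimated from document hit counts. The function names (`pmi`, `best_choice`, `toy_hits`), the example words, and every count in `TOY_COUNTS` are invented for illustration; a real system would replace `toy_hits` with queries to a search engine.

```python
import math
from typing import Callable, Iterable

# Hypothetical hit-count source: returns how many documents contain ALL of
# the given terms. In a real setting this would be a search-engine query.
HitCounter = Callable[..., int]


def pmi(hits_xy: int, hits_x: int, hits_y: int, n_docs: int) -> float:
    """Estimate log2(p(x, y) / (p(x) * p(y))) from document counts."""
    if min(hits_xy, hits_x, hits_y) == 0:
        return float("-inf")  # no co-occurrence evidence at all
    return math.log2((hits_xy * n_docs) / (hits_x * hits_y))


def best_choice(problem: str, choices: Iterable[str],
                hits: HitCounter, n_docs: int) -> str:
    """Return the candidate whose PMI with the problem word is highest."""
    hits_problem = hits(problem)
    return max(
        choices,
        key=lambda c: pmi(hits(problem, c), hits_problem, hits(c), n_docs),
    )


# Invented counts, for illustration only (not taken from the paper).
TOY_COUNTS = {
    frozenset({"levied"}): 1_000,
    frozenset({"imposed"}): 5_000,
    frozenset({"believed"}): 20_000,
    frozenset({"requested"}): 15_000,
    frozenset({"correlated"}): 3_000,
    frozenset({"levied", "imposed"}): 400,
    frozenset({"levied", "believed"}): 50,
    frozenset({"levied", "requested"}): 60,
    frozenset({"levied", "correlated"}): 3,
}


def toy_hits(*terms: str) -> int:
    """Look up the invented document count for a conjunction of terms."""
    return TOY_COUNTS.get(frozenset(terms), 0)


if __name__ == "__main__":
    candidates = ["imposed", "believed", "requested", "correlated"]
    print(best_choice("levied", candidates, toy_hits, n_docs=1_000_000))
```

Because the problem word's hit count and the collection size are the same for every candidate, ranking by this PMI estimate is equivalent to ranking by hits(problem AND choice) / hits(choice), so only two counts per candidate are needed.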

References

1. Kenneth W. Church, Hanks, P.: Word Association Norms, Mutual Information and Lexicography. In: Proceedings of the 27th Annual Conference of the Association of Computational Linguistics, (1989) 76-83.
2. Kenneth W. Church, Gale, W., Hanks, P., Hindle, D.: Using Statistics in Lexical Analysis. In: Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon. New Jersey: Lawrence Erlbaum (1991) 115-164.
3. AltaVista, AltaVista Company, Palo Alto, California, http://www.altavista.com/.
4. Test of English as a Foreign Language (TOEFL), Educational Testing Service, Princeton, New Jersey, http://www.ets.org/.
5. Tatsuki, D.: Basic 2000 Words - Synonym Match 1. In: Interactive JavaScript Quizzes for ESL Students, http://www.aitech.ac.jp/~iteslj/quizzes/js/dt/mc-2000-01syn.html (1998).
6. Thomas K. Landauer, Dumais, S.T.: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of the Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104 (1997) 211-240.
7. Deerwester, S., Dumais, S.T., Furnas, G.W., Thomas K. Landauer, Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41 (1990) 391-407.
8. Berry, M.W., Dumais, S.T., Letsche, T.A.: Computational Methods for Intelligent Information Access. Proceedings of Supercomputing ’95, San Diego, California, (1995).
9. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing. Cambridge, Massachusetts: MIT Press (1999).
10. John Rupert Firth: A Synopsis of Linguistic Theory 1930-1955. In Studies in Linguistic Analysis, pp. 1-32. Oxford: Philological Society (1957). Reprinted in F.R. Palmer (ed.), Selected Papers of John Rupert Firth 1952-1959, London: Longman (1968).
11. AltaVista: AltaVista Advanced Search Cheat Sheet, AltaVista Company, Palo Alto, California, http://doc.altavista.com/adv_search/syntax.html (2001).
12. Christiane Fellbaum (ed.): WordNet: An Electronic Lexical Database. Cambridge, Massachusetts: MIT Press (1998). For more information: http://www.cogsci.princeton.edu/~wn/.
13. Haase, K.: Interlingual BRICO. IBM Systems Journal, 39 (2000) 589-596. For more information: http://www.framerd.org/brico/.
14. Vossen, P. (ed.): EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Dordrecht, Netherlands: Kluwer (1998). See: http://www.hum.uva.nl/~ewn/.
15. Turney, P.D.: Learning Algorithms for Keyphrase Extraction. Information Retrieval, 2 (2000) 303-336.
16. Gregory Grefenstette: Finding Semantic Similarity in Raw Text: The Deese Antonyms. In: R. Goldman, P. Norvig, Eugene Charniak and B. Gale (eds.), Working Notes of the AAAI Fall Symposium on Probabilistic Approaches to Natural Language. AAAI Press (1992) 61-65.
17. Hinrich Schütze: Word Space. In: S.J. Hanson, J.D. Cowan, and C.L. Giles (eds.), Advances in Neural Information Processing Systems 5, San Mateo California: Morgan Kaufmann (1993) 895-902.
18. Dekang Lin: Automatic Retrieval and Clustering of Similar Words. In: Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, Montreal (1998) 768-773.
19. Richardson, R., Smeaton, A., Murphy, J.: Using WordNet as a Knowledge Base for Measuring Semantic Similarity between Words. In: Proceedings of AICS Conference. Trinity College, Dublin (1994).
20. Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in ISA Hierarchies. Journal of Documentation, 49 (1993) 188-207.
21. Philip Resnik: Semantic Similarity in a Taxonomy: An Information-based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research, 11 (1998) 95-130.
22. Jiang, J., Conrath, D.: Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of the 10th International Conference on Research on Computational Linguistics, Taiwan, (1997).
23. Brin, S., Rajeev Motwani, Ullman, J., Tsur, S.: Dynamic Itemset Counting and Implication Rules for Market Basket Data. In: Proceedings of the 1997 ACM-SIGMOD International Conference on the Management of Data (1997) 255-264.
24. Sullivan, D.: Search Engine Sizes. SearchEngineWatch.com, internet.com Corporation, Darien, Connecticut, http://searchenginewatch.com/reports/sizes.html (2000).
25. Papadimitriou, C.H., Prabhakar Raghavan, Tamaki, H., Vempala, S.: Latent Semantic Indexing: A Probabilistic Analysis. In: Proceedings of the Seventeenth ACM-SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (1998).
26. Karen Spärck Jones: Comparison Between TREC2 and TREC3. In: D. Harman (ed.), The Third Text REtrieval Conference (TREC3), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) C1-C4.
27. Buckley, C., Gerard M. Salton, Allan, J., Singhal, A.: Automatic Query Expansion Using SMART: TREC 3. In: The Third Text REtrieval Conference (TREC3), D. Harman (ed.), National Institute of Standards and Technology Special Publication 500-226, Gaithersburg, Maryland (1994) 69-80.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2001 MiningTheWebForSynonyms	Peter D. Turney			Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL		ECML 2001	http://arxiv.org/ftp/cs/papers/0212/0212033.pdf	10.1007/3-540-44795-4_42		2001
