2009 MatchingReviewsToObjects


Subject Headings: Entity Mention Normalization, Product Review.



We develop a general method to match unstructured text reviews to a structured list of objects. For this, we propose a language model for generating reviews that incorporates a description of objects and a generic review language model. This mixture model gives us a principled method to find, given a review, the object most likely to be the topic of the review. Extensive experiments and analysis on reviews from Yelp show that our language model-based method vastly outperforms traditional tf-idf-based methods.
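The matching idea in the abstract can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: each review word is assumed to be drawn either from a uniform distribution over the words in an object's description (with weight `alpha`) or from a generic review language model (with weight `1 - alpha`), and the object with the highest review likelihood is returned. The function names, the whitespace tokenizer, the add-one smoothing, and the default `alpha` are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization.
    return text.lower().split()


def build_generic_lm(reviews):
    """Return a function giving an add-one-smoothed unigram probability
    estimated from a collection of background reviews."""
    counts = Counter(w for r in reviews for w in tokenize(r))
    total = sum(counts.values())
    vocab = len(counts)

    def prob(word):
        # The extra +1 in the denominator reserves mass for unseen words.
        return (counts[word] + 1) / (total + vocab + 1)

    return prob


def log_likelihood(review, description, generic_prob, alpha=0.5):
    """Log-probability of the review under a two-component mixture:
    alpha * P_object(w) + (1 - alpha) * P_generic(w)."""
    desc_words = set(tokenize(description))
    score = 0.0
    for w in tokenize(review):
        # Assumption: the object model is uniform over description words.
        p_obj = 1.0 / len(desc_words) if w in desc_words else 0.0
        score += math.log(alpha * p_obj + (1 - alpha) * generic_prob(w))
    return score


def best_match(review, objects, generic_prob, alpha=0.5):
    """Return the name of the object most likely to have generated the review.
    `objects` maps an object name to its textual description."""
    return max(
        objects,
        key=lambda name: log_likelihood(review, objects[name], generic_prob, alpha),
    )


# Hypothetical toy data, for illustration only.
background = [
    "the food was great and the service was friendly",
    "great coffee but slow service",
]
objects = {
    "Casa Mexicana": "casa mexicana 12 main street austin",
    "Blue Bottle Cafe": "blue bottle cafe 5 oak avenue portland",
}
generic = build_generic_lm(background)
match = best_match("loved the tacos at casa on main street", objects, generic)
```

In this toy setup the review shares the words "casa", "main", and "street" with the first description and no words with the second, so the first object receives the higher likelihood. Words that appear in no description contribute the same generic-model term to every candidate, which is why they do not affect the ranking.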

2. Related Work

Opinion topic identification is the work closest to ours. In a recent paper, Stoyanov and Cardie (2008) approach this problem by treating it as an exercise in topic coreference resolution. Though they have to deal with topic ambiguities and a lack of explicit topic mentions as in our case, their notion of a topic is not driven by a structured listing. There has been some work on fine-grained opinion extraction from reviews (Kobayashi et al., 2004; Yi et al., 2003; Popescu and Etzioni, 2005; Hu and Liu, 2004); see (Pang and Lee, 2008) for a comprehensive survey. Most of this body of work focused on identifying product features of the object under review, rather than identifying the product itself. Note that while a dictionary of products is often more readily available than a dictionary of product features, identifying objects of reviews is non-trivial even with the help of the former. Indeed, it has been reported that lexicon-lookup methods have limited success on general non-product review texts (Stoyanov and Cardie, 2008). In general, this line of work is more rooted in the information extraction literature, where text spans covering the object (or features of the object) were extracted as the first step; in contrast, we do not have an explicit extraction phase. Since the (very extensive) list of candidate objects is given as input, our task is to rank all matching objects, and in this sense it is closer in nature to information retrieval tasks. There has been some work on detecting reviews in large-scale collections (Ng et al., 2006; Barbosa et al., 2009); this is a logical step that precedes the review matching step, the topic of our paper.

Language modeling is becoming a powerful paradigm in the realm of information retrieval applications (Ponte and Croft, 1998; Hiemstra, 1998; Song and Croft, 1999; Lafferty and Zhai, 2003; Zhai, 2008). The basic theme behind language modeling is to first postulate a model for each document and then, for a given query, to select the document that is most likely to have generated the query; smoothing is an important means to manage data sparsity in language models (Zhai and Lafferty, 2004). As noted earlier, language models developed for IR are unsuitable for our setting. Furthermore, there are opportunities, such as the presence of structure in our data, which we use in this work (Section 3.2). In fact, in a subsequent paper, we show how a language model specific to each attribute can further improve the accuracy of review matching (Dalvi et al., 2009).
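The query-likelihood idea described above can be written compactly. The following is a sketch in standard textbook notation, not a formula taken from this paper: $q$ is a query, $d$ a document, $c(w, d)$ the count of word $w$ in $d$, $|d|$ the document length, $C$ the whole collection, and $\lambda$ an interpolation weight. The smoothing shown is Jelinek-Mercer interpolation, one of the methods studied by Zhai and Lafferty (2004); the choice of this particular method here is an assumption made for illustration.

```latex
\hat{d} \;=\; \arg\max_{d} \, P(q \mid d)
        \;=\; \arg\max_{d} \prod_{w \in q} P(w \mid \theta_d),
\qquad
P(w \mid \theta_d) \;=\; (1 - \lambda)\,\frac{c(w, d)}{|d|} \;+\; \lambda\, P(w \mid C).
```

Without the second term, any query word absent from $d$ would drive $P(q \mid d)$ to zero, which is the sparsity problem that smoothing addresses.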

Entity matching is a well-studied topic in databases. There are several approaches to entity matching: non-relational approaches, which consider pairwise attribute similarities between entities (Newcombe et al., 1959; Fellegi and Sunter, 1969); relational approaches, which exploit the relationships that exist between entities (Ananthakrishna et al., 2002; Kalashnikov et al., 2005); and collective approaches, which exploit the relationship between various matching decisions (Bhattacharya and Getoor, 2007; McCallum and Wellner, 2004). The EROCS system (Chakaravarthy et al., 2006), which uses information extraction and entity matching, is closest in spirit to our problem; they, however, employ tf-idf to match, which we show to be significantly sub-optimal in our setting.


  • 1. Rohit Ananthakrishna, Surajit Chaudhuri, Venkatesh Ganti, Eliminating fuzzy duplicates in data warehouses, Proceedings of the 28th International Conference on Very Large Data Bases, p.586-597, August 20-23, 2002, Hong Kong, China
  • 2. Luciano Barbosa, Ravi Kumar, Bo Pang, Andrew Tomkins, For a few dollars less: identifying review pages sans human labels, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, May 31-June 05, 2009, Boulder, Colorado.
  • 3. Indrajit Bhattacharya, Lise Getoor, Collective entity resolution in relational data, ACM Transactions on Knowledge Discovery from Data (TKDD), v.1 n.1, p.5-es, March 2007 doi:10.1145/1217299.1217304
  • 4. C. Cardie. (1997). Empirical methods in information extraction. AI Magazine, 18(4):65--80.
  • 5. (Chakaravarthy et al., 2006) ⇒ Venkatesan T. Chakaravarthy, Himanshu Gupta, Prasan Roy, and Mukesh Mohania. (2006). “Efficiently Linking Text Documents with Relevant Structured Information.” In: Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB 2006).
  • 6. N. Dalvi, R. Kumar, B. Pang, and A. Tomkins. (2009). A translation model for matching reviews to objects. Manuscript.
  • 7. I. P. Fellegi and A. B. Sunter. (1969). A theory for record linkage. Journal of the American Statistical Association, 64:1183--1210.
  • 8. D. Hiemstra and W. Kraaij. (1998). Twenty-one at TREC7: Ad-hoc and cross-language track. In: Proceedings of the 7th TREC, pages 174--185.
  • 9. Djoerd Hiemstra, A Linguistically Motivated Probabilistic Model of Information Retrieval, Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, p.569-584, September 21-23, 1998
  • 10. Minqing Hu, Bing Liu, Mining opinion features in customer reviews, Proceedings of the 19th national conference on Artificial intelligence, p.755-760, July 25-29, 2004, San Jose, California
  • 11. D. V. Kalashnikov, S. Mehrotra, and Z. Chen. (2005). Exploiting relationships for domain-independent data cleaning. In: Proceedings of the 5th SDM.
  • 12. N. Kobayashi, K. Inui, Y. Matsumoto, K. Tateishi, and T. Fukushima. (2004). Collecting evaluative expressions for opinion extraction. In: Proceedings of the 1st IJCNLP, pages 596--605.
  • 13. J. Lafferty and C. Zhai. (2003). Probabilistic relevance models based on document and query generation. In W. B. Croft and J. Lafferty, editors, Language Modeling and Information Retrieval. Academic Publishers.
  • 14. A. McCallum and B. Wellner. (2004). Conditional models of identity uncertainty with application to noun coreference. In: Proceedings of the 17th NIPS.
  • 15. H. B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. (1959). Automatic linkage of vital records. Science, 130:954--959.
  • 16. Vincent Ng, Sajib Dasgupta, S. M. Niaz Arifin, Examining the role of linguistic knowledge sources in the automatic identification and classification of reviews, Proceedings of the COLING/ACL on Main conference poster sessions, p.611-618, July 17-18, 2006, Sydney, Australia
  • 17. Bo Pang, Lillian Lee, Opinion Mining and Sentiment Analysis, Foundations and Trends in Information Retrieval, v.2 n.1-2, p.1-135, January 2008 doi:10.1561/1500000011.
  • 18. Jay M. Ponte, W. Bruce Croft, A language modeling approach to information retrieval, Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, p.275-281, August 24-28, 1998, Melbourne, Australia doi:10.1145/290941.291008
  • 19. Ana-Maria Popescu, Oren Etzioni, Extracting product features and opinions from reviews, Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, p.339-346, October 06-08, 2005, Vancouver, British Columbia, Canada doi:10.3115/1220575.1220618.
  • 20. Gerard M. Salton, A. Wong, C. S. Yang, A Vector Space Model for Automatic Indexing, Communications of the ACM, v.18 n.11, p.613-620, Nov. 1975 doi:10.1145/361219.361220
  • 21. Sunita Sarawagi, Information Extraction, Foundations and Trends in Databases, v.1 n.3, p.261-377, March 2008 doi:10.1561/1900000003.
  • 22. Fei Song, W. Bruce Croft, A general language model for information retrieval (poster abstract), Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, p.279-280, August 15-19, 1999, Berkeley, California, United States doi:10.1145/312624.312698
  • 23. Veselin Stoyanov, Claire Cardie, Topic identification for fine-grained opinion analysis, Proceedings of the 22nd International Conference on Computational Linguistics, p.817-824, August 18-22, 2008, Manchester, United Kingdom
  • 24. Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, Wayne Niblack, Sentiment Analyzer: Extracting Sentiments about a Given Topic using Natural Language Processing Techniques, Proceedings of the Third IEEE International Conference on Data Mining, p.427, November 19-22, 2003.
  • 25. Chengxiang Zhai, John Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems (TOIS), v.22 n.2, p.179-214, April 2004 doi:10.1145/984321.984322
  • 26. ChengXiang Zhai, Statistical Language Models for Information Retrieval: A Critical Review, Foundations and Trends in Information Retrieval, v.2 n.3, p.137-213, March 2008 doi:10.1561/1500000008

2009 MatchingReviewsToObjects
Author: Bo Pang, Andrew Tomkins, Ravi Kumar, Nilesh Dalvi
Title: Matching Reviews to Objects Using a Language Model
Journal: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing
Title URL: http://www.aclweb.org/anthology-new/D/D09/D09-1064.pdf
Year: 2009