2004 AStudyofSmoothingMethodsforLang


Subject Headings: Unsmoothed Maximum-Likelihood Language Model, Maximum-Likelihood Language Model.

Notes

Cited By

Quotes

Abstract

Language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. The basic idea of these approaches is to estimate a language model for each document, and to then rank documents by the likelihood of the query according to the estimated language model. A central issue in language model estimation is smoothing, the problem of adjusting the maximum likelihood estimator to compensate for data sparseness. In this article, we study the problem of language model smoothing and its influence on retrieval performance. We examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. Experimental results show that not only is the retrieval performance generally sensitive to the smoothing parameters, but also the sensitivity pattern is affected by the query type, with performance being more sensitive to smoothing for verbose queries than for keyword queries. Verbose queries also generally require more aggressive smoothing to achieve optimal performance. This suggests that smoothing plays two different roles: to make the estimated document language model more accurate and to "explain" the noninformative words in the query. In order to decouple these two distinct roles of smoothing, we propose a two-stage smoothing strategy, which yields better sensitivity patterns and facilitates the setting of smoothing parameters automatically. We further propose methods for estimating the smoothing parameters automatically. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
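
The ranking idea in the abstract (estimate a smoothed language model per document, then rank documents by query likelihood) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives no formulas, so the Jelinek-Mercer, Dirichlet-prior, and two-stage estimators below are the forms commonly given in the smoothing literature. The function names, the default values `lam=0.5` and `mu=2000.0`, and the use of the collection model as the query background model in the two-stage estimator are all assumptions made for illustration.

```python
import math
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model p(w|C) over all documents."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(w, doc_counts, doc_len, p_coll, lam=0.5):
    """Linear interpolation of the document ML estimate with p(w|C)."""
    p_ml = doc_counts[w] / doc_len if doc_len else 0.0
    return (1 - lam) * p_ml + lam * p_coll.get(w, 0.0)


def dirichlet(w, doc_counts, doc_len, p_coll, mu=2000.0):
    """Dirichlet-prior smoothing: add mu pseudo-counts drawn from p(w|C)."""
    return (doc_counts[w] + mu * p_coll.get(w, 0.0)) / (doc_len + mu)


def two_stage(w, doc_counts, doc_len, p_coll, mu=2000.0, lam=0.5):
    """Stage 1: Dirichlet smoothing of the document model.
    Stage 2: interpolation with a query background model.
    Assumption: the query background model is approximated by p(w|C)."""
    stage1 = dirichlet(w, doc_counts, doc_len, p_coll, mu=mu)
    return (1 - lam) * stage1 + lam * p_coll.get(w, 0.0)


def query_log_likelihood(query, doc, p_coll, smooth, **params):
    """log p(q|d) under a unigram model with the given smoothing function."""
    counts = Counter(doc)
    score = 0.0
    for w in query:
        p = smooth(w, counts, len(doc), p_coll, **params)
        if p <= 0.0:
            # A word unseen in the whole collection gets zero probability.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, smooth, **params):
    """Return document indices sorted by decreasing query likelihood."""
    p_coll = collection_model(docs)
    scored = [
        (query_log_likelihood(query, doc, p_coll, smooth, **params), i)
        for i, doc in enumerate(docs)
    ]
    return [i for _, i in sorted(scored, key=lambda pair: -pair[0])]


if __name__ == "__main__":
    docs = [
        ["language", "model", "smoothing"],
        ["speech", "recognition", "model"],
        ["retrieval", "of", "documents"],
    ]
    query = ["smoothing", "model"]
    for smooth in (jelinek_mercer, dirichlet, two_stage):
        print(smooth.__name__, rank(query, docs, smooth))
```

Without smoothing, any document missing a single query word would receive zero likelihood. Here such documents are still scored through the collection probability, which is the data-sparseness role of smoothing that the abstract describes. The second interpolation step in `two_stage` corresponds to the separate role of accounting for noninformative query words.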



Author: John D. Lafferty, ChengXiang Zhai
Title: A Study of Smoothing Methods for Language Models Applied to Information Retrieval
DOI: 10.1145/984321.984322
Year: 2004