2015 CounterfactualRiskMinimizationL

From GM-RKB

Subject Headings: Counterfactual Risk Minimization; Feedback Data; Counterfactual Evaluation.

Notes

Cited By

Quotes

Abstract

We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., ad ranking) for a given input (e.g., query) and observes bandit feedback (e.g., user clicks on presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method - called Policy Optimizer for Exponential Models (POEM) - for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems showing substantially improved robustness and generalization performance compared to the state-of-the-art.
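The abstract's recipe can be made concrete with a small numerical sketch: estimate a candidate policy's risk by reweighting the logged losses with clipped propensity ratios pi_h(y|x)/pi_0(y|x), then add a penalty proportional to the standard error of those weighted losses, in the spirit of the CRM principle. This is a minimal illustration only; the function name, hyperparameter values, and toy data below are assumptions, not taken from the paper.

  import numpy as np

  def crm_objective(h_propensities, log_propensities, losses, clip_M=10.0, lam=0.5):
      """Variance-penalized, clipped propensity-weighted risk (CRM-style sketch).

      h_propensities   : pi_h(y_i | x_i), probabilities the candidate policy assigns
                         to the logged actions (hypothetical inputs).
      log_propensities : pi_0(y_i | x_i), propensities recorded by the logging policy.
      losses           : observed loss delta_i for each logged (x_i, y_i) pair.
      clip_M, lam      : clipping constant and variance-penalty weight (hyperparameters).
      """
      n = len(losses)
      # Clipped importance weights keep the variance of the estimator under control.
      weights = np.minimum(h_propensities / log_propensities, clip_M)
      u = losses * weights                         # per-example weighted losses
      risk = u.mean()                              # clipped propensity-weighted risk estimate
      penalty = lam * np.sqrt(u.var(ddof=1) / n)   # sample-variance regularizer
      return risk + penalty

  # Toy usage: compare two candidate policies on the same logged bandit data.
  rng = np.random.default_rng(0)
  log_p = rng.uniform(0.1, 0.9, size=1000)               # logged propensities pi_0
  losses = rng.uniform(-1.0, 0.0, size=1000)              # e.g. negated click feedback
  cand_a = np.clip(log_p + rng.normal(0, 0.05, 1000), 0.01, 1.0)   # stays close to pi_0
  cand_b = rng.uniform(0.01, 1.0, size=1000)                       # diverges from pi_0
  print(crm_objective(cand_a, log_p, losses), crm_objective(cand_b, log_p, losses))

A policy whose propensities diverge sharply from the logging policy tends to produce high-variance weighted losses, so the penalty term discourages it even when its plain propensity-weighted risk looks low. This variance sensitivity is what the abstract identifies as the difference between CRM and unpenalized propensity-weighted empirical risk minimization; POEM instantiates the same idea for stochastic linear rules in structured output prediction.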

References



Author: Thorsten Joachims, Adith Swaminathan
Title: Counterfactual Risk Minimization: Learning from Logged Bandit Feedback
Year: 2015