2004 ProbabilityProductKernels

From GM-RKB

Subject Headings: Probability Product Kernels.

Notes

Cited By


Quotes

Abstract

The advantages of discriminative learning algorithms and kernel machines are combined with generative modeling using a novel kernel between distributions. In the probability product kernel, data points in the input space are mapped to distributions over the sample space and a general inner product is then evaluated as the integral of the product of pairs of distributions. The kernel is straightforward to evaluate for all exponential family models such as multinomials and Gaussians and yields interesting nonlinear kernels. Furthermore, the kernel is computable in closed form for latent distributions such as mixture models, hidden Markov models and linear dynamical systems. For intractable models, such as switching linear dynamical systems, structured mean-field approximations can be brought to bear on the kernel evaluation. For general distributions, even if an analytic expression for the kernel is not feasible, we show a straightforward sampling method to evaluate it. Thus, the kernel permits discriminative learning methods, including support vector machines, to exploit the properties, metrics and invariances of the generative models we infer from each datum. Experiments are shown using multinomial models for text, hidden Markov models for biological data sets and linear dynamical systems for time series data.
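The abstract defines the probability product kernel as the integral of the product of two distributions, notes a closed form for Gaussians, and mentions a straightforward sampling method for general distributions. The sketch below (not the authors' code; a minimal illustration assuming the ρ = 1 "expected likelihood" special case) shows both: for two Gaussians, ∫ N(x; μ₁, Σ₁) N(x; μ₂, Σ₂) dx = N(μ₁; μ₂, Σ₁ + Σ₂), and the same integral can be estimated by sampling from one distribution and averaging the other's density.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate normal density N(x; mu, cov); x may be a batch of shape (n, d)."""
    x = np.atleast_2d(np.asarray(x, float))
    d = x.shape[1]
    diff = x - np.asarray(mu, float)
    inv = np.linalg.inv(cov)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(cov) ** -0.5
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff))

def expected_likelihood_kernel(mu1, cov1, mu2, cov2):
    """Closed form: the product integral of two Gaussians is N(mu1; mu2, cov1 + cov2)."""
    cov = np.asarray(cov1, float) + np.asarray(cov2, float)
    return gaussian_pdf(mu1, mu2, cov)[0]

def mc_kernel(mu1, cov1, mu2, cov2, n=200_000, seed=0):
    """Sampling estimate: E_{x ~ p1}[p2(x)] equals the same product integral."""
    rng = np.random.default_rng(seed)
    x = rng.multivariate_normal(mu1, cov1, size=n)
    return gaussian_pdf(x, mu2, cov2).mean()

# Two example Gaussians (illustrative values, not from the paper).
mu1, cov1 = np.array([0.0, 0.0]), np.eye(2)
mu2, cov2 = np.array([1.0, -0.5]), 0.5 * np.eye(2)
exact = expected_likelihood_kernel(mu1, cov1, mu2, cov2)
approx = mc_kernel(mu1, cov1, mu2, cov2)
```

Since the kernel is an inner product between distributions, a Gram matrix built this way over per-datum fitted models is positive semi-definite and can be handed directly to an SVM, which is the discriminative use the abstract describes.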

References

  • 1. Y. Bengio, P. Frasconi, Input-output HMMs for Sequence Processing, IEEE Transactions on Neural Networks, v.7 n.5, p.1231-1249, September 1996 doi:10.1109/72.536317
  • 2. A. Bhattacharyya. On a Measure of Divergence Between Two Statistical Populations Defined by their Probability Distributions. Bull. Calcutta Math Soc., 1943.
  • 3. M. Collins and N. Duffy. Convolution Kernels for Natural Language. In Neural Information Processing Systems 14, 2002.
  • 4. C. Cortes, P. Haffner, and M. Mohri. Rational Kernels. In Neural Information Processing Systems 15, 2002.
  • 5. Douglass R. Cutting, David R. Karger, Jan O. Pedersen, John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, p.318-329, June 21-24, 1992, Copenhagen, Denmark doi:10.1145/133160.133214
  • 6. R. Davidson and J. MacKinnon. Estimation and Inference in Econometrics. Oxford University Press, 1993.
  • 7. Zoubin Ghahramani, Michael I. Jordan, Factorial Hidden Markov Models, Machine Learning, v.29 n.2-3, p.245-273, Nov./Dec. 1997 doi:10.1023/A:1007425814087
  • 8. M. Goldszmidt and M. Sahami. A Probabilistic Approach to Full-text Document Clustering. Technical Report, Stanford University, 1998. Database Group Publication 60.
  • 9. T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer-Verlag, 2001.
  • 10. D. Haussler. Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, University of California at Santa Cruz, 1999.
  • 11. T. Jaakkola. Tutorial on Variational Approximation Methods. In Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.
  • 12. Tommi S. Jaakkola, David Haussler, Exploiting Generative Models in Discriminative Classifiers, Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, p.487-493, July 1999
  • 13. T. Jaakkola, M. Meila, and T. Jebara. Maximum Entropy Discrimination. In Neural Information Processing Systems 12, 1999.
  • 14. T. Jebara and R. Kondor. Bhattacharyya and Expected Likelihood Kernels. In Conference on Learning Theory, 2003.
  • 15. Thorsten Joachims, Nello Cristianini, John Shawe-Taylor, Composite Kernels for Hypertext Categorisation, Proceedings of the Eighteenth International Conference on Machine Learning, p.250-257, June 28-July 01, 2001
  • 16. M. Jordan and C. Bishop. Introduction to Graphical Models. In Progress, 2004.
  • 17. H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized Kernels Between Labeled Graphs. In Machine Learning: Tenth International Conference, ICML, 2003.
  • 18. R. Kondor and T. Jebara. A Kernel Between Sets of Vectors. In Machine Learning: Tenth International Conference, 2003.
  • 19. J. Lafferty and G. Lebanon. Information Diffusion Kernels. In Neural Information Processing Systems, 2002.
  • 20. C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch String Kernels for SVM Protein Classification. In Neural Information Processing Systems, 2002.
  • 21. P. J. Moreno, P. P. Ho, and N. Vasconcelos. A Kullback-Leibler Divergence based Kernel for SVM Classification in Multimedia Applications. In Neural Information Processing Systems, 2004.
  • 22. C. Ong, A. Smola, and R. Williamson. Superkernels. In Neural Information Processing Systems, 2002.
  • 23. Cheng Soon Ong, Xavier Mary, Stéphane Canu, Alexander J. Smola, Learning with Non-positive Kernels, Proceedings of the Twenty-first International Conference on Machine Learning, p.81, July 04-08, 2004, Banff, Alberta, Canada doi:10.1145/1015330.1015443
  • 24. V. Pavlovic, J. M. Rehg, and J. MacCormick. Learning Switching Linear Models of Human Motion. In Neural Information Processing Systems 13, Pages 981-987, 2000.
  • 25. Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers Inc., San Francisco, CA, 1988
  • 26. John C. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization, Advances in Kernel Methods: Support Vector Learning, MIT Press, Cambridge, MA, 1999
  • 27. Bernhard Scholkopf, Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, MIT Press, Cambridge, MA, 2001
  • 28. B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.
  • 29. R. H. Shumway and D. S. Stoffer. An Approach to Time Series Smoothing and Forecasting Using the EM Algorithm. J. of Time Series Analysis, 3(4):253-264, 1982.
  • 30. B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall: London., 1986.
  • 31. F. Topsøe. Some Inequalities for Information Divergence and Related Measures of Discrimination. J. of Inequalities in Pure and Applied Mathematics, 2(1), 1999.
  • 32. K. Tsuda, T. Kin, and K. Asai. Marginalized Kernels for Biological Sequences. Bioinformatics, 18 (90001):S268-S275, 2002.
  • 33. V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
  • 34. S. V. N. Vishwanathan and A. J. Smola. Fast Kernels for String and Tree Matching. In Neural Information Processing Systems 15, 2002.
  • 35. C. Watkins. Advances in Kernel Methods, Chapter Dynamic Alignment Kernels. MIT Press, 2000.


 Author: Tony Jebara, Risi Kondor, Andrew Howard
 Title: Probability Product Kernels
 Year: 2004