- (Ng & Jordan, 2001) ⇒ Andrew Y. Ng, and Michael I. Jordan. (2001). “On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes.” In: Proceeding of NIPS Conference, 14 (NIPS 2001).
- ~561 http://scholar.google.com/scholar?q=.+%22On+Discriminative+vs.+Generative+Classifiers%3A+A+comparison+of+logistic+regression+and+naive+Bayes%22
- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
We compare discriminative and generative learning as typified by logistic regression and naive Bayes. We show, contrary to a widely-held belief that discriminative classifiers are almost always to be preferred, that there can often be two distinct regimes of performance as the training set size is increased, one in which each algorithm does better. This stems from the observation|which is borne out in repeated experiments|that while discriminative learning has lower asymptotic error, a generative classifier may also approach its (higher) asymptotic error much faster.
Generative classifiers learn a model of the joint probability, p (x; y), of the inputs x and the label y, and make their predictions by using Bayes rules to calculate [math]p (y \ vert x) [/math], and then picking the most likely label y. Discriminative classifiers model the posterior p (yjx) directly, or learn a direct map from inputs x to the class labels. There are several compelling reasons for using discriminative rather than generative classifiers, one of which, succinctly articulated by Vapnik , is that \ one should solve the [classification] problem directly and never solve a more general problem as an intermediate step [such as modeling [math]p (x \ vert y) [/math]]. " Indeed, leaving aside computational issues and matters such as handling missing data, the prevailing consensus seems to be that discriminative classifiers are almost always to be preferred to generative ones.
Another piece of prevailing folk wisdom is that the number of examples needed to t a model is often roughly linear in the number of free parameters of a model. This has its theoretical basis in the observation that for \ many " models, the VC dimension is roughly linear or at most some low-order polynomial in the number of parameters (see, e.g., [1, 3]), and it is known that sample complexity in the discriminative setting is linear in the VC dimension .
In this paper, we study empirically and theoretically the extent to which these beliefs are true. A parametric family of probabilistic models p (x; y) can be t either to optimize the joint likelihood of the inputs and the labels, or t to optimize the conditional likelihood p (yjx), or even t to minimize the 0-1 training error obtained by thresholding [math]p(y \mid x)[/math] to make predictions. Given a classifier hGen t according to the first criterion, and a model h Dis t according to either the second or the third criterion (using the same parametric family of models), we call h Gen and hDis a Generative-Discriminative pair. For example, if [math]p (x \ vert y) [/math] is Gaussian and p (y) is multinomial, then the corresponding Generative-Discriminative pair is Normal Discriminant Analysis and logistic regression. Similarly, for the case of discrete inputs it is also well known that the naive Bayes classifier and logistic regression form a Generative-Discriminative pair [4, 5].
To compare generative and discriminative learning, it seems natural to focus on such pairs. In this paper, we consider the naive Bayes model (for both discrete and continuous inputs) and its discriminative analog, logistic regression / linear classification, and show: (a) The generative model does indeed have a higher asymptotic error (as the number of training examples becomes large) than the discriminative model, but (b) The generative model may also approach its asymptotic error much faster than the discriminative model - possibly with a number of training examples that is only logarithmic, rather than linear, in the number of parameters. This suggests|and our empirical results strongly support|that, as the number of training examples is increased, there can be two distinct regimes of performance, the first in which the generative model has already approached its asymptotic error and is thus doing better, and the second in which the discriminative model approaches its lower asymptotic error and does better.
|2001 OnDiscriminativeVsGenerativeClassifiers||Andrew Y. Ng|
Michael I. Jordan
|On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes||Proceeding of NIPS Conference, 14||http://books.nips.cc/papers/files/nips14/AA28.pdf||2001|
|Author||Andrew Y. Ng + and Michael I. Jordan +|
|journal||Proceeding of NIPS Conference, 14 +|
|title||On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes +|