# GM-RKBβ

## Quotes

### Abstract

Given any generative classifier based on an inexact density model, we can define a discriminative counterpart that reduces its asymptotic error rate. We introduce a family of classifiers that interpolate the two approaches, thus providing a new way to compare them and giving an estimation procedure whose classification performance is well balanced between the bias of generative classifiers and the variance of discriminative ones. We show that an intermediate trade-off between the two strategies is often preferable, both theoretically and in experiments on real data.

### Introduction

In supervised classification, inputs $x$ and their labels $y$ arise from an unknown joint probability $p(x,y)$. If we can approximate $p(x,y)$ using a parametric family of models $G = \{p_θ(x,y),\theta \in \Theta\}$, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.

However, if the overall goal is to find the classification rule with the smallest error rate, the optimal rule depends only on the conditional density $p(y \vert x)$. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution $p(x)$. Well-known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAMs). Many authors have already studied these pairs, e.g. [5, 6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than discriminative ones.
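To make the generative side of such a pair concrete, the following minimal sketch fits an LDA-style generative classifier by maximum likelihood (class priors, class means, pooled covariance) and classifies by the highest posterior $p(y \vert x) \propto p(x, y)$. The data, means, and seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-class data: Gaussian class conditionals with shared covariance,
# the setting in which LDA is an exactly specified generative model.
n = 200
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n, 2))
X1 = rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n, 2))
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

# Generative fit: ML estimates of priors, class means, and a pooled covariance.
priors = np.array([np.mean(y == k) for k in (0, 1)])
means = np.array([X[y == k].mean(axis=0) for k in (0, 1)])
pooled = sum((X[y == k] - means[k]).T @ (X[y == k] - means[k])
             for k in (0, 1)) / len(y)
prec = np.linalg.inv(pooled)

def log_joint(X, k):
    """log p(x, y=k), up to an additive constant shared by both classes."""
    d = X - means[k]
    return np.log(priors[k]) - 0.5 * np.sum(d @ prec * d, axis=1)

# Classify each point to the class with the highest posterior probability.
y_hat = (log_joint(X, 1) > log_joint(X, 0)).astype(int)
print("training error:", np.mean(y_hat != y))
```

Because the covariance is shared, the resulting decision boundary is linear, which is why LDA pairs naturally with linear logistic regression as its discriminative counterpart.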

Conversely, the generative approach converges to the best model for the joint distribution $p(x,y)$, but the resulting classifier is usually biased unless the marginal $p_\theta(x)$ is an accurate model for $p(x)$. In real-world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density $p(x,y)$ [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.

In this paper, we will first consider the parameter estimation problem, focusing on the theoretical distinction between generative and discriminative classifiers. Then we propose a new technique for combining the two classifiers: the Generative-Discriminative Trade-off (GDT) estimate. It is based on a continuous class of cost functions that interpolate smoothly between the generative strategy and the discriminative one. Our method assumes a joint density based parametrization $p_\theta(x,y)$, but uses this to model the conditional density $p(y \vert x)$. The goal is to find the parameters that maximize classification performance on the underlying population, but we do this by defining a cost function that is intermediate between the joint and the conditional log-likelihoods and optimizing this on training and validation sets.

Given that the generative model based on maximum likelihood (ML) produces minimum-variance, but possibly biased, parameter estimates, while the discriminative one gives the best asymptotic performance, there are good reasons for thinking that an intermediate method such as the GDT estimate should be preferred. We illustrate this on simulations and on real datasets.

### 2 Preliminaries

...

Using independent training samples $\{x_i, y_i\}$, $i = 1, \ldots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, K\}$, sampled from the unknown distribution $p(x,y)$, we aim to find the rule that gives the lowest error rate on new data. This is closely related to estimating the conditional probability $p(y \vert x)$.

For each of the $K$ classes, the class-conditional probability $p(x \vert y=k)$ is modeled by a parametric model $f_k$ with parameters $\theta_k$. ...

...

Generative classifier. Given data ...

...

Discriminative classifier. Let ...

### 3 Between Generative and Discriminative classifiers

To get a natural trade-off between the two approaches, we can introduce a new objective function $L_\lambda$ based on a parameter $\lambda \in [0,1]$ that interpolates continuously between the discriminative and generative objective functions:

...
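The interpolated objective itself is elided above, so the sketch below assumes one natural form, $L_\lambda(\theta) = \sum_i \big[\lambda \log p_\theta(y_i \vert x_i) + (1-\lambda) \log p_\theta(x_i, y_i)\big]$, with $\lambda = 0$ giving the generative (joint) fit and $\lambda = 1$ the discriminative (conditional) fit. The toy model (two 1-D Gaussian classes with unit variance and equal priors, fitted by finite-difference gradient ascent) is an illustrative assumption, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data: two unit-variance Gaussian classes with equal priors.
n = 100
x = np.concatenate([rng.normal(-1, 1, n), rng.normal(1, 1, n)])
y = np.array([0] * n + [1] * n)

def log_joint(mu, x):
    """log p(x, y=k) for k = 0, 1 under equal priors; shape (len(x), 2)."""
    return np.log(0.5) - 0.5 * (x[:, None] - mu[None, :]) ** 2 \
        - 0.5 * np.log(2 * np.pi)

def L(mu, lam):
    """Assumed interpolated objective: lam=0 generative, lam=1 discriminative."""
    lj = log_joint(mu, x)
    joint = lj[np.arange(len(x)), y]                    # log p(x_i, y_i)
    cond = joint - np.logaddexp(lj[:, 0], lj[:, 1])     # log p(y_i | x_i)
    return np.sum(lam * cond + (1 - lam) * joint)

def fit(lam, steps=500, lr=0.1):
    """Maximize L_lambda over the two class means by gradient ascent,
    using central finite differences for the gradient."""
    mu = np.array([0.0, 0.1])
    eps = 1e-5
    for _ in range(steps):
        g = np.array([(L(mu + eps * e, lam) - L(mu - eps * e, lam)) / (2 * eps)
                      for e in np.eye(2)])
        mu = mu + lr * g / len(x)
    return mu

for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: estimated means {fit(lam).round(2)}")
```

At $\lambda = 0$ the solution is the ML class means; at $\lambda = 1$ only the induced decision boundary is constrained, mirroring the bias-variance trade-off the section describes.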

### 4 Simulations

To illustrate the behavior of the GDT method, we study its performance on two synthetic test problems. We define the true distributions of the data as follows: In the first experiment, the class conditional probabilities are Gaussian with identity covariance matrix and mean ...

...
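A minimal version of such a simulation can be set up as follows. The paper's exact class means are elided above, so the means, sample sizes, and seed here are placeholders chosen for illustration; with identity covariances and equal priors, the plug-in generative rule reduces to nearest-estimated-mean classification.

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian class conditionals with identity covariance (placeholder means).
mu = np.array([[-1.0, 0.0], [1.0, 0.0]])

def sample(n):
    y = rng.integers(0, 2, n)
    x = mu[y] + rng.normal(size=(n, 2))
    return x, y

x_tr, y_tr = sample(100)
x_te, y_te = sample(10_000)

# Plug-in generative rule: classify to the nearest estimated class mean
# (the ML rule here, since covariances are identity and priors equal).
mu_hat = np.array([x_tr[y_tr == k].mean(axis=0) for k in (0, 1)])
d = ((x_te[:, None, :] - mu_hat[None, :, :]) ** 2).sum(axis=2)
err = np.mean(d.argmin(axis=1) != y_te)
print("test error:", err)
```

Repeating this over many training draws, and over a grid of $\lambda$ values for the GDT fit, is the kind of comparison the simulation section reports.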

### 5 Experiments

We tried our classification method on some of the publicly available Statlog datasets. In our implementation of the GDT estimates, the parameter dimension is limited due to the size of the optimization problem (7). …

...

### Conclusion

In this study, the relationship between generative and discriminative classifiers has been clarified: they correspond to two different maximizations in the parameter space. By interpolating linearly between the two objective functions, we introduced the GDT estimator. This can be seen either as a less biased version of the discriminative solution, or as an improvement of the generative classifier. The regularization is "natural" in the sense that the parameters are encouraged to fit the inputs. Our preliminary results on real data showed that the intermediate model often gives better classification performance than the discriminative and generative classifiers.

The real interest of the GDT estimate resides in its application to generative models. Probabilistic models already exist in many areas: time series models, mixed models and graphical models, including Markov Random Fields and Hidden Markov Models, are examples of widely used generative models. When class-conditional probabilities are modelled generatively, the GDT estimator should often improve the classification performance.

Currently, the main difficulty with the GDT method is the choice of the tuning parameter, as this requires an expensive cross-validation computation. We believe that more computationally efficient criteria can be developed by analyzing the solutions on the training set, in the spirit of the Bayesian Information Criterion [7].


## References

• [1] Devroye L., Györfi L. & Lugosi G. (1997) A Probabilistic Theory of Pattern Recognition, pp. 270-276. New York: Springer-Verlag.
• [2] Efron B. (1975) The efficiency of logistic regression compared to Normal Discriminant Analysis. Journal of the American Statistical Association, 70:892-898.
• [3] Ng A.Y. & Jordan M.I. (2001) On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Thomas G. Dietterich, S. Becker and Zoubin Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, pp. 609-616. Cambridge, MA: MIT Press.
• [4] Rubinstein Y.D. & Hastie T. (1997) Discriminative vs. informative learning. In: Proceedings of the Third International Conference on Knowledge and Data Mining, pp. 49-53. AAAI Press.
• [5] Schwarz G. (1978) Estimating the dimension of a model. Annals of Statistics 6, pp. 461-464.
• [6] Vapnik V.N. (1998) Statistical Learning Theory. John Wiley & Sons.

• (2004). "The Trade-off Between Generative and Discriminative Classifiers." In: Proceedings of the Symposium on Computational Statistics. http://lear.inrialpes.fr/pubs/2004/BT04/Bouchard-compstat04.pdf