# Discriminative Learning Algorithm

A Discriminative Learning Algorithm is a probabilistic learning algorithm that produces a predictive model (a discriminatively trained model) by directly estimating the conditional (a posteriori) probability of the target attribute given the predictor variables.

**AKA:** Distribution-free ML Algorithm, Discriminative Training.

**Context:**
- It can range from (typically) being a Discriminative Classification Algorithm to being a Discriminative Estimation Algorithm.
- It can (typically) involve simpler Parameter Estimation and make fewer assumptions than a Generative Algorithm.
- It does not result in Probability Functions (but Weights).

**Example(s):**

**Counter-Example(s):**

**See:** Discriminative Model Inferencing Algorithm; Generative References.

## References

### 2011

- (Sammut & Webb, 2011) ⇒ Claude Sammut (editor), and Geoffrey I. Webb (editor). (2011). “Discriminative Learning.” In: (Sammut & Webb, 2011).
**Discriminative Learning - Definition:** Discriminative learning refers to any classification learning process that classifies by using a model or estimate of the probability [math]P(y|x)[/math] without reference to an explicit estimate of any of [math]P(x)[/math], [math]P(y, x)[/math], or [math]P(x|y)[/math], where [math]y[/math] is a class and [math]x[/math] is a description of an object to be classified. Discriminative learning contrasts to **generative learning**, which classifies by using an estimate of the joint probability [math]P(y, x)[/math] or of the prior probability [math]P(y)[/math] and the conditional probability [math]P(x|y)[/math]. It is also common to categorize as discriminative any approaches that are directly based on a decision risk function (such as Support Vector Machines, Artificial Neural Networks, and Decision Trees), where the decision risk is minimized without estimation of [math]P(x)[/math], [math]P(y, x)[/math], or [math]P(x|y)[/math].
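
The contrast in the definition above can be illustrated with a minimal sketch. It is not taken from Sammut & Webb; the toy data set, the function names `fit_generative` and `fit_discriminative`, the Gaussian class-conditional assumption, and the learning-rate and epoch settings are all assumptions made for illustration. The generative route estimates [math]P(y)[/math] and [math]P(x|y)[/math] and then applies Bayes' rule, while the discriminative route fits [math]P(y|x)[/math] directly (here with a logistic model) and never estimates [math]P(x)[/math], [math]P(y, x)[/math], or [math]P(x|y)[/math].

```python
import math

# Toy 1-D training set of (x, y) pairs with y in {0, 1}. Illustrative data only.
DATA = [(-2.0, 0), (-1.5, 0), (-1.0, 0), (-0.5, 0),
        (0.5, 1), (1.0, 1), (1.5, 1), (2.0, 1)]


def fit_generative(data):
    """Estimate P(y) and a Gaussian P(x|y) per class; return a function x -> P(y=1|x)."""
    params = {}
    for label in (0, 1):
        xs = [x for x, y in data if y == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        params[label] = (len(xs) / len(data), mean, var)

    def posterior(x):
        # Bayes' rule: P(y|x) is proportional to P(y) * P(x|y).
        joint = {}
        for label, (prior, mean, var) in params.items():
            density = math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
            joint[label] = prior * density
        return joint[1] / (joint[0] + joint[1])

    return posterior


def fit_discriminative(data, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(w*x + b) by gradient ascent on the conditional log-likelihood."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (y - p) * x
            grad_b += (y - p)
        w += lr * grad_w / len(data)
        b += lr * grad_b / len(data)

    def posterior(x):
        # Only the weights w and b are stored; no model of x itself is ever built.
        return 1.0 / (1.0 + math.exp(-(w * x + b)))

    return posterior
```

Calling `fit_generative(DATA)(1.2)` and `fit_discriminative(DATA)(1.2)` both return an estimate of [math]P(y=1|x=1.2)[/math]; the two estimates generally differ because the models make different assumptions, even though both assign the point to class 1 on this toy data.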

### 2009

- (Wick et al., 2009) ⇒ Michael Wick, Aron Culotta, Khashayar Rohanimanesh, and Andrew McCallum. (2009). “An Entity Based Model for Coreference Resolution.” In: Proceedings of the SIAM International Conference on Data Mining (SDM 2009).
- Statistical approaches to coreference resolution can be broadly placed into two categories: generative models, which model the joint probability, and **discriminative models** that model the conditional probability. These models can be either supervised (uses labeled coreference data for learning) or unsupervised (no labeled data is used). Our model falls into the category of discriminative and supervised.

### 2005

- (Minka, 2005) ⇒ Thomas P. Minka. (2005). “Discriminative Models, not Discriminative Training.” Technical Report MSR-TR-2005-144, Microsoft Research.
- QUOTE: By taking this view, you have a consistent approach to statistical inference: you always model all variables, and you always use joint likelihood. The only thing that changes is the model. You can also see clearly why discriminative training might work better than generative training. It must be because a model of the form (5) fits the data better than (1). In particular, (5) is necessarily more flexible than (1), because it removes the implicit constraint that [math]\theta=\theta'[/math]. Removing constraints reduces the statistical bias, at the cost of greater parameter uncertainty.

### 2002

- (Collins, 2002b) ⇒ Michael Collins. (2002). “Discriminative Training Methods for Hidden Markov Models: Theory and experiments with the perceptron algorithm.” In: Proceedings of the ACL Conference on Empirical Methods in Natural Language Processing (EMNLP 2002). doi:10.3115/1118693.1118694
**Abstract:** We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.
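
The two ingredients named in the abstract, Viterbi decoding of each training example followed by a simple additive update, can be sketched roughly as follows. This is a simplified illustration and not Collins's implementation: the feature set (word-tag emission and tag-tag transition indicators), the `<s>` start symbol, the toy sentences, and the function names are assumptions, and parameter averaging is omitted.

```python
from collections import defaultdict


def viterbi(words, tags, weights):
    """Return the highest-scoring tag sequence under emission and transition weights."""
    def w(feature):
        return weights.get(feature, 0.0)

    # best[i][t] = (score of the best path ending in tag t at position i, previous tag)
    best = [{t: (w(("emit", words[0], t)) + w(("trans", "<s>", t)), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0] + w(("trans", p, t)))
            score = best[i - 1][prev][0] + w(("trans", prev, t)) + w(("emit", words[i], t))
            column[t] = (score, prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


def features(words, tag_seq):
    """Count emission and transition features for one tagged sentence."""
    counts = defaultdict(int)
    prev = "<s>"
    for word, tag in zip(words, tag_seq):
        counts[("emit", word, tag)] += 1
        counts[("trans", prev, tag)] += 1
        prev = tag
    return counts


def train(examples, tags, epochs=5):
    """Perceptron training: decode, then add gold features and subtract predicted ones."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for words, gold in examples:
            predicted = viterbi(words, tags, weights)
            if predicted != gold:
                for feature, count in features(words, gold).items():
                    weights[feature] += count
                for feature, count in features(words, predicted).items():
                    weights[feature] -= count
    return weights


# Hypothetical toy corpus: D = determiner, N = noun, V = verb.
TAGS = ["D", "N", "V"]
EXAMPLES = [
    (["the", "dog", "barks"], ["D", "N", "V"]),
    (["a", "cat", "sleeps"], ["D", "N", "V"]),
]
```

After `weights = train(EXAMPLES, TAGS)`, calling `viterbi(words, TAGS, weights)` on either training sentence reproduces its gold tags. The learned model is a table of feature weights rather than a set of probability estimates, which is the sense in which the training is discriminative.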