A conditional probability function is a multivariate probability function that returns the conditional probability value for an event A∈S1, given a posteriori knowledge of some event B∈S2.
References
- http://wordnet.princeton.edu/perl/webwn?s=conditional%20probability
- the probability that an event will occur given that one or more other events have occurred
- (Wikipedia, 2009) http://en.wikipedia.org/wiki/Conditional_probability_distribution
- Given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X (written "Y | X") is the probability distribution of Y when X is known to be a particular value.
- For discrete random variables, the conditional probability mass function of Y given X = x can be written as P(Y = y | X = x). From the definition of conditional probability, this is
- P(Y = y \mid X = x) = \frac{P(X=x\ \cap Y=y)}{P(X=x)}= \frac{P(X = x \mid Y = y) P(Y = y)}{P(X = x)}.
- Similarly, for continuous random variables, the conditional probability density function of Y given X = x can be written as p_{Y|X}(y | x), and this is
- p_{Y \mid X}(y \mid x) = \frac{p_{X, Y}(x, y)}{p_X(x)}= \frac{p_{X \mid Y}(x \mid y)p_Y(y)}{p_X(x)},
- where p_{X,Y}(x, y) gives the joint distribution of X and Y, while p_X(x) gives the marginal distribution for X.
- (Wikipedia, 2009) http://en.wikipedia.org/wiki/Conditional_probability
- Conditional probability is the probability of some event A, given the occurrence of some other event B. Conditional probability is written P(A|B), and is read "the probability of A, given B".
- Joint probability is the probability of two events in conjunction. That is, it is the probability of both events together. The joint probability of A and B is written P(A \cap B) or P(A, B).
- Marginal probability is then the unconditional probability P(A) of the event A; that is, the probability of A, regardless of whether event B did or did not occur. If B can be thought of as the event of a random variable X having a given outcome, the marginal probability of A can be obtained by summing (or integrating, more generally) the joint probabilities over all outcomes for X. For example, if there are two possible outcomes for X with corresponding events B and B', this means that P(A) = P(A \cap B) + P(A \cap B'). This is called marginalization.
- http://www.sci.tamucc.edu/~eyoung/1351/prob_vocabulary.html
- the probability of event A happening after event B has already happened and changed the sample space
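The quoted definitions of joint, marginal, and conditional probability can be checked on a small example. The following is a minimal Python sketch, assuming a made-up joint distribution over two binary variables; the table values and the function names `marginal_x`, `marginal_y`, and `conditional_y_given_x` are illustrative and do not come from any of the cited sources. It obtains a marginal by summing the joint (marginalization), divides the joint by that marginal to get a conditional, and confirms that the Bayes form gives the same value.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X = x, Y = y); the values are made up
# for illustration and sum to 1.
joint = {
    ("x0", "y0"): Fraction(1, 10),
    ("x0", "y1"): Fraction(3, 10),
    ("x1", "y0"): Fraction(2, 10),
    ("x1", "y1"): Fraction(4, 10),
}


def marginal_x(joint, x):
    """P(X = x): sum the joint probabilities over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)


def marginal_y(joint, y):
    """P(Y = y): sum the joint probabilities over every value of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)


def conditional_y_given_x(joint, y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    px = marginal_x(joint, x)
    if px == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return joint.get((x, y), Fraction(0)) / px


def conditional_x_given_y(joint, x, y):
    """P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)."""
    py = marginal_y(joint, y)
    if py == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return joint.get((x, y), Fraction(0)) / py


# Direct form: joint divided by the marginal of the conditioning variable.
direct = conditional_y_given_x(joint, "y1", "x0")

# Bayes form: P(X = x | Y = y) P(Y = y) / P(X = x).
bayes = (
    conditional_x_given_y(joint, "x0", "y1")
    * marginal_y(joint, "y1")
    / marginal_x(joint, "x0")
)

print(direct, bayes)
```

Exact fractions are used here so that the two forms can be compared with `==` rather than with a floating-point tolerance.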
2004
- (Bouchard & Triggs, 2004) => Guillaume Bouchard, and Bill Triggs. (2004). "The Trade-off Between Generative and Discriminative Classifiers." In: Proceedings of COMPSTAT 2004.
- Quote:
- In supervised classification, inputs x and their labels y arise from an unknown joint probability p(x,y). If we can approximate p(x,y) using a parametric family of models G = {pθ(x,y),θ ∈ Θ}, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.
- However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density p(y|x). Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than.
- Conversely, the generative approach converges to the best model for the joint distribution p(x,y) but the resulting conditional density is usually a biased classifier unless its pθ(x) part is an accurate model for p(x). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density p(x, y) [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
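The generative route described in the quote (estimate class-conditional densities, then pick the class with the highest posterior) can be sketched in a few lines of Python. This is an illustration only, not the authors' method or experimental setup: it assumes one-dimensional inputs, a Gaussian class-conditional density per class, and a tiny made-up data set, and the names `fit_gaussian_classes`, `posterior`, and `classify` are hypothetical.

```python
import math


def fit_gaussian_classes(xs, ys):
    """Estimate (prior, mean, variance) per class from 1-D labelled data.

    Assumes every class has at least two distinct points, so that the
    estimated variance is non-zero.
    """
    params = {}
    for label in set(ys):
        pts = [x for x, y in zip(xs, ys) if y == label]
        mean = sum(pts) / len(pts)
        var = sum((x - mean) ** 2 for x in pts) / len(pts)
        params[label] = (len(pts) / len(xs), mean, var)
    return params


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior(params, x):
    """p(y | x) = p(x | y) p(y) / sum over y' of p(x | y') p(y')."""
    joint = {
        y: prior * gaussian_pdf(x, mean, var)
        for y, (prior, mean, var) in params.items()
    }
    evidence = sum(joint.values())  # p(x), the marginal over classes
    return {y: p / evidence for y, p in joint.items()}


def classify(params, x):
    """Return the class with the highest posterior probability."""
    post = posterior(params, x)
    return max(post, key=post.get)


# Made-up training data: class "a" clusters near 1, class "b" near 6.
xs = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
ys = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_classes(xs, ys)

print(classify(params, 1.5), classify(params, 5.5))
```

A discriminative method would instead fit p(y|x) directly (for example with logistic regression) and never estimate the class-conditional densities or p(x).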