Conditional Probability Function
A conditional probability function is a multivariate probability function, [math]\displaystyle{ P(X|Y_1,\ldots,Y_n) }[/math], that returns the probability of event [math]\displaystyle{ X }[/math] given a posteriori knowledge of some events [math]\displaystyle{ Y_1,\ldots,Y_n }[/math].
- AKA: Conditional Distribution, Conditional Density, Conditional Likelihood.
- Context:
- Function Input: one or more random variables, along with the corresponding conditioning (guaranteed) events.
- Function Output: a conditional probability value.
- It can be denoted as P(A|B).
- It can (typically) assume that events [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y_1,...,Y_n }[/math] are from the same non-empty Sample Space [math]\displaystyle{ \mathcal{S} }[/math].
- It assumes that the conditioning events [math]\displaystyle{ Y_1,\ldots,Y_n }[/math] are not Impossible Events, i.e. [math]\displaystyle{ P(Y_1,\ldots,Y_n)\gt 0 }[/math].
- It can be calculated as P(A|B) = P(A∩B)/P(B), i.e. the probability that [math]\displaystyle{ X }[/math] occurs given that [math]\displaystyle{ Y_1,\ldots,Y_n }[/math] occur is the probability that they all occur jointly divided by the probability that [math]\displaystyle{ Y_1,\ldots,Y_n }[/math] occur (see the sketch after this list).
- It can be estimated by a Conditional Probability Function Estimation Algorithm.
- It can range from being a Conditional Probability Mass Function to being a Conditional Probability Density Function.
- It can be associated to a Conditional Probability Distribution.
- Example(s):
- a Probability Function that gives the probability of an outcome of a Two-Dice Experiment (such as the sum of the two dice) given knowledge of the outcome of one of the dice rolls.
- a Trained CRF Model.
- Counter-Example(s):
- a Marginal Probability Function.
- a Class Conditional Probability Function.
- a Joint Probability Function, [math]\displaystyle{ \mathrm{P}(X_1,\ldots,X_n) }[/math].
- See: Bayes Rule, Statistical Independence, Independent Outcome Relation.
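The calculation above can be illustrated with the Two-Dice Experiment from the examples. The following is a minimal sketch (the helpers prob and cond_prob are illustrative, not a standard API) that enumerates the 36 equally likely outcomes and computes P(A|B) = P(A∩B)/P(B):

```python
# Minimal sketch: P(A|B) = P(A and B) / P(B) for the two-dice experiment above.
# The event names and helper functions below are illustrative only.
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # all 36 equally likely (d1, d2) pairs

def prob(event):
    """Probability of an event, given as a predicate over (d1, d2) outcomes."""
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B); requires P(B) > 0."""
    p_b = prob(b)
    assert p_b > 0, "the conditioning event must not be an impossible event"
    return prob(lambda o: a(o) and b(o)) / p_b

sum_is_8   = lambda o: o[0] + o[1] == 8  # event X: the two dice sum to 8
first_is_3 = lambda o: o[0] == 3         # event Y: the first die shows 3

print(prob(sum_is_8))                    # 5/36 (marginal probability)
print(cond_prob(sum_is_8, first_is_3))   # 1/6  (probability given one die's outcome)
```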
References
2013
- (Wikipedia, 2013) ⇒ http://en.wikipedia.org/wiki/Conditional_probability
- In probability theory, a conditional probability is the probability that an event will occur, when another event is known to occur or to have occurred. If the events are A and B respectively, this is said to be "the probability of A given B". It is commonly denoted by P(A|B), or sometimes P_B(A). P(A|B) may or may not be equal to P(A), the probability of A. If they are equal, A and B are said to be independent. For example, if a coin is flipped twice, "the outcome of the second flip" is independent of "the outcome of the first flip".
In the Bayesian interpretation of probability, the conditioning event is interpreted as evidence for the conditioned event. That is, P(A) is the probability of A before accounting for evidence E, and P(A|E) is the probability of A having accounted for evidence E.
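As a quick illustration of the independence statement in the quote (assuming a fair coin flipped twice, so all four outcomes are equally likely), the following sketch checks that P(second flip = H | first flip = H) equals the unconditional P(second flip = H):

```python
# Independence check for two fair coin flips (illustrative sketch).
from fractions import Fraction

outcomes = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]  # equally likely

def prob(event):
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_second_h = prob(lambda o: o[1] == "H")                  # P(second = H) = 1/2
p_joint    = prob(lambda o: o[0] == "H" and o[1] == "H")  # P(first = H, second = H) = 1/4
p_cond     = p_joint / prob(lambda o: o[0] == "H")        # P(second = H | first = H) = 1/2

print(p_cond == p_second_h)  # True: the two flips are independent
```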
2011
- (Wikipedia, 2011) http://en.wikipedia.org/wiki/Conditional_probability
- Given two jointly distributed random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], the conditional probability distribution of [math]\displaystyle{ Y }[/math] given [math]\displaystyle{ X }[/math] is the probability distribution of [math]\displaystyle{ Y }[/math] when [math]\displaystyle{ X }[/math] is known to be a particular value. For discrete random variables, the conditional probability mass function of [math]\displaystyle{ Y }[/math] given (the occurrence of) the value [math]\displaystyle{ x }[/math] of [math]\displaystyle{ X }[/math], can be written, using the definition of conditional probability, as: [math]\displaystyle{ p_Y(y\mid X = x)=P(Y = y \mid X = x) = \frac{P(X=x\ \cap Y=y)}{P(X=x)}. }[/math]
As seen from the definition, and due to its occurrence, it is necessary that [math]\displaystyle{ P(X=x) \gt 0. }[/math]
The relation with the probability distribution of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math] is: [math]\displaystyle{ P(Y=y \mid X=x) P(X=x) = P(X=x\ \cap Y=y) = P(X=x \mid Y=y)P(Y=y). }[/math]
Similarly for continuous random variables, the conditional probability density function of [math]\displaystyle{ Y }[/math] given (the occurrence of) the value [math]\displaystyle{ x }[/math] of [math]\displaystyle{ X }[/math], can be written as [math]\displaystyle{ f_Y(y \mid X=x) = \frac{f_{X, Y}(x, y)}{f_X(x)}, }[/math] where [math]\displaystyle{ f_{X,Y}(x, y) }[/math] gives the joint density of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], while [math]\displaystyle{ f_X(x) }[/math] gives the marginal density for [math]\displaystyle{ X }[/math]. Also in this case it is necessary that [math]\displaystyle{ f_X(x)\gt 0 }[/math].
The relation with the probability distribution of [math]\displaystyle{ X }[/math] given [math]\displaystyle{ Y }[/math] is given by: [math]\displaystyle{ f_Y(y \mid X=x)f_X(x) = f_{X,Y}(x, y) = f_X(x \mid Y=y)f_Y(y). }[/math]
The concept of the conditional distribution of a continuous random variable is not as intuitive as it might seem: Borel's paradox shows that conditional probability density functions need not be invariant under coordinate transformations.
If for discrete random variables [math]\displaystyle{ P(Y=y \mid X=x) = P(Y=y) }[/math] for all [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], or for continuous random variables [math]\displaystyle{ f_Y(y \mid X=x) = f_Y(y) }[/math] for all [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], then [math]\displaystyle{ Y }[/math] is said to be independent of [math]\displaystyle{ X }[/math] (and this implies that [math]\displaystyle{ X }[/math] is also independent of [math]\displaystyle{ Y }[/math]).
Seen as a function of [math]\displaystyle{ y }[/math] for given [math]\displaystyle{ x }[/math], [math]\displaystyle{ P(Y=y \mid X=x) }[/math] is a probability and so the sum over all [math]\displaystyle{ y }[/math] (or the integral, if it is a conditional probability density) is 1. Seen as a function of [math]\displaystyle{ x }[/math] for given [math]\displaystyle{ y }[/math], it is a likelihood function, so that the sum over all [math]\displaystyle{ x }[/math] need not be 1.
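To make the discrete formulas above concrete, the following minimal sketch (using a small, made-up joint pmf, so the numbers are purely illustrative) computes [math]\displaystyle{ p_Y(y \mid X=x) }[/math] from the joint, checks that for fixed [math]\displaystyle{ x }[/math] it sums to 1 over [math]\displaystyle{ y }[/math], and shows that as a function of [math]\displaystyle{ x }[/math] for fixed [math]\displaystyle{ y }[/math] (a likelihood) it need not:

```python
# Sketch: a small, made-up joint pmf p(x, y) used to illustrate the formulas above.
from fractions import Fraction

F = Fraction
joint = {  # P(X = x, Y = y); the entries sum to 1
    (0, 0): F(1, 8), (0, 1): F(3, 8),
    (1, 0): F(1, 4), (1, 1): F(1, 4),
}

def p_x(x):            # marginal P(X = x)
    return sum(p for (xx, _), p in joint.items() if xx == x)

def p_y_given_x(y, x):  # P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)
    assert p_x(x) > 0
    return joint.get((x, y), F(0)) / p_x(x)

# For fixed x, the conditional pmf sums to 1 over y ...
print(sum(p_y_given_x(y, 0) for y in (0, 1)))  # 1
# ... but as a function of x for fixed y (a likelihood), it need not sum to 1.
print(sum(p_y_given_x(1, x) for x in (0, 1)))  # 3/4 + 1/2 = 5/4
```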
2007
- (WordNet, 2007) ⇒ http://wordnet.princeton.edu/perl/webwn?s=conditional%20probability
- the probability that an event will occur given that one or more other events have occurred
2004
- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). "The Trade-off Between Generative and Discriminative Classifiers." In: Proceedings of COMPSTAT 2004.
- QUOTE: In supervised classification, inputs [math]\displaystyle{ x }[/math] and their labels [math]\displaystyle{ y }[/math] arise from an unknown joint probability p(x,y). If we can approximate p(x,y) using a parametric family of models [math]\displaystyle{ G }[/math] = {pθ(x,y),θ ∈ Θ}, then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.
However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]\displaystyle{ p(y \vert x) }[/math].
Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well-known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than their discriminative counterparts.
Conversely, the generative approach converges to the best model for the joint distribution p(x,y) but the resulting conditional density is usually a biased classifier unless its pθ(x) part is an accurate model for p(x). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density p(x, y) [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
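The contrast in the quote can be made concrete with a toy example. Below is a minimal sketch (using made-up binary data; all names are illustrative) that estimates p(y | x) two ways: generatively, by fitting a naive-Bayes model of the joint p(x, y) and then conditioning, and discriminatively, by estimating the conditional frequencies directly (a simple stand-in for a model such as logistic regression). The two estimates differ because the generative route commits to a factorized model of p(x | y):

```python
# Toy sketch (made-up data, illustrative only): two routes to p(y | x).
from fractions import Fraction

F = Fraction
# (x1, x2, y) samples; x1 and x2 are binary features, y is the class label.
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), (1, 1, 1), (0, 1, 1)]
n = len(data)

def nb_joint(x1, x2, y):
    """Generative route: naive-Bayes estimate p(y) * p(x1|y) * p(x2|y) of the joint."""
    n_y = sum(1 for a, b, c in data if c == y)
    p_y = F(n_y, n)
    p_x1 = F(sum(1 for a, b, c in data if c == y and a == x1), n_y)
    p_x2 = F(sum(1 for a, b, c in data if c == y and b == x2), n_y)
    return p_y * p_x1 * p_x2

def generative_posterior(x1, x2, y):
    """Condition the modelled joint: p(y | x) = p(x, y) / sum over y' of p(x, y')."""
    total = sum(nb_joint(x1, x2, c) for c in (0, 1))
    return nb_joint(x1, x2, y) / total

def discriminative_posterior(x1, x2, y):
    """Discriminative route: estimate p(y | x) directly from the conditional frequencies."""
    matching = [c for a, b, c in data if (a, b) == (x1, x2)]
    return F(sum(1 for c in matching if c == y), len(matching))

print(generative_posterior(1, 1, 1))      # 6/7: differs because of the naive-Bayes factorization of p(x | y)
print(discriminative_posterior(1, 1, 1))  # 1: the directly estimated conditional frequency
```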