# Joint Probability Function

A Joint Probability Function is a multivariate probability function that gives the probability of two or more random experiment events occurring together.

**AKA:** [math]\displaystyle{ \mathrm{P}(X_1,\ldots,X_n) }[/math]; Joint PDF, Co-Occurrence Probability.

**Context:**
- It can range from being a Joint Probability Function Structure/Joint Probability Function Instance (e.g. an estimated conditional probability via joint probability estimation) to being an Abstract Joint Probability Distribution.
- It can range from being a Joint Probability Density Function to being a Joint Probability Mass Function.
- It can define a Joint Cumulative Distribution Function, whether the random variables are Discrete Random Variables or Continuous Random Variables.

**Example(s):**
- for a Two Dice Experiment:
  - [math]\displaystyle{ P(\text{dice}_a=1, \text{dice}_b=1) }[/math].
  - [math]\displaystyle{ P(\text{dice}_a\gt 3, \text{dice}_b\lt 3) }[/math].

- …

**Counter-Example(s):**
- a Conditional Probability Function, [math]\displaystyle{ P(A \vert B) }[/math].
- a Marginal Probability Function.

**See:** Prior Probability, Bivariate Distribution, Multivariate Distribution.
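
The Two Dice Experiment probabilities listed under Example(s) can be checked by direct enumeration. The sketch below assumes two fair, independent six-sided dice (an assumption, since the entry does not state the dice model); the names `joint` and `prob` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# Joint PMF of two dice; fairness and independence are assumed here.
faces = range(1, 7)
joint = {(a, b): Fraction(1, 36) for a, b in product(faces, faces)}

def prob(event):
    """Probability of the set of outcomes (a, b) for which event(a, b) holds."""
    return sum(p for (a, b), p in joint.items() if event(a, b))

p_both_one = prob(lambda a, b: a == 1 and b == 1)  # P(dice_a=1, dice_b=1)
p_high_low = prob(lambda a, b: a > 3 and b < 3)    # P(dice_a>3, dice_b<3)

print(p_both_one)  # 1/36
print(p_high_low)  # 1/6
```

Exact fractions are used so that the results can be compared with hand calculations without rounding.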

## References

### 2015

- (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/joint_probability_distribution Retrieved:2015-2-10.
- In the study of probability, given at least two random variables *X*, *Y*, ..., that are defined on a probability space, the **joint probability distribution** for *X*, *Y*, ... is a probability distribution that gives the probability that each of *X*, *Y*, ... falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a **bivariate distribution**, but the concept generalizes to any number of random variables, giving a multivariate distribution.
- The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables). These in turn can be used to find two other types of distributions: the marginal distribution giving the probabilities for any one of the variables with no reference to any specific ranges of values for the other variables, and the conditional probability distribution giving the probabilities for any subset of the variables conditional on particular values of the remaining variables.
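
For the discrete case, the step from a joint probability mass function to a marginal and a conditional distribution can be sketched as follows. The joint table below is hypothetical (its values are made up for illustration), and the function names are not taken from the quoted source.

```python
from fractions import Fraction

# Hypothetical joint PMF over X in {0, 1} and Y in {"a", "b"}; values are made up.
joint = {
    (0, "a"): Fraction(1, 8),
    (0, "b"): Fraction(3, 8),
    (1, "a"): Fraction(1, 4),
    (1, "b"): Fraction(1, 4),
}

def marginal_x(x):
    """P(X = x): sum the joint over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def conditional_y_given_x(y, x):
    """P(Y = y | X = x) = P(X = x, Y = y) / P(X = x)."""
    return joint[(x, y)] / marginal_x(x)

print(marginal_x(0))                  # 1/2
print(conditional_y_given_x("b", 0))  # 3/4
```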


### 2009

- (Wordnet, 2009) ⇒ http://wordnet.princeton.edu/perl/webwn?s=joint%20probability
- the probability of two events occurring together

### 2004

- (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
- QUOTE: In supervised classification, inputs [math]\displaystyle{ x }[/math] and their labels [math]\displaystyle{ y }[/math] arise from an unknown joint probability [math]\displaystyle{ p(x,y) }[/math]. If we can approximate [math]\displaystyle{ p(x,y) }[/math] using a parametric family of models [math]\displaystyle{ G = \{p_\theta(x,y), \theta \in \Theta\} }[/math], then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called *generative* classification.
- However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]\displaystyle{ p(y \vert x) }[/math]. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution [math]\displaystyle{ p(x) }[/math]. Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than.
- Conversely, the generative approach converges to the best model for the joint distribution [math]\displaystyle{ p(x,y) }[/math] but the resulting conditional density is usually a biased classifier unless its [math]\displaystyle{ p_\theta(x) }[/math] part is an accurate model for [math]\displaystyle{ p(x) }[/math]. In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density [math]\displaystyle{ p(x, y) }[/math] [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
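
The generative route described in the quote (model the joint, then classify by the highest posterior) can be illustrated with a toy example. This is a minimal sketch and not the authors' method: it assumes a single discrete input, estimates [math]\displaystyle{ p(x,y) }[/math] directly by relative frequency rather than through a parametric family, and uses made-up data.

```python
from collections import Counter

# Hypothetical labelled data: (x, y) pairs with a discrete input x and label y.
data = [("red", "apple"), ("red", "apple"), ("green", "apple"),
        ("green", "pear"), ("green", "pear"), ("yellow", "pear")]

# Generative step: estimate the joint p(x, y) by relative frequency.
counts = Counter(data)
n = len(data)
joint = {xy: c / n for xy, c in counts.items()}
labels = sorted({y for _, y in data})

def posterior(x):
    """p(y | x) = p(x, y) / sum over y' of p(x, y'), from the estimated joint.

    Raises ZeroDivisionError for an input x never seen in the data.
    """
    p_x = sum(joint.get((x, y), 0.0) for y in labels)
    return {y: joint.get((x, y), 0.0) / p_x for y in labels}

def classify(x):
    """Pick the label with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)

print(classify("red"))    # apple
print(classify("green"))  # pear
```

A discriminative method would instead fit [math]\displaystyle{ p(y \vert x) }[/math] directly and never estimate the distribution of the inputs.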


### 1998

- (Murphy, 1998) ⇒ Kevin P. Murphy. (1998). “A Brief Introduction to Graphical Models and Bayesian Networks." Web tutorial.
- QUOTE: A graphical model specifies a complete joint probability distribution (JPD) over all the variables. Given the JPD, we can answer all possible inference queries by marginalization (summing out over irrelevant variables), as illustrated in the introduction. However, the JPD has size [math]\displaystyle{ O(2^n) }[/math], where n is the number of nodes, and we have assumed each node can have 2 states. Hence summing over the JPD takes exponential time. We now discuss more efficient methods.
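
The cost noted in the quote can be made concrete with a brute-force sketch: the full table has [math]\displaystyle{ 2^n }[/math] entries, and a marginal is obtained by scanning all of them. The table contents here are hypothetical (independent binary nodes with made-up biases), chosen only so the result is easy to verify.

```python
from itertools import product

n = 4  # number of binary nodes (hypothetical)

# Hypothetical JPD: independent nodes with P(node_i = 1) = bias[i], stored as
# a full table so that the 2**n size is explicit.
bias = [0.1, 0.5, 0.7, 0.9]

def weight(state):
    """Probability of one complete assignment of the n nodes."""
    w = 1.0
    for bit, b in zip(state, bias):
        w *= b if bit else 1.0 - b
    return w

jpd = {state: weight(state) for state in product((0, 1), repeat=n)}

def marginal(index, value):
    """P(node_index = value), summing out every other node by brute force."""
    return sum(p for state, p in jpd.items() if state[index] == value)

print(len(jpd))  # 16
```

Each call to `marginal` visits every entry of the table, which is why this approach stops scaling as `n` grows.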

### 1987

- (Hogg & Ledolter, 1987) ⇒ Robert V. Hogg, and Johannes Ledolter. (1987). “Engineering Statistics." Macmillan Publishing. ISBN:0023557907
- QUOTE: Multivariate Distributions: … We start our discussion by considering the probabilities that are associated with two random variables, [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math]. We call the probability function [math]\displaystyle{ f(x,y) = P(X=x, Y=y), (x,y) \in \Re }[/math], where [math]\displaystyle{ \Re }[/math] is the space of ([math]\displaystyle{ X,Y }[/math]), the joint probability density function of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] or simply the joint density of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math].

### 1986

- (Larsen & Marx, 1986) ⇒ Richard J. Larsen, and Morris L. Marx. (1986). “An Introduction to Mathematical Statistics and Its Applications, 2nd edition." Prentice Hall
- **Definition 3.3.1.**
- (a) Suppose that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are two discrete random variables defined on the same sample space *S*. The **joint probability density function of X and Y** (or *joint pdf*) is defined [math]\displaystyle{ f_{X,Y}(x,y) }[/math], where [math]\displaystyle{ f_{X,Y}(x,y) = P(\{s \in S \mid X(s) = x, Y(s) = y\}) = P(X = x, Y = y) }[/math].

- (b) Suppose that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are two continuous random variables defined over the same sample space *S*. The joint pdf of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], [math]\displaystyle{ f_{X,Y}(x,y) }[/math], is the surface having the property that for any region [math]\displaystyle{ R }[/math] in the *xy*-plane, …

- **Definition 3.3.2.** Let [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] be two random variables defined on the same sample space *S*. The *joint cumulative distribution function* (or *joint cdf*) of X and Y is defined [math]\displaystyle{ F_{X,Y}(x,y) }[/math], where [math]\displaystyle{ F_{X,Y}(x,y) = P(\{s \in S \mid X(s) \le x \text{ and } Y(s) \le y\}) = P(X \le x, Y \le y) }[/math].
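
The discrete joint pdf and the joint cdf defined in terms of a sample space can be evaluated literally by treating random variables as functions on *S*. The sketch below is an illustration under assumptions not found in the quoted text: *S* is the set of outcomes of two fair dice, and the variables `X` (sum of the faces) and `Y` (larger face) are chosen only as examples.

```python
from fractions import Fraction
from itertools import product

# Sample space S: ordered outcomes of two fair dice, each with probability 1/36.
S = list(product(range(1, 7), repeat=2))
P = {s: Fraction(1, 36) for s in S}

# Two random variables on S (chosen here for illustration).
def X(s):
    return s[0] + s[1]  # sum of the two faces

def Y(s):
    return max(s)       # larger of the two faces

def f(x, y):
    """Joint pdf f_{X,Y}(x, y) = P({s in S | X(s) = x, Y(s) = y})."""
    return sum(P[s] for s in S if X(s) == x and Y(s) == y)

def F(x, y):
    """Joint cdf F_{X,Y}(x, y) = P({s in S | X(s) <= x and Y(s) <= y})."""
    return sum(P[s] for s in S if X(s) <= x and Y(s) <= y)

print(f(7, 4))  # 1/18
print(F(4, 2))  # 1/9
```

Unlike two independent dice faces, these two variables are dependent, so the joint values cannot be obtained by multiplying marginals.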