Joint Probability Function

A joint probability function is a multivariate probability function that assigns a probability to two or more random experiment events occurring together.





  • (Bouchard & Triggs, 2004) ⇒ Guillaume Bouchard, and Bill Triggs. (2004). “The Trade-off Between Generative and Discriminative Classifiers.” In: Proceedings of COMPSTAT 2004.
    • QUOTE: In supervised classification, inputs [math]\displaystyle{ x }[/math] and their labels [math]\displaystyle{ y }[/math] arise from an unknown joint probability p(x,y). If we can approximate p(x,y) using a parametric family of models [math]\displaystyle{ G = \{p_\theta(x,y), \theta \in \Theta\} }[/math], then a natural classifier is obtained by first estimating the class-conditional densities, then classifying each new data point to the class with highest posterior probability. This approach is called generative classification.

      However, if the overall goal is to find the classification rule with the smallest error rate, this depends only on the conditional density [math]\displaystyle{ p(y \vert x) }[/math]. Discriminative methods directly model the conditional distribution, without assuming anything about the input distribution p(x). Well known generative-discriminative pairs include Linear Discriminant Analysis (LDA) vs. Linear logistic regression and naive Bayes vs. Generalized Additive Models (GAM). Many authors have already studied these models e.g. [5,6]. Under the assumption that the underlying distributions are Gaussian with equal covariances, it is known that LDA requires less data than its discriminative counterpart, linear logistic regression [3]. More generally, it is known that generative classifiers have a smaller variance than their discriminative counterparts.

      Conversely, the generative approach converges to the best model for the joint distribution p(x,y) but the resulting conditional density is usually a biased classifier unless its [math]\displaystyle{ p_\theta(x) }[/math] part is an accurate model for p(x). In real world problems the assumed generative model is rarely exact, and asymptotically, a discriminative classifier should typically be preferred [9, 5]. The key argument is that the discriminative estimator converges to the conditional density that minimizes the negative log-likelihood classification loss against the true density p(x, y) [2]. For finite sample sizes, there is a bias-variance tradeoff and it is less obvious how to choose between generative and discriminative classifiers.
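
The generative recipe described in the quote can be illustrated with a minimal Python sketch. It assumes a small discrete input space and made-up toy data (the `data` list and all function names are hypothetical and do not come from Bouchard & Triggs), and it uses a plain frequency table in place of a parametric family: the joint p(x,y) is estimated by counting, the posterior p(y|x) is obtained by dividing by p(x), and a new input is assigned to the class with the highest posterior.

```python
from collections import Counter


def estimate_joint(pairs):
    """Estimate p(x, y) by relative frequency over observed (x, y) pairs."""
    counts = Counter(pairs)
    total = len(pairs)
    return {xy: c / total for xy, c in counts.items()}


def posterior(joint, x):
    """Return p(y | x) = p(x, y) / p(x), with p(x) = sum over y of p(x, y)."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)
    if px == 0:
        raise ValueError(f"input {x!r} has zero estimated probability")
    return {y: p / px for (xi, y), p in joint.items() if xi == x}


def classify(joint, x):
    """Generative rule: pick the label with the highest posterior p(y | x)."""
    post = posterior(joint, x)
    return max(post, key=post.get)


# Hypothetical toy data: (input, label) pairs.
data = [
    ("sunny", "play"), ("sunny", "play"), ("sunny", "stay"),
    ("rainy", "stay"), ("rainy", "stay"), ("rainy", "play"),
]
joint = estimate_joint(data)
print(classify(joint, "sunny"))
print(posterior(joint, "rainy"))
```

With these toy counts the estimated joint puts probability 2/6 on ("sunny", "play") and 1/6 on ("sunny", "stay"), so "sunny" is classified as "play". A discriminative method would instead fit p(y|x) directly and never estimate p(x).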




  • (Larsen & Marx, 1986) ⇒ Richard J. Larsen, and Morris L. Marx. (1986). “An Introduction to Mathematical Statistics and Its Applications, 2nd edition.” Prentice Hall.
    • Definition 3.3.1.
      • (a) Suppose that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are two discrete random variables defined on the same sample space [math]\displaystyle{ S }[/math]. The joint probability density function of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] (or joint pdf) is denoted [math]\displaystyle{ f_{X,Y}(x,y) }[/math], where
        • [math]\displaystyle{ f_{X,Y}(x,y) = P(\{s \in S \mid X(s) = x, Y(s) = y\}) }[/math]
        • [math]\displaystyle{ f_{X,Y}(x,y) = P(X = x, Y = y) }[/math]
      • (b) Suppose that [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are two continuous random variables defined over the same sample space [math]\displaystyle{ S }[/math]. The joint pdf of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math], [math]\displaystyle{ f_{X,Y}(x,y) }[/math], is the surface having the property that for any region [math]\displaystyle{ R }[/math] in the xy-plane,
        • [math]\displaystyle{ P((X,Y) \in R) = P(\{s \in S \mid (X(s), Y(s)) \in R\}) }[/math]
        • [math]\displaystyle{ P((X,Y) \in R) = \iint_R f_{X,Y}(x,y) \, dx \, dy }[/math].
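    • Definition 3.3.2. Let [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] be two random variables defined on the same sample space [math]\displaystyle{ S }[/math]. The joint cumulative distribution function (or joint cdf) of [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is denoted [math]\displaystyle{ F_{X,Y}(x,y) }[/math], where
      • [math]\displaystyle{ F_{X,Y}(x,y) = P(\{s \in S \mid X(s) \leq x \text{ and } Y(s) \leq y\}) }[/math]
      • [math]\displaystyle{ F_{X,Y}(x,y) = P(X \leq x, Y \leq y) }[/math].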
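
For the discrete case, the two definitions above can be checked by brute-force enumeration of a sample space. The following is only a sketch under an assumed example that is not taken from Larsen & Marx: S is the set of ordered outcomes of two fair six-sided dice, X is the first die and Y is the larger of the two. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Assumed sample space S: ordered outcomes of two fair six-sided dice.
S = list(product(range(1, 7), repeat=2))
P_OUTCOME = Fraction(1, len(S))  # every outcome s in S is equally likely


def X(s):
    """Random variable X: value of the first die."""
    return s[0]


def Y(s):
    """Random variable Y: larger of the two dice."""
    return max(s)


def joint_pmf(x, y):
    """f_{X,Y}(x, y) = P({s in S | X(s) = x, Y(s) = y})."""
    return sum((P_OUTCOME for s in S if X(s) == x and Y(s) == y), Fraction(0))


def joint_cdf(x, y):
    """F_{X,Y}(x, y) = P({s in S | X(s) <= x and Y(s) <= y})."""
    return sum((P_OUTCOME for s in S if X(s) <= x and Y(s) <= y), Fraction(0))


print(joint_pmf(3, 3))  # first die is 3 and the maximum is 3
print(joint_cdf(2, 3))  # first die at most 2 and the maximum at most 3
```

Using `Fraction` keeps the probabilities exact, so the joint pdf values can be summed over all (x, y) pairs and compared with 1 without rounding error. The continuous case in part (b) replaces these sums with a double integral over the region R.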