Pointwise Mutual Information (PMI) Measure

A Pointwise Mutual Information (PMI) Measure is a binary random variable measure of association for [math]\displaystyle{ x,y }[/math] based on the ratio between the co-occurrence probability [math]\displaystyle{ p(x,y) }[/math] and the probability of observing [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] together by chance if they were independent, [math]\displaystyle{ p(x)p(y) }[/math].

Context:
- inputs ([math]\displaystyle{ x, y, \mathbf{C} }[/math]):
  - two Multiset Patterns, [math]\displaystyle{ x, y }[/math] (e.g. vocabulary members).
  - a Multiset Set, [math]\displaystyle{ \mathbf{C} }[/math] (e.g. a corpus).
- range: PMI Score.
- definition:
  - [math]\displaystyle{ \operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}. }[/math]
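- It can overestimate the association of patterns with low observed frequency counts (a common variant raises the observed count to a power to increase the influence of the observed frequency).
- It uses Statistical Independence as its baseline: the score is zero when [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] are independent.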
- It can range from being a Shifted PMI Measure??
Example(s):
- Given a 2x2 contingency table [math]\displaystyle{ \begin{array}{c|cc|c} & pattern_2 & \neg pattern_2 & \\ \hline pattern_1 & n_{11} & n_{12} & n_{1p} \\ \neg pattern_1 & n_{21} & n_{22} & n_{2p} \\ \hline & n_{p1} & n_{p2} &n_{pp} \end{array} }[/math]; [math]\displaystyle{ PMI = \log \Bigl( \frac{n_{11}}{m_{11}} \Bigr) }[/math], where [math]\displaystyle{ m_{11} = \frac {n_{p1} n_{1p}}{n_{pp}} }[/math] is the count expected under independence.
- a Positive PMI.
- …
Counter-Example(s):
See: Mutual Information, Specific Mutual Information, Second-Order PMI, Statistical Independence, PMI Matrix.
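
The definition above can be sketched in code. This is a minimal illustration, not a reference implementation: the function name `pmi_from_corpus`, the toy corpus, and the choice to estimate each probability as the fraction of multisets containing the pattern are all assumptions made for the example.

```python
import math
from collections import Counter

def pmi_from_corpus(x, y, corpus, base=2.0):
    """Estimate pmi(x; y) from a corpus given as a list of multisets (Counters).

    Assumption: p(x), p(y) and p(x, y) are estimated as the fraction of
    multisets that contain x, y, or both. Other estimators are possible.
    """
    n = len(corpus)
    n_x = sum(1 for doc in corpus if doc[x] > 0)
    n_y = sum(1 for doc in corpus if doc[y] > 0)
    n_xy = sum(1 for doc in corpus if doc[x] > 0 and doc[y] > 0)
    if n_xy == 0:
        # log(0): the patterns never co-occur in this corpus.
        return float("-inf")
    # log( p(x,y) / (p(x) p(y)) ) = log( n_xy * n / (n_x * n_y) )
    return math.log((n_xy * n) / (n_x * n_y), base)

# Hypothetical toy corpus: each multiset is one short text.
corpus = [Counter(s.split()) for s in [
    "new york city",
    "new york times",
    "new car",
    "old car",
]]
score_new_york = pmi_from_corpus("new", "york", corpus)
score_old_york = pmi_from_corpus("old", "york", corpus)
```

Here "new" and "york" co-occur more often than independence would predict, so the score is positive, while "old" and "york" never co-occur and the estimate is negative infinity.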

References

2016

(Wikipedia, 2016) ⇒ http://en.wikipedia.org/wiki/Pointwise_mutual_information#Definition Retrieved:2016-2-10.
- The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically: [math]\displaystyle{ \operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}. }[/math] The mutual information (MI) of the random variables X and Y is the expected value of the PMI over all possible outcomes (with respect to the joint distribution [math]\displaystyle{ p(x,y) }[/math]).
  The measure is symmetric ([math]\displaystyle{ \operatorname{pmi}(x;y)=\operatorname{pmi}(y;x) }[/math]). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is positive. PMI maximizes when X and Y are perfectly associated (i.e. [math]\displaystyle{ p(x|y) }[/math] or [math]\displaystyle{ p(y|x)=1 }[/math]), yielding the following bounds: [math]\displaystyle{ -\infty \leq \operatorname{pmi}(x;y) \leq \min\left[ -\log p(x), -\log p(y) \right] . }[/math]
  Finally, [math]\displaystyle{ \operatorname{pmi}(x;y) }[/math] will increase if [math]\displaystyle{ p(x|y) }[/math] is fixed but [math]\displaystyle{ p(x) }[/math] decreases.
  Here is an example to illustrate:
  Using this table we can marginalize to get the following additional table for the individual distributions:
  With this example, we can compute four values for [math]\displaystyle{ pmi(x;y) }[/math]. Using base-2 logarithms:
  (For reference, the mutual information [math]\displaystyle{ \operatorname{I}(X;Y) }[/math] would then be 0.214170945)
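
The relationship between PMI and mutual information described in this quote can be checked numerically. The sketch below is illustrative only: the joint probabilities in `joint` are assumed values for two binary variables, chosen so that the resulting mutual information agrees with the figure quoted above.

```python
import math

# Assumed joint distribution p(x, y) over two binary variables (illustrative).
joint = {(0, 0): 0.10, (0, 1): 0.70, (1, 0): 0.15, (1, 1): 0.05}

# Marginalize to get p(x) and p(y).
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# pmi(x; y) = log2( p(x, y) / (p(x) p(y)) ) for each of the four outcomes.
pmi = {(x, y): math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items()}

# Mutual information is the expectation of PMI under the joint distribution.
mi = sum(p * pmi[xy] for xy, p in joint.items())

for xy in sorted(pmi):
    print(xy, round(pmi[xy], 6))
print("I(X;Y) =", round(mi, 9))
```

Two of the four PMI values come out negative and two positive, yet their expectation under the joint distribution is positive, as the quote states.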

2016

(Wikipedia, 2016) ⇒ http://en.wikipedia.org/wiki/Semantic_similarity#Statistical_similarity Retrieved:2016-12-10.
- … PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents

2015

(Levy et al., 2015) ⇒ Omer Levy, Yoav Goldberg, and Ido Dagan. (2015). “Improving Distributional Similarity with Lessons Learned from Word Embeddings.” In: Transactions of the Association for Computational Linguistics, 3.
- QUOTE: … A popular measure of this association is pointwise mutual information (PMI) (Church and Hanks, 1990). PMI is defined as the log ratio between w and c’s joint probability and the product of their marginal probabilities, which can be estimated by: [math]\displaystyle{ PMI(w, c) = \log \frac{\hat{P}(w,c)}{\hat{P}(w) \hat{P}(c)} = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)} }[/math] The rows of [math]\displaystyle{ M^{PMI} }[/math] contain many entries of word-context pairs (w, c) that were never observed in the corpus, for which PMI(w, c) = log 0 = −∞. A common approach is thus to replace the [math]\displaystyle{ M^{PMI} }[/math] matrix with [math]\displaystyle{ M^{PMI}_0 }[/math], in which PMI(w, c) = 0 in cases where #(w, c) = 0. A more consistent approach is to use positive PMI (PPMI), in which all negative values are replaced by 0 ...
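
The count-based estimate and the PPMI clipping described in this quote can be sketched as follows. The function name `ppmi_matrix`, the sparse dictionary representation, and the toy counts are assumptions for illustration; this is not the authors' implementation.

```python
import math
from collections import Counter

def ppmi_matrix(pair_counts):
    """Return a sparse {(w, c): PPMI} map from word-context pair counts.

    Pairs that were never observed are simply absent, which corresponds to
    treating their value as 0 instead of log 0.
    """
    total = sum(pair_counts.values())          # |D|
    w_counts, c_counts = Counter(), Counter()  # #(w) and #(c)
    for (w, c), n in pair_counts.items():
        w_counts[w] += n
        c_counts[c] += n
    ppmi = {}
    for (w, c), n in pair_counts.items():
        # PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) )
        pmi = math.log(n * total / (w_counts[w] * c_counts[c]))
        # PPMI replaces every negative value with 0.
        ppmi[(w, c)] = max(pmi, 0.0)
    return ppmi

# Hypothetical counts of (word, context) pairs.
pairs = Counter({
    ("cat", "purr"): 4, ("cat", "the"): 6,
    ("dog", "bark"): 3, ("dog", "the"): 7,
})
ppmi = ppmi_matrix(pairs)
```

In this toy example the pair ("cat", "the") has a negative PMI because "the" occurs slightly more often with "dog", so its PPMI entry is clipped to 0.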

2011

(Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Pointwise_mutual_information
- Pointwise mutual information (PMI), or specific mutual information, is a measure of association used in information theory and statistics.
  The PMI of a pair of outcomes [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] belonging to discrete random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] quantifies the discrepancy between the probability of their coincidence given their joint distribution and the probability of their coincidence given only their individual distributions, assuming independence. Mathematically:
  [math]\displaystyle{ SI(x,y) = \log\frac{p(x,y)}{p(x)p(y)}. }[/math]
  The mutual information (MI) of the random variables [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] is the expected value of the PMI over all possible outcomes.
  The measure is symmetric ([math]\displaystyle{ SI(x,y)=SI(y,x) }[/math]). It can take on both negative and positive values but is zero if [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are independent, and equal to [math]\displaystyle{ -\log(p(x)) }[/math] if [math]\displaystyle{ X }[/math] and [math]\displaystyle{ Y }[/math] are perfectly associated. Finally, [math]\displaystyle{ SI(x,y) }[/math] will increase if [math]\displaystyle{ p(x|y) }[/math] is fixed but [math]\displaystyle{ p(x) }[/math] decreases.

2009

(Recchia & Jones, 2009) ⇒ Gabriel Recchia, and Michael N. Jones. (2009). “More Data Trumps Smarter Algorithms: Comparing Pointwise Mutual Information with Latent Semantic Analysis.” In: Behavior research methods, 41(3).

2006

http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/2D/MI/pmi.pm
- Assume that the frequency count data associated with a bigram <word1><word2> is stored in a 2x2 contingency table: [math]\displaystyle{ \begin{array}{c|cc|c} & word_2 & \neg word_2 & \\ \hline word_1 & n_{11} & n_{12} & n_{1p} \\ \neg word_1 & n_{21} & n_{22} & n_{2p} \\ \hline & n_{p1} & n_{p2} &n_{pp} \end{array} }[/math] where [math]\displaystyle{ n_{11} }[/math] is the number of times <word1><word2> occur together, [math]\displaystyle{ n_{12} }[/math] is the number of times <word1> occurs with some word other than word2, and [math]\displaystyle{ n_{1p} }[/math] is the number of times in total that word1 occurs as the first word in a bigram.
  The expected values for the internal cells are calculated by taking the product of their associated marginals and dividing by the sample size, for example: [math]\displaystyle{ m_{11} = \frac {n_{p1} n_{1p}}{n_{pp}} }[/math]
  Pointwise Mutual Information (pmi) is defined as the log of the deviation between the observed frequency of a bigram (n11) and the probability of that bigram if it were independent (m11). [math]\displaystyle{ PMI = \log \Bigl( \frac{n_{11}}{m_{11}} \Bigr) }[/math] The Pointwise Mutual Information tends to overestimate bigrams with low observed frequency counts. To prevent this sometimes a variation of pmi is used which increases the influence of the observed frequency. [math]\displaystyle{ PMI = \log \Bigl( \frac{(n_{11})^{\$exp}}{m_{11}} \Bigr) }[/math] The $exp is 1 by default, so by default the measure will compute the Pointwise Mutual Information for the given bigram. To use a variation of the measure, users can pass the $exp parameter using the --pmi_exp command line option in statistic.pl or by passing the $exp to the initializeStatistic() method from their program.
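
The contingency-table formula and its exponent variant can be sketched in a few lines. This is an illustrative Python rendering under stated assumptions, not the Text::NSP code: the function name `pmi_from_table` is hypothetical, the natural logarithm is assumed (a different base only rescales the score), and the counts are made up.

```python
import math

def pmi_from_table(n11, n1p, np1, npp, exp=1):
    """PMI from 2x2 contingency-table counts, with an optional exponent.

    n11: observed bigram count; n1p, np1: marginal counts; npp: sample size.
    exp=1 gives plain PMI; exp > 1 increases the influence of n11.
    Assumes n11 > 0 and non-zero marginals.
    """
    m11 = n1p * np1 / npp  # expected count under independence
    return math.log(n11 ** exp / m11)

# Hypothetical counts for one bigram.
plain = pmi_from_table(n11=10, n1p=20, np1=25, npp=1000)
boosted = pmi_from_table(n11=10, n1p=20, np1=25, npp=1000, exp=2)
```

With these counts the expected frequency is 0.5, so the plain score is log(20) and the exp=2 variant is log(200); a bigram observed only once gains nothing from the exponent, which is how the variant favours better-attested bigrams.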

1994

(Niwa & Nitta, 1994) ⇒ Yoshiki Niwa, and Yoshihiko Nitta. (1994). “Co-occurrence Vectors from Corpora Vs. Distance Vectors from Dictionaries.” In: Proceedings of the 15th conference on Computational linguistics - Volume 1. doi:10.3115/991886.991938
- QUOTE: We use ordinary co-occurrence statistics and measure the co-occurrence likelihood between two words, X and Y, by the mutual information estimate (Church and Hanks, 1989): [math]\displaystyle{ I(\mathbf{X},\mathbf{Y}) = \log^+ \frac{P(\mathbf{X} \mid \mathbf{Y})}{P(\mathbf{X})} }[/math], where P(X) is the occurrence density of word X in a whole corpus, and the conditional probability [math]\displaystyle{ P(\mathbf{X} \mid \mathbf{Y}) }[/math] is the density of word X in a neighborhood of word Y. Here the neighborhood is defined as 50 words before or after any appearance of word Y. (There is a variety of neighborhood definitions such as "100 surrounding words" (Yarowsky 1992) and "within a distance of no more than 3 words ignoring function words" (Dagan et al, 1993).)
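
A window-based estimate of this kind might be sketched as below. Several points are assumptions rather than details from the paper: [math]\displaystyle{ \log^+ }[/math] is read as max(0, log), overlapping neighborhoods are merged into one set of positions, the occurrence of Y itself is excluded from its own neighborhood, and the function name `log_plus_association` is hypothetical.

```python
import math

def log_plus_association(tokens, x, y, window=50):
    """Estimate log+( P(x | y) / P(x) ) from a token list.

    P(x) is the density of x in the whole corpus; P(x | y) is the density of
    x among positions within `window` tokens before or after any y.
    """
    n = len(tokens)
    p_x = tokens.count(x) / n
    # Union of all positions that lie in the neighborhood of some y.
    neighborhood = set()
    for i, tok in enumerate(tokens):
        if tok == y:
            lo, hi = max(0, i - window), min(n, i + window + 1)
            neighborhood.update(j for j in range(lo, hi) if j != i)
    if not neighborhood or p_x == 0:
        return 0.0
    hits = sum(1 for j in neighborhood if tokens[j] == x)
    p_x_given_y = hits / len(neighborhood)
    if p_x_given_y == 0:
        return 0.0  # log 0 is clipped to 0 under the log+ reading
    return max(0.0, math.log(p_x_given_y / p_x))

# Hypothetical token stream with a small window for readability.
tokens = ["tea", "cup", "x1", "x2", "x3", "x4", "tea", "cup", "x5", "x6"]
near = log_plus_association(tokens, "tea", "cup", window=1)
far = log_plus_association(tokens, "x3", "cup", window=1)
```

In the toy stream "tea" fills half of the positions adjacent to "cup" but only a fifth of the corpus, so the score is log(2.5); "x3" never appears next to "cup" and scores 0.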

1991

(Church et al., 1991) ⇒ Kenneth W. Church, William A. Gale, P. Hanks, and D. Hindle. (1991). “Using Statistics in Lexical Analysis.” In: Uri Zernik (ed.). (1991). “Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon.” Lawrence Erlbaum.

1989

(Church & Hanks, 1989) ⇒ Kenneth W. Church, and P. Hanks. (1989). “Word Association Norms, Mutual Information and Lexicography.” In: Proceedings of the 27th Annual Conference of the Association for Computational Linguistics (ACL 1989).
