# Pointwise Mutual Information (PMI) Measure

A Pointwise Mutual Information (PMI) Measure is a binary random variable measure of association for [math]x,y[/math] based on the ratio between the co-occurrence probability [math]p(x,y)[/math] and the probability of observing [math]x[/math] and [math]y[/math] together by chance if they were independent, [math]p(x)p(y)[/math].

**Context:**
- inputs ([math]x, y, \mathbf{C}[/math]):
  - two Multiset Patterns, [math]x, y[/math] (e.g. vocabulary members).
  - a Multiset Set, [math]\mathbf{C}[/math] (e.g. a corpus).
- range: PMI Score.
- definition:
  - [math] \operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.[/math]

- It can overestimate patterns with low observed frequency counts (and be updated to increase the influence of the observed frequency).
- It assumes Statistical Independence.
- It can range from being a Shifted PMI Measure??

**Example(s):**
- Given a 2x2 contingency table [math] \begin{array}{c|cc|c} & pattern_2 & \neg pattern_2 & \\ \hline pattern_1 & n_{11} & n_{12} & n_{1p} \\ \neg pattern_1 & n_{21} & n_{22} & n_{2p} \\ \hline & n_{p1} & n_{p2} & n_{pp} \end{array} [/math], [math] PMI = \log \Bigl( \frac{n_{11}}{m_{11}} \Bigr)[/math], where [math]m_{11} = \frac {n_{p1} n_{1p}}{n_{pp}}[/math].
- a Positive PMI.
- ...

**Counter-Example(s):**

**See:** Mutual Information, Specific Mutual Information, Second-Order PMI, Statistical Independence, PMI Matrix.

## References

### 2016

- (Wikipedia, 2016) ⇒ http://en.wikipedia.org/wiki/Pointwise_mutual_information#Definition Retrieved:2016-2-10.
- The PMI of a pair of outcomes *x* and *y* belonging to discrete random variables *X* and *Y* quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically: [math] \operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.[/math] The mutual information (MI) of the random variables *X* and *Y* is the expected value of the PMI over all possible outcomes (with respect to the joint distribution [math]p(x,y)[/math]). The measure is symmetric ([math]\operatorname{pmi}(x;y)=\operatorname{pmi}(y;x)[/math]). It can take positive or negative values, but is zero if *X* and *Y* are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is positive. PMI maximizes when *X* and *Y* are perfectly associated (i.e. [math]p(x|y)[/math] or [math]p(y|x)=1[/math]), yielding the following bounds: [math] -\infty \leq \operatorname{pmi}(x;y) \leq \min\left[ -\log p(x), -\log p(y) \right] . [/math] Finally, [math]\operatorname{pmi}(x;y)[/math] will increase if [math]p(x|y)[/math] is fixed but [math]p(x)[/math] decreases.

Here is an example to illustrate:

Using this table we can marginalize to get the following additional table for the individual distributions:

With this example, we can compute four values for [math]pmi(x;y)[/math]. Using base-2 logarithms:

(For reference, the mutual information [math]\operatorname{I}(X;Y)[/math] would then be 0.214170945)
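The tables for this example are not reproduced on this page. As a minimal sketch, the following Python snippet computes the four PMI values and the mutual information for an assumed joint distribution over two binary variables. The probabilities in `joint` are an assumption chosen for illustration (they happen to reproduce the mutual information quoted above); the function names are hypothetical.

```python
import math

# Assumed joint distribution p(x, y) for binary X and Y (illustrative values;
# the original example tables are not available in this excerpt).
joint = {(0, 0): 0.10, (0, 1): 0.70, (1, 0): 0.15, (1, 1): 0.05}


def marginals(joint):
    """Marginalize the joint distribution to obtain p(x) and p(y)."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return px, py


def pmi(joint, x, y, base=2.0):
    """pmi(x;y) = log( p(x,y) / (p(x) p(y)) ), base-2 by default."""
    px, py = marginals(joint)
    return math.log(joint[(x, y)] / (px[x] * py[y]), base)


def mutual_information(joint, base=2.0):
    """Expected PMI under the joint distribution (zero-probability cells skipped)."""
    return sum(p * pmi(joint, x, y, base) for (x, y), p in joint.items() if p > 0)


for (x, y) in sorted(joint):
    print(f"pmi(x={x}; y={y}) = {pmi(joint, x, y):.6f}")
print(f"I(X;Y) = {mutual_information(joint):.9f}")
```

Under these assumed probabilities the marginals are p(x=0) = 0.8 and p(y=0) = 0.25, so pmi(x=0; y=0) = log2(0.1 / 0.2) = -1, while pmi(x=1; y=0) = log2(3), roughly 1.585. Individual PMI values take both signs even though their expectation is non-negative.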


### 2016

- (Wikipedia, 2016) ⇒ http://en.wikipedia.org/wiki/Semantic_similarity#Statistical_similarity Retrieved:2016-12-10.
- … PMI (Pointwise mutual information) (+) large vocab, because it uses any search engine (like Google); (−) cannot measure relatedness between whole sentences or documents

### 2015

- (Levy et al., 2015) ⇒ Omer Levy, Yoav Goldberg, and Ido Dagan. (2015). “Improving Distributional Similarity with Lessons Learned from Word Embeddings.” In: Transactions of the Association for Computational Linguistics, 3.
- QUOTE: … A popular measure of this association is pointwise mutual information (PMI) (Church and Hanks, 1990). PMI is defined as the log ratio between w and c’s joint probability and the product of their marginal probabilities, which can be estimated by: [math]PMI(w, c) = \log \frac{\hat{P}(w,c)}{\hat{P}(w) \hat{P}(c)} = \log \frac{\#(w,c) \cdot |D|}{\#(w) \cdot \#(c)}[/math] The rows of [math]M^{PMI}[/math] contain many entries of word-context pairs (w, c) that were never observed in the corpus, for which PMI(w, c) = log 0 = −∞. A common approach is thus to replace the [math]M^{PMI}[/math] matrix with [math]M^{PMI}_0[/math], in which PMI(w, c) = 0 in cases where #(w, c) = 0. A more consistent approach is to use positive PMI (PPMI), in which all negative values are replaced by 0 ...
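The count-based estimate and the PPMI clipping described in this quote can be sketched in plain Python. This is only an illustration under stated assumptions: the toy `pairs` list, the sparse dictionary representation, and the names `pmi_tables` and `lookup` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def pmi_tables(pairs):
    """Estimate PMI and PPMI for every observed (word, context) pair.

    `pairs` is a list of (word, context) observations; its length plays
    the role of |D| in the count-based estimate.
    """
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)

    pmi = {
        (w, c): math.log(n_wc * total / (word_counts[w] * ctx_counts[c]))
        for (w, c), n_wc in pair_counts.items()
    }
    # PPMI: every negative PMI value is replaced by 0.
    ppmi = {key: max(0.0, value) for key, value in pmi.items()}
    return pmi, ppmi


def lookup(table, word, context):
    """Unobserved pairs are stored implicitly and read back as 0."""
    return table.get((word, context), 0.0)


# Toy observations (hypothetical data).
pairs = [
    ("cat", "purr"), ("cat", "purr"), ("cat", "the"),
    ("dog", "bark"), ("dog", "the"), ("dog", "the"),
]
pmi, ppmi = pmi_tables(pairs)
print(pmi[("cat", "the")], ppmi[("cat", "the")])
print(lookup(ppmi, "cat", "bark"))
```

Storing only observed pairs sidesteps the log 0 problem: a missing key is read as 0, which corresponds to the zero-filled matrix, and the clipping step then gives PPMI.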

### 2011

- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/Pointwise_mutual_information
**Pointwise mutual information** (PMI), or **specific mutual information**, is a measure of association used in information theory and statistics. The PMI of a pair of outcomes [math]x[/math] and [math]y[/math] belonging to discrete random variables [math]X[/math] and [math]Y[/math] quantifies the discrepancy between the probability of their coincidence given their joint distribution and the probability of their coincidence given only their individual distributions, assuming independence. Mathematically:

[math]SI(x,y) = \log\frac{p(x,y)}{p(x)p(y)}.[/math]

The mutual information (MI) of the random variables [math]X[/math] and [math]Y[/math] is the expected value of the PMI over all possible outcomes.

The measure is symmetric ([math]SI(x,y)=SI(y,x)[/math]). It can take on both negative and positive values but is zero if [math]X[/math] and [math]Y[/math] are independent, and equal to [math]-\log(p(x))[/math] if [math]X[/math] and [math]Y[/math] are perfectly associated. Finally, [math]SI(x,y)[/math] will increase if [math]p(x|y)[/math] is fixed but [math]p(x)[/math] decreases.
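The two special values can be checked directly from the definition. The following is a short derivation sketch, assuming that "perfectly associated" means [math]p(x|y) = 1[/math] and using [math]p(x,y) = p(x|y)\,p(y)[/math]:

[math]SI(x,y) = \log\frac{p(x|y)\,p(y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{1}{p(x)} = -\log p(x).[/math]

Under independence, [math]p(x,y) = p(x)p(y)[/math], so the ratio equals 1 and [math]SI(x,y) = \log 1 = 0[/math]. The middle form [math]\log\frac{p(x|y)}{p(x)}[/math] also shows why the value grows when [math]p(x|y)[/math] is held fixed and [math]p(x)[/math] shrinks.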

### 2009

- (Recchia & Jones, 2009) ⇒ Gabriel Recchia, and Michael N. Jones. (2009). “More Data Trumps Smarter Algorithms: Comparing Pointwise Mutual Information with Latent Semantic Analysis.” In: Behavior research methods, 41(3).

### 2006

- http://search.cpan.org/dist/Text-NSP/lib/Text/NSP/Measures/2D/MI/pmi.pm
- Assume that the frequency count data associated with a bigram <word1><word2> is stored in a 2x2 contingency table: [math] \begin{array}{c|cc|c} & word_2 & \neg word_2 & \\ \hline word_1 & n_{11} & n_{12} & n_{1p} \\ \neg word_1 & n_{21} & n_{22} & n_{2p} \\ \hline & n_{p1} & n_{p2} & n_{pp} \end{array} [/math] where [math]n_{11}[/math] is the number of times <word1><word2> occur together, [math]n_{12}[/math] is the number of times <word1> occurs with some word other than word2, and [math]n_{1p}[/math] is the total number of times that word1 occurs as the first word in a bigram.
The expected values for the internal cells are calculated by taking the product of their associated marginals and dividing by the sample size, for example: [math]m_{11} = \frac {n_{p1} n_{1p}}{n_{pp}}[/math]

Pointwise Mutual Information (pmi) is defined as the log of the deviation between the observed frequency of a bigram (n11) and the probability of that bigram if it were independent (m11). [math] PMI = \log \Bigl( \frac{n_{11}}{m_{11}} \Bigr)[/math] The Pointwise Mutual Information tends to overestimate bigrams with low observed frequency counts. To prevent this sometimes a variation of pmi is used which increases the influence of the observed frequency. [math] PMI = \log \Bigl( \frac{(n_{11})^{exp}}{m_{11}} \Bigr)[/math] The $exp is 1 by default, so by default the measure will compute the Pointwise Mutual Information for the given bigram. To use a variation of the measure, users can pass the $exp parameter using the --pmi_exp command line option in statistic.pl or by passing the $exp to the initializeStatistic() method from their program.
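As a rough illustration of the two formulas above, here is a small Python sketch (not the Text::NSP implementation). The function name, the choice of base-2 logarithm, and the example counts are assumptions made for this illustration.

```python
import math


def pmi_from_table(n11, n1p, np1, npp, exp=1.0):
    """PMI from 2x2 contingency-table counts.

    n11: observed joint count; n1p, np1: the two marginal counts;
    npp: sample size. exp > 1 increases the influence of the observed
    frequency, mirroring the $exp variant described above.
    The base-2 logarithm is an assumption of this sketch.
    """
    if n11 <= 0:
        raise ValueError("PMI is undefined (log 0) when n11 is 0")
    m11 = n1p * np1 / npp  # expected count under independence
    return math.log(n11 ** exp / m11, 2)


# A bigram seen once whose words occur nowhere else (hypothetical counts)...
rare = pmi_from_table(n11=1, n1p=1, np1=1, npp=1000)
# ...versus a bigram observed ten times.
frequent = pmi_from_table(n11=10, n1p=20, np1=40, npp=1000)
print(rare, frequent)

# With exp=2 the frequent bigram's score rises while the rare one is unchanged.
print(pmi_from_table(1, 1, 1, 1000, exp=2), pmi_from_table(10, 20, 40, 1000, exp=2))
```

With these made-up counts the singleton bigram scores log2(1000), about 9.97, against log2(12.5), about 3.64, for the bigram seen ten times. This is the low-frequency overestimation the text describes. Setting `exp=2` leaves the singleton at about 9.97 and lifts the frequent bigram to log2(125), about 6.97, which narrows the gap.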


### 1994

- (Niwa & Nitta, 1994) ⇒ Yoshiki Niwa, and Yoshihiko Nitta. (1994). “Co-occurrence Vectors from Corpora Vs. Distance Vectors from Dictionaries.” In: Proceedings of the 15th conference on Computational linguistics - Volume 1. doi:10.3115/991886.991938
- QUOTE: We use ordinary co-occurrence statistics and measure the co-occurrence likelihood between two words, X and Y, by the mutual information estimate (Church and Hanks, 1989): [math]I(\mathbf{X},\mathbf{Y}) = \log^+ \frac{P(\mathbf{X} \mid \mathbf{Y})}{P(\mathbf{X})}[/math], where P(X) is the occurrence density of word X in a whole corpus, and the conditional probability [math]P(\mathbf{X} \mid \mathbf{Y})[/math] is the density of word X in a neighborhood of word Y. Here the neighborhood is defined as 50 words before or after any appearance of word Y. (There is a variety of neighborhood definitions such as "100 surrounding words" (Yarowsky 1992) and "within a distance of no more than 3 words ignoring function words" (Dagan et al., 1993).)
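The neighborhood-based estimate in this quote can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that [math]\log^+[/math] returns 0 whenever its argument is below 1, that the positions of Y itself are excluded from its neighborhood, and that overlapping neighborhoods are merged. The function names and the toy token list are hypothetical.

```python
import math


def log_plus(value):
    """Assumed reading of log^+: 0 for arguments below 1, log otherwise."""
    return math.log(value) if value > 1.0 else 0.0


def neighborhood_mi(tokens, x, y, window=50):
    """Estimate log^+( P(x | y) / P(x) ) from a token list.

    P(x) is the density of x in the whole corpus; P(x | y) is the density
    of x among positions within `window` tokens of any occurrence of y.
    """
    n = len(tokens)
    if n == 0:
        return 0.0
    in_neighborhood = [False] * n
    for i, token in enumerate(tokens):
        if token == y:
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    in_neighborhood[j] = True
    size = sum(in_neighborhood)
    p_x = tokens.count(x) / n
    if size == 0 or p_x == 0:
        return 0.0
    hits = sum(1 for i, t in enumerate(tokens) if in_neighborhood[i] and t == x)
    return log_plus((hits / size) / p_x)


tokens = "a b x y c d e f g h".split()
print(neighborhood_mi(tokens, "x", "y", window=1))
print(neighborhood_mi(tokens, "h", "y", window=1))
```

In the toy list, the one-word neighborhood of `y` holds `x` and `c`, so the density of `x` there is 0.5 against a corpus density of 0.1, and the score is log 5. The word `h` never appears near `y`, so the ratio is 0 and the assumed log^+ maps it to 0.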

### 1991

- (Church et al., 1991) ⇒ Kenneth W. Church, William A. Gale, P. Hanks, and D. Hindle. (1991). “Using Statistics in Lexical Analysis.” In: Uri Zernik (ed.). (1991). “Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon.” Lawrence Erlbaum.

### 1989

- (Church & Hanks, 1989) ⇒ Kenneth W. Church, and P. Hanks. (1989). “Word Association Norms, Mutual Information and Lexicography.” In: Proceedings of the 27th Annual Conference of the Association of Computational Linguistics (ACL 1989).