Normalized Google Distance

From GM-RKB

A normalized Google distance is a lexical semantic similarity measure NGD(x,y) for two search keywords (x,y) that is based on the counts of the search results associated with each keyword and with their co-occurrence.



References

2009

  • http://en.wikipedia.org/wiki/Semantic_relatedness#Google_distance
    • Google distance is a measure of semantic interrelatedness derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Google distance, while words with dissimilar meanings tend to be farther apart.
      Specifically, the normalized Google distance between two search terms [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] is [math]\displaystyle{ \operatorname{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)} {\log M - \min\{\log f(x), \log f(y)\}} }[/math] where [math]\displaystyle{ M }[/math] is the total number of web pages searched by Google; [math]\displaystyle{ f(x) }[/math] and [math]\displaystyle{ f(y) }[/math] are the number of hits for search terms [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math], respectively; and [math]\displaystyle{ f(x,y) }[/math] is the number of web pages on which both [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] occur.

      If the two search terms [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math] never occur together on the same web page, but do occur separately, the normalized Google distance between them is infinite. If both terms always occur together, their NGD is zero.
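      The definition above can be sketched directly in code. The hit counts below are illustrative placeholders, not actual Google results:

```python
import math

def ngd(f_x, f_y, f_xy, M):
    """Normalized Google Distance from raw hit counts.

    f_x, f_y : number of hits for each search term
    f_xy     : number of pages on which both terms occur
    M        : total number of pages indexed by the search engine
    """
    if f_xy == 0:
        # Terms occur separately but never together: distance is infinite.
        return math.inf
    num = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
    den = math.log(M) - min(math.log(f_x), math.log(f_y))
    return num / den

# Illustrative counts (hypothetical, for demonstration only):
# term x -> 46,700,000 hits, term y -> 12,200,000 hits,
# both together -> 2,630,000 hits, M = 8,058,044,651 indexed pages.
print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # ~0.44
```

      Note that the base of the logarithm cancels between numerator and denominator, so natural log works as well as log base 2.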

2007

  • (Gligorov et al., 2007) ⇒ Risto Gligorov, Warner ten Kate, Zharko Aleksovski, and Frank van Harmelen. (2007). “Using Google distance to weight approximate ontology matches.” In: Proceedings of the 16th International Conference on World Wide Web. doi:10.1145/1242572.1242676.
    • We utilise a dissimilarity measure, called Normalised Google Distance (NGD), introduced in [9]. NGD takes advantage of the number of hits returned by Google to compute the semantic distance between concepts. The concepts are represented with their labels, which are fed to the Google search engine as search terms. Given two search terms x and y, the normalised Google distance between x and y, NGD(x, y), can be obtained as follows: NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)}) where
      • f(x) is the number of Google hits for the search term x,
      • f(y) is the number of Google hits for the search term y,
      • f(x, y) is the number of Google hits for the tuple of search terms x y and
      • M is the number of web pages indexed by Google.
    • Intuitively, NGD(x, y) is a measure for the symmetric conditional probability of co-occurrence of the terms [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math]: given a web page containing one of the terms [math]\displaystyle{ x }[/math] or [math]\displaystyle{ y }[/math], NGD(x, y) measures the probability of that web page also containing the other term. (The NGD measure assumes monotonicity of Google. In reality, Google is known to show non-monotonic behaviour, i.e. adding more words to the search query may increase the number of hits instead of decreasing it. Yet such cases are exceptions and did not affect the results of our experiments.)


  • (Cilibrasi & Vitanyi, 2007) ⇒ R. L. Cilibrasi and P. M. B. Vitanyi. (2007). “The Google Similarity Distance.” In: IEEE Transactions on Knowledge and Data Engineering 19(3). doi:10.1109/TKDE.2007.48
    • 1) The range of the NGD is in between 0 and ∞ (sometimes slightly negative if the Google counts are untrustworthy and state f(x, y) > max{f(x), f(y)}; see Section I-D).
    • 2) The NGD is always nonnegative and NGD(x, x) = 0 for every x.
    • 3) The NGD is scale-invariant in the following sense: Assume that when the number N of pages indexed by Google (accounting for the multiplicity of different search terms per page) grows, the number of pages containing a given search term goes to a fixed fraction of N, and so does the number of pages containing a given conjunction of search terms.
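    • The first two properties can be checked numerically. The hit counts below are made up purely to exercise the formula; a minimal sketch of the NGD computation is repeated for self-containment:

```python
import math

def ngd(f_x, f_y, f_xy, M):
    # NGD from hit counts; see the formula above.
    if f_xy == 0:
        return math.inf
    num = max(math.log(f_x), math.log(f_y)) - math.log(f_xy)
    den = math.log(M) - min(math.log(f_x), math.log(f_y))
    return num / den

# Property 2: a term compared with itself has distance zero,
# since f(x) = f(y) = f(x, x) makes the numerator vanish.
print(ngd(5_000, 5_000, 5_000, 1_000_000))  # 0.0

# Property 1: untrustworthy counts with f(x, y) > max{f(x), f(y)}
# make the numerator, and hence the NGD, slightly negative.
print(ngd(5_000, 5_000, 6_000, 1_000_000))  # < 0
```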