2007 UsingGoogleDistToWeightApproxOntMatches

From GM-RKB

Subject Headings: Normalized Google Distance.

Notes

Cited By

Quotes

Abstract

Discovering mappings between concept hierarchies is widely regarded as one of the hardest and most urgent problems facing the Semantic Web. The problem is even harder in domains where concepts are inherently vague and ill-defined, and cannot be given a crisp definition. A notion of approximate concept mapping is required in such domains, but until now, no such notion has been available.

The first contribution of this paper is a definition for approximate mappings between concepts. Roughly, a mapping between two concepts is decomposed into a number of submappings, and a sloppiness value determines the fraction of these submappings that can be ignored when establishing the mapping.

A potential problem of such a definition is that with an increasing sloppiness value, it will gradually allow mappings between any two arbitrary concepts. To improve on this trivial behaviour, we need to design a heuristic weighting which minimises the sloppiness required to conclude desirable matches, but at the same time maximises the sloppiness required to conclude undesirable matches. The second contribution of this paper is to show that a Google based similarity measure has exactly these desirable properties.

We establish these results by experimental validation in the domain of musical genres. We show that this domain does suffer from ill-defined concepts. We take two real-life genre hierarchies from the Web, we compute approximate mappings between them at varying levels of sloppiness, and we validate our results against a handcrafted Gold Standard.

Our method makes use of the huge amount of knowledge that is implicit in the current Web, and exploits this knowledge as a heuristic for establishing approximate mappings between ill-defined concepts.

1 Introduction

3.4 Google-based weighting

We utilise a dissimilarity measure, called Normalised Google Distance (NGD), introduced in [9]. NGD takes advantage of the number of hits returned by Google to compute the semantic distance between concepts. The concepts are represented by their labels, which are fed to the Google search engine as search terms. Given two search terms x and y, the normalised Google distance between x and y, NGD(x, y), is obtained as follows

[math]\displaystyle{ NGD(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log M - \min\{\log f(x), \log f(y)\}} }[/math] where
  • f(x) is the number of Google hits for the search term x,
  • f(y) is the number of Google hits for the search term y,
  • f(x, y) is the number of Google hits for the tuple of search terms x y, and
  • M is the number of web pages indexed by Google (currently, the Google search engine indexes approximately ten billion pages …)

Intuitively, NGD(x, y) is a measure of the symmetric conditional probability of co-occurrence of the terms [math]\displaystyle{ x }[/math] and [math]\displaystyle{ y }[/math]: given a web page containing one of the terms [math]\displaystyle{ x }[/math] or [math]\displaystyle{ y }[/math], NGD(x, y) measures the probability of that web page also containing the other term. (The NGD measure assumes monotonicity of Google. In reality, Google is known to show non-monotonic behaviour, i.e., adding more words to the search query may increase the number of hits instead of decreasing it. Yet such cases are exceptions and did not affect the results of our experiments.)
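To make the computation concrete, here is a minimal Python sketch of the formula above. The function name, the genre labels, and all hit counts are invented for illustration; in practice f(x), f(y), and f(x, y) would come from live search-engine queries, and M is the ten-billion figure quoted above. Note that the choice of logarithm base does not matter, since NGD is a ratio of log differences.

<pre>
import math

def ngd(f_x: float, f_y: float, f_xy: float, M: float) -> float:
    """Normalised Google Distance computed from raw hit counts.

    f_x  -- number of hits for the search term x alone
    f_y  -- number of hits for the search term y alone
    f_xy -- number of hits for the combined query "x y"
    M    -- total number of pages indexed by the search engine
    """
    log_fx, log_fy, log_fxy = math.log(f_x), math.log(f_y), math.log(f_xy)
    # Non-monotonic hit counts (f_xy > min(f_x, f_y)) would make the
    # numerator negative; per the paper, such cases are rare exceptions.
    return (max(log_fx, log_fy) - log_fxy) / (math.log(M) - min(log_fx, log_fy))

# Hypothetical hit counts for two genre labels (invented for illustration):
M = 10_000_000_000   # ~ten billion indexed pages
f_jazz  = 1.0e8      # hits for "jazz"
f_bebop = 5.0e6      # hits for "bebop"
f_joint = 3.0e6      # hits for the query "jazz bebop"

print(ngd(f_jazz, f_bebop, f_joint, M))  # ~0.46: low distance, frequent co-occurrence
</pre>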

References



Risto Gligorov, Warner ten Kate, Zharko Aleksovski, and Frank van Harmelen (2007). "Using Google Distance to Weight Approximate Ontology Matches." In: Proceedings of the 16th International Conference on World Wide Web (WWW 2007). DOI: 10.1145/1242572.1242676. http://www.few.vu.nl/~frankh/postscript/WWW07.pdf