1995 UsingInformationContentToEvalSemSim


Subject Headings: Resnik Lexical Semantic Similarity Measure

Notes

Cited By

Quotes

Abstract

This paper presents a new measure of semantic similarity in an IS-A taxonomy, based on the notion of information content. Experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, with an upper bound of r = 0.90 for human subjects performing the same task), and significantly better than the traditional edge counting approach (r = 0.66).

1 Introduction

Evaluating semantic relatedness using network representations is a problem with a long history in artificial intelligence and psychology, dating back to the spreading activation approach of Quillian (1968) and Collins and Loftus (1975). Semantic similarity represents a special case of semantic relatedness: for example, cars and gasoline would seem to be more closely related than, say, cars and bicycles, but the latter pair are certainly more similar. Rada et al. (1989) suggest that the assessment of similarity in semantic networks can in fact be thought of as involving just taxonomic (is-a) links, to the exclusion of other link types; that view will also be taken here, although admittedly it excludes some potentially useful information.

A natural way to evaluate semantic similarity in a taxonomy is to evaluate the distance between the nodes corresponding to the items being compared — the shorter the path from one node to another, the more similar they are. Given multiple paths, one takes the length of the shortest one (Lee et al., 1993; Rada and Bicknell, 1989; Rada et al., 1989).
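To make the edge counting method concrete, here is a minimal Python sketch that computes shortest-path distance over a toy is-a fragment; the taxonomy and helper names are illustrative assumptions, not from the paper:

```python
from collections import deque

# Toy is-a fragment (child -> parents); multiple inheritance would just
# mean listing more than one parent.
PARENTS = {
    "nickel": ["coin"],
    "dime": ["coin"],
    "credit_card": ["medium_of_exchange"],
    "coin": ["medium_of_exchange"],
    "medium_of_exchange": [],
}

def neighbors(concept):
    """Parents plus children: is-a links treated as undirected for path finding."""
    result = list(PARENTS.get(concept, []))
    result += [c for c, ps in PARENTS.items() if concept in ps]
    return result

def edge_distance(c1, c2):
    """Length in edges of the shortest is-a path between two concepts (BFS)."""
    seen, queue = {c1}, deque([(c1, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == c2:
            return dist
        for n in neighbors(node):
            if n not in seen:
                seen.add(n)
                queue.append((n, dist + 1))
    return None  # disconnected

print(edge_distance("nickel", "dime"))         # 2: nickel -> coin -> dime
print(edge_distance("nickel", "credit_card"))  # 3: via medium_of_exchange
```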

A widely acknowledged problem with this approach, however, is that it relies on the notion that links in the taxonomy represent uniform distances. Unfortunately, this is difficult to define, much less to control. In real taxonomies, there is wide variability in the “distance” covered by a single taxonomic link, particularly when certain sub-taxonomies (e.g. biological categories) are much denser than others. For example, in WordNet (Miller, 1990), a broad-coverage semantic network for English constructed by George Miller and colleagues at Princeton, it is not at all difficult to find links that cover an intuitively narrow distance (rabbit ears is-a television antenna) or an intuitively wide one (phytoplankton is-a living thing). The same kinds of examples can be found in the Collins COBUILD Dictionary (Sinclair (ed.), 1987), which identifies superordinate terms for many words (e.g. safety valve is-a valve seems a lot narrower than knitting machine is-a machine).

In this paper, I describe an alternative way to evaluate semantic similarity in a taxonomy, based on the notion of information content. Like the edge counting method, it is conceptually quite simple. However, it is not sensitive to the problem of varying link distances. In addition, by combining a taxonomic structure with empirical probability estimates, it provides a way of adapting a static knowledge structure to multiple contexts. …

2 Similarity and Information Content

Let C be the set of concepts in an is-a taxonomy, permitting multiple inheritance. Intuitively, one key to the similarity of two concepts is the extent to which they share information in common, indicated in an is-a taxonomy by a highly specific concept that subsumes them both. The edge counting method captures this indirectly, since if the minimal path of is-a links between two nodes is long, that means it is necessary to go high in the taxonomy, to more abstract concepts, in order to find a least upper bound. For example, in WordNet, nickel and dime are both subsumed by coin, whereas the most specific superclass that nickel and credit card share is medium of exchange. (See Figure 1)
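The least-upper-bound behavior described above can be observed directly in WordNet, for example through NLTK's interface. This is a hedged sketch: the synset sense indices are assumptions to verify locally, and current WordNet versions may place the shared subsumer differently than the fragment in the paper's Figure 1:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet') first

# Coin senses of "nickel" and "dime" (sense indices are assumptions; check
# with wn.synsets('nickel') on your WordNet version).
nickel = wn.synset('nickel.n.02')
dime = wn.synset('dime.n.01')
credit_card = wn.synset('credit_card.n.01')

# Most specific shared hypernym(s): expected to be coin for nickel/dime,
# and something far more abstract for nickel/credit card.
print(nickel.lowest_common_hypernyms(dime))
print(nickel.lowest_common_hypernyms(credit_card))
```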

Following the standard argumentation of information theory (Ross, 1976), the information content of a concept [math]\displaystyle{ c }[/math] can be quantified as the negative log likelihood, [math]\displaystyle{ -\log p(c) }[/math]. Notice that quantifying information content in this way makes intuitive sense in this setting: as probability increases, informativeness decreases, so the more abstract a concept, the lower its information content. Moreover, if there is a unique top concept, its information content is 0.
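As a worked example, here is a minimal sketch of this quantity, using made-up class probabilities rather than estimates from a real corpus:

```python
import math

# Hypothetical class probabilities (illustrative only): more abstract
# classes are more probable, so they carry less information.
p = {
    "medium_of_exchange": 0.10,
    "coin": 0.005,
    "nickel": 0.0001,
}

def information_content(concept):
    """IC(c) = -log p(c)."""
    return -math.log(p[concept])

for c in ("medium_of_exchange", "coin", "nickel"):
    print(c, round(information_content(c), 2))
# medium_of_exchange 2.30, coin 5.30, nickel 9.21: specificity raises IC,
# and a top concept with p = 1 would have IC 0.
```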

This quantitative characterization of information provides a new way to measure semantic similarity. The more information two concepts share in common, the more similar they are, and the information shared by two concepts is indicated by the information content of the concepts that subsume them in the taxonomy. Formally, define

[math]\displaystyle{ sim(c_1, c_2) = \max_{c \in S(c_1, c_2)} [-\log p(c)] }[/math] (1)

where [math]\displaystyle{ S(c_1, c_2) }[/math] is the set of concepts that subsume both [math]\displaystyle{ c_1 }[/math] and [math]\displaystyle{ c_2 }[/math].
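Putting the pieces together, a minimal sketch of Equation 1 over the same toy fragment and hypothetical probabilities used above (both are illustrative assumptions, not the paper's data):

```python
import math

PARENTS = {
    "nickel": ["coin"],
    "dime": ["coin"],
    "credit_card": ["medium_of_exchange"],
    "coin": ["medium_of_exchange"],
    "medium_of_exchange": [],
}
# Hypothetical class probabilities; p never decreases going up the taxonomy.
P = {"nickel": 0.0001, "dime": 0.0001, "credit_card": 0.002,
     "coin": 0.005, "medium_of_exchange": 0.10}

def subsumers(concept):
    """All concepts subsuming `concept`, including itself."""
    found, stack = set(), [concept]
    while stack:
        c = stack.pop()
        if c not in found:
            found.add(c)
            stack.extend(PARENTS.get(c, []))
    return found

def resnik_sim(c1, c2):
    """Equation 1: max of -log p(c) over the shared subsumers S(c1, c2)."""
    shared = subsumers(c1) & subsumers(c2)
    return max(-math.log(P[c]) for c in shared) if shared else 0.0

print(round(resnik_sim("nickel", "dime"), 2))         # 5.30, via coin
print(round(resnik_sim("nickel", "credit_card"), 2))  # 2.30, via medium_of_exchange
```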

Author: Philip Resnik
Title: Using Information Content to Evaluate Semantic Similarity in a Taxonomy
Year: 1995
URL: http://arxiv.org/PS_cache/cmp-lg/pdf/9511/9511007v1.pdf