2004 FindingPredominantSensesInUntaggedText

From GM-RKB
Jump to navigation Jump to search

Subject Headings:

Notes

Cited By

Quotes

Abstract

1 Introduction

  • The first sense heuristic which is often used as a baseline for supervised WSD systems outperforms many of these systems which take surrounding context into account. This is shown by the results of the English all-words task in SENSEVAL-2 (Cotton et al., 1998) in figure 1 below, where the first sense is that listed in WordNet for the PoS given by the Penn TreeBank (Palmer et al., 2001). The senses in WordNet are ordered according to the frequency data in the manually tagged resource SemCor (Miller et al., 1993). Senses that have not occurred in SemCor are ordered arbitrarily and after those senses of the word that have occurred. The figure distinguishes systems which make use of hand-tagged data (using HTD) such as SemCor, from those that do not (without HTD). The high performance of the first sense baseline is due to the skewed frequency distribution of word senses. Even systems which show superior performance to this heuristic often make use of the heuristic where evidence from the context is not sufficient (Hoste et al., 2001). Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data so that a WSD system can be tuned to the genre or domain at hand.
  • SemCor comprises a relatively small sample of 250,000 words. There are words where the first sense in WordNet is counter-intuitive, because of the size of the corpus, and because where the frequency data does not indicate a first sense, the ordering is arbitrary. For example the first sense of tiger in WordNet is audacious person whereas one might expect that carnivorous animal is a more common usage. There are only a couple of instances of tiger within SemCor. Another example is embryo, which does not occur at all in SemCor and the first sense is listed as rudimentary plant rather than the anticipated fertilised egg meaning. We believe that an automatic means of finding a predominant sense would be useful for systems that use it as a means of backing-off (Wilks and Stevenson, 1998; Hoste et al., 2001) and for systems that use it in lexical acquisition (McCarthy, 1997; Merlo and Leybold, 2001; Korhonen, 2002) because of the limited size of hand-tagged resources. More importantly, when working within a specific domain one would wish to tune the first sense heuristic to the domain at hand. The first sense of star in SemCor is celestial body, however, if one were disambiguating popular news celebrity would be preferred.

2 Method

  • In order to find the predominant sense of a target word we use a thesaurus acquired from automatically parsed text based on the method of Lin (1998). This provides the [math]\displaystyle{ k }[/math] nearest neighbours to each target word, along with the distributional similarity score between the target word and its neighbour. We then use the WordNet similarity package (Patwardhan and Pedersen, 2003) to give us a semantic similarity measure (hereafter referred to as the WordNet similarity measure) to weight the contribution that each neighbour makes to the various senses of the target word.

,

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2004 FindingPredominantSensesInUntaggedTextDiana McCarthy
Rob Koeling
Julie Weeds
John Carroll
Finding Predominant Senses in Untagged Texthttp://acl.ldc.upenn.edu/acl2004/main/pdf/95 pdf 2-col.pdf