2002 AdaptingTheLeskAlgForWSD

From GM-RKB
Jump to: navigation, search

Subject Headings: Supervised Word Sense Disambiguation Algorithm, Lesk Algorithm.

Notes

Cited By

Quotes

Abstract

All human languages have words that can mean different things in different contexts. Word sense disambiguation is the process of automatically figuring out the intended meaning of such a word when used in a sentence. One of the several approaches proposed in the past is Michael Lesk’s 1986 algorithm. This algorithm is based on two assumptions. First, when two words are used in close proximity in a sentence, they must be talking of a related topic and second, if one sense each of the two words can be used to talk of the same topic, then their dictionary definitions must use some common words. For example, when the words ”pine cone” occur together, they are talking of ”evergreen trees”, and indeed one meaning each of these two words has the words ”evergreen” and ”tree” in their definitions. Thus we can disambiguate neighbouring words in a sentence by comparing their definitions and picking those senses whose definitions have the most number of common words.

The biggest drawback of this algorithm is that dictionary definitions are often very short and just do not have enough words for this algorithm to work well. We deal with this problem by adapting this algorithm to the semantically organized lexical database called WordNet. Besides storing words and their meaning like a normal dictionary, WordNet also ”connects” related words together. We overcome the problem of short definitions by looking for common words not only between the definitions of the words being disambiguated, but also between the definitions of words that are closely related to them in WordNet. We show that our algorithm achieves an 83% improvement in accuracy over the standard Lesk algorithm, and that it compares favourably with other systems evaluated on the same data.

Related Work

In this section we review six papers each of which describes research on word sense disambiguation using WordNet. [5] and [1] represent work that is closest to this thesis. The former shows how one may combat the combinatorial explosion when trying to simultaneously disambiguate all the words in a window of context while the latter describes a method to get the best combination of senses for a window of words, but using a non–Leskian approach.

[17] and [12] show how one may take advantage of the document in which the target word occurs for its disambiguation. While [17] uses words far away from the target word for the disambiguation of the target word, [12] reuses the senses previously assigned to other words in the document to guide the disambiguation of new words in the same document.

Finally, [10] and [13] propose ways of augmenting the unsupervised disambiguation algorithm with some knowledge derived from small quantities of manually sense–tagged data. Although our algorithm takes a purely unsupervised approach, we are always interested in combining the strengths of the two approaches in a bid to achieve better disambiguation without depending entirely on the supervised technique. These two papers give some ideas of how this may be done.

9.1 Approaches that Simultaneously Disambiguate All Words in the Window

Several approaches to word sense disambiguation using WordNet have been proposed in the past. Of these, the algorithm described in Lexical Disambiguation using Simulated Annealing [5] by Jim Cowie, Joe Guthrie and Louise Guthrie is perhaps the most closely related to our algorithms. In this paper the authors attempt to tackle a problem first posed in [11] – that of what order to follow when disambiguating all the content words in a sentence.

Recall that when the original Lesk algorithm [11] disambiguates a word in a sentence, it does not make use of the senses it has assigned to words prior to it in the same sentence. Another possible scheme is to evaluate all possible combinations of senses for the content words of a given sentence, and to pick that combination that is deemed the most appropriate. We call this the global scheme of disambiguation, and have experimented with it in chapter 5. We have shown that indeed this leads to slightly more accurate disambiguation than the method described by Lesk [11] over the same window size. However we have also shown that due to the combinatorial explosion of candidate sense combinations, this method is too intensive to be useful for us. This paper attempts to tackle this computationally intensive aspect through the use of an algorithm called simulated annealing.

This algorithm searches through a huge solution space for that solution that minimizes an error function. Some algorithms that attempt the same thing suffer from the local minima problem in which they get stuck at a point in the solution space which has a lower error than all points in its neighbourhood, but whose error is not the lowest global error. Simulated annealing attempts to reduce the probability of getting stuck in a local minimum by taking random jumps towards higher error points from lower error points during its search through the solution space. These jumps are taken more often in the initial stages of the search when it is more likely for the algorithm to get stuck in a local minimum point, and less often in the later stages when the algorithm is more likely to be close to the global minimum.

To adapt this algorithm to the problem of word sense disambiguation, one must define both a solution space and an error function. Given a sentence, this paper defines a configuration of senses in exactly the same way as we define a candidate combination of senses in chapter 5. Thus given words, each configuration consists of one sense each for each of the those words, and is distinct from every other configuration for the given words. The solution space is made up of all such configurations.

Approaches that Combine Unsupervised and Supervised Algorithms

Besides purely unsupervised and supervised approaches, there have been several attempts to combine the strengths of these two kinds of algorithms. We review below two papers that propose methods of augmenting a WordNet based unsupervised approach to word sense disambiguation with some amount of knowledge in the form of statistics derived from manually sense–tagged data.

Combining Local Context and WordNet Similarity for Word Sense Identification [10] by Claudia Leacock and Martin Chodorow describes two different approaches to WSD, one that uses local context and another that uses WordNet to measure similarity between words, and then attempts to combine these two approaches to create a single better classifier. The authors test their algorithms on the four senses of serve. The local context of serve is defined in this paper to consist of three pieces of information: the part–of– speech tags of the neighbouring words, the closed class words (words that are not marked as nouns, verbs, adjectives or adverbs) and the open class words that occur in the neighbourhood of serve. For the part–of– speech information, a window of 5 words centered around the target word is selected and for each sense of serve, the frequency of each tag at each of the five positions is noted. This allows us to calculate the conditional probability of a sense of serve, given that a certain part-of-speech tag appears at a certain position in the window. Similar conditional probabilities are created for the closed class words. For the open class words, instead of maintaining the exact positional information, only two “positions” are remembered: LEFT and RIGHT. Further, the window size of open class words is left as a variable in the experiments. Once these probabilities are observed from the training data, occurrences of serve in the test data are disambiguated by choosing that sense of serve that has the highest combined probability, given the context of the occurrence of serve. The authors report an accuracy of 83% given an open class window size of 13 words and given a training size of 200 sentences. Although these results are very encouraging, the authors caution that this may be a consequence of the fact that only the most frequent 4 senses of serve were used for experimentation. The low frequency senses of serve account for around 10% of real–life data, and if these are always misclassified, then the maximum accuracy attainable would be 90%.

In the next experiment, the authors describe an algorithm that uses similarity measures between words in WordNet to disambiguate instances of the word serve. Two similarity measures are defined, both of which measure the degree of similarity between two nouns. In the first measure, a “path length” is computed between two nouns according to the least number of synsets one needs to cross when traversing the IS–A tree to get from any sense of the first noun to any sense of the second noun. Thus the shortest such path is when the two nouns have at least one sense each that belong to the same synset, and the path length in this case is 1. The second similarity metric defined in this paper is the ”Most Informative Class” measure. A class is defined as the set containing all the synonyms in a synset as well as all the synonyms in the hyponym synsets of this synset. The frequency of such a class is the sum of the frequencies of all the words in it, as observed in a large corpus; and from such a frequency the probability of the class can be measured. The similarity of two nouns is measured from the probability of the least probable, and therefore most informative, class that both nouns belong to.

[Note that the Lesk algorithm and particularly our adaptation of it to WordNet can be thought of as a similarity measure itself. The score produced by thought of as a measure of the similarity of those two synsets. [15] describes and further investigates this idea.]

To disambiguate a test instance of serve, an instance of serve in the training data is identified such that the first noun to its right either exactly matches or is the one most similar to the first noun to the right of the test instance of serve using either of these two similarity measures. The sense–tag of this training instance of serve is assigned to the test instance of serve. If a decision cannot be made using the noun on the right, the same thing is done with the first noun to the left of serve. Using this algorithm, the authors report accuracies of up to 54% using the path length measure and 52% with the most informative class measure when using a training size of 200 sentences.

Finally the authors describe one way of combining the local context algorithm with the WordNet similarity measure method. The authors search the test data for nouns that have not been used in the training data, and replace them with that noun in the training data they are most similar to. Once this is done, the local context algorithm is run to disambiguate the test instances of serve. The authors report a modest accuracy increase of only 1% when using 200 sentences as training data, but up to 3.5% when using only 10 sentences as training. This implies that perhaps this combined algorithm might be useful when the size of the training data is small.

The similarity measures defined in this paper differ quite a bit from our mechanism of measuring the similarity between two senses in that while the measures here depend on path lengths between synsets and information content etc, our notion of similarity is derived from Lesk’s [11] idea that related senses of words use the same words in their definitions. Further the algorithms in this paper are geared towards a training–test scenario instead of the totally unsupervised approach that we take. However, this paper shows that it may be possible to boost the accuracies obtained by our unsupervised approach when given a little bit of training data to refer to.

  • An Iterative Approach to Word Sense Disambiguation [13] by Rada Mihalcea and Dan I. Moldovan

presents an algorithm that disambiguates nouns and verbs in a given text in an iterative fashion. That is, given text containing words with multiple senses, this algorithm makes several passes over the text, each time assigning sense–tags to a few new words until as many words as possible have been disambiguated. The algorithm uses the hierarchical structure ofWordNet to measure the semantic relatedness of words, and SemCor [14], a hand–disambiguated corpus, to boot-strap the algorithm. The algorithm makes 10 passes or iterations over the input text. ...

Future Work

We have shown that the classic Lesk algorithm for word sense disambiguation can be extended to the lexical database WordNet in a way that improves disambiguation accuracy. We have also shown how certain relations in WordNet are particularly adept at accurately discriminating between appropriate and inappropriate senses for words, where appropriateness is decided by the context in which the word occurs. We find that the accuracies achieved by our system compares favourably with those achieved on the same data by the participants of the SENSEVAL–2 word sense disambiguation exercise.

However, our best precision values continue to hover around the 35 to 40% mark. This is half as good as the precision values achieved by the best supervised algorithms, and is certainly something we would like to improve. Further, in this thesis we have also been equally interested in gaining an understanding of the various strengths and limitations of the basic method underlying our algorithms. We propose below certain directions in which further research could be done to both improve disambiguation accuracy and to better understand the underlying process.

Applying the Global Disambiguation Algorithm to I.R.

We have hypothesized that the global disambiguation strategy discussed and experimented with in chapter 5 is a useful algorithm when the words in the window of context “belong together”. For instance, this algorithm does remarkably well in disambiguating the words in the sentences time flies like an arrow and fruit flies like a banana. However, the data we have used for experimentation generally consists of long sentences of flowing text where most words in a given sentence are not strongly related to each other, and only small pockets of related words may be found in them. We have shown that this sort of data does not require the exhaustive series of comparisons that the global method performs, and in fact lends itself to the local scheme of disambiguation.

One way to test whether the global strategy does in fact perform better when given a set of highly related words is to apply it to the realm of information retrieval. Queries submitted to information retrieval systems are often short and consist of words that are closely related to each other. For example the query string hospital aids patient contains three words such that a particular sense of each word is highly appropriate when these three words appear together. We believe that the global algorithm will perform well in disambiguating the words in this query which will help the information retrieval system.

Using Lexical Chains

One of the drawbacks of our algorithms is that most of the words in the window of context are not really related to the target word. This adversely affects the global method which must then perform multitudes of comparisons between glosses that are not related. This also affects both the global and the local method in that comparisons between words none of whose senses are related to each other can only contribute spurious chance overlaps that can potentially mislead the disambiguation algorithm.

One possible way to handle this limitation could be to detect those words in the context of the target word that are most likely to be related to it, and to then use only these words to perform disambiguation. [8] proposes the idea of lexical chains which could help us to do this. The idea of lexical chains is based on the intuition that coherent text usually contains multiple references to the same idea or to a group of related ideas. These references often employ words that are synonyms of each other, are closely related through one of the relationships of WordNet (like part–of, is–a, etc) or simply topically related to each other (eg, Antarctica and penguin). [8] describes a method to form lexical chains where each such chain links together words in a text that are closely related to each other.

We could apply this idea of lexical chains to our algorithms by selecting from the context of the target word only those words that form lexical chains with the target word, and then using just these words for disambiguation. We believe that this should help us get rid of the unrelated words that only add noise to the disambiguation algorithm.

Several issues may arise in such an adaptation. First, the lexical chains formed by [8] actually link together senses of words, and not just the words themselves. Thus for example we could have a chain that links the first noun sense of the target word with the second verb sense of a word to its (not necessarily immediate) left and the first noun sense of a word to its right. It is possible that one is unable to form even a single chain with a certain sense of the target word. This can happen when that sense is totally unrelated to the surrounding context. For example, in a discourse on finance, it is unlikely that a lexical chain can be formed with the river bank sense of the word bank. Finding the lexical chains involving the target word therefore has the effect of doing a partial disambiguation which rules out some of the obviously unrelated senses of a given word.

Improving the Matching with Disambiguated Glosses

Our algorithm is crucially dependent on the matching of words between glosses, example strings etc. Thus far we have stemmed the words in the glosses and example strings, and then sought to find exact matches between them. Recall that our premise in looking for word matches between glosses is based on Lesk’s intuition that definitions of related senses will use the same words. However the words used in the definitions may be polysemous themselves....

  • Further, this can also help us in reporting matches between two words that are synonyms of each other. It

is possible that two glosses use two different synonyms of the same word; with our present algorithm we would not be able to match these two synonyms. However with disambiguated glosses, these two synonyms would be tagged with the same sense, and since we are matching senses, we would indeed return these two words as a successful match.

  • Fortunately a version of WordNet which has all the words in its glosses pre–disambiguated is on the anvil as of spring 2002 [7]. Besides disambiguating glosses, this version of WordNet also promises to include links between words that are morphological variations of each other. [7] cites connections between nouns and verbs that are variations of each other as an example of relations that are both useful and absent in the current version of WordNet. For example, there is no link between the noun summary and the verb summarize. Several researchers have commented on the lack of links between synsets belonging to the various parts of speech as a limitation of WordNet. It will be interesting to experiment with such links as well as with the disambiguated glosses once the enhanced version of WordNet is made available.

References

  • [1] Eneko Agirre and G. Rigau. Word sense disambiguation using conceptual density. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pages 16–22, Copenhagen, Denmark, 1996.
  • [2] (Banerjee and Pedersen, 2002) ⇒ Satanjeev Banerjee, and Ted Pedersen. (2002). “An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet.” In: Proceedings of CICLing (2002). Lecture Notes In Computer Science; Vol. 2276.
  • [3] Eric D. Brill. A simple rule-based part of speech tagger. In: Proceedings of the Third Conference on Applied Computational Linguistics, Trento, Italy, 1992.
  • [4] Y. Choueka and S. Lusignan. Disambiguation by short contexts. Computers and the Humanities, 19:147–157, 1985.
  • [5] J. Cowie, J. Guthrie, and L. Guthrie. Lexical disambiguation using simulated annealing. In: Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 1172–1176, Nantes, France, 1992.
  • [6] C. Fellbaum, editor. WordNet: An electronic lexical database. MIT Press, 1998.
  • [7] Sanda M. Harabagiu, George A. Miller, and Dan Moldovan. WordNet 2 – a morphologically and semantically enhanced resource. In: Proceedings of SIGLEX–99, pages 1–8, College Park, MD, June 1999.
  • [8] Graeme Hirst and D. St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum, pp. 305–332, 1998.
  • [9] http://www.sle.sharp.co.uk/senseval2, 2001.
  • [10] C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In Fellbaum, pp. 265–283, 1998.
  • [11] (Lesk, 1986) ⇒ Michael Lesk. (1986). “Automatic Sense Disambiguation Uusing Machine Readable Dictionaries: How to tell a pine cone from a ice cream cone.” In: Proceedings of SIGDOC-1986.
  • [12] X. Li, S. Szpakowicz, and Stan Matwin. A WordNet-based algorithm for word sense disambiguation. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Montreal, August 1995.
  • [13] Rada Mihalcea and Dan Moldovan. An iterative approach to word sense disambiguation. In: Proceedings of Flairs 2000, pages 219–223, Orlando, FL, May 2000.
  • [14] George A. Miller, C. Leacock, R. Tengi, and T. Bunker. A semantic concordance. In: Proceedings of ARPA workshop on Human Language Technology, pages 303–308, Plainsboro, NJ, 1977.
  • [15] S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City, February 2003.
  • [16] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
  • [17] E. Voorhees. Using WordNet to disambiguate word senses for text retrieval. In: Proceedings of SIGIR ’93, pages 171–180, Pittsburgh, PA, 1993.
  • [18] G. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA, 1949.,


 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2002 AdaptingTheLeskAlgForWSDSatanjeev Banerjeehttp://www.d.umn.edu/~tpederse/Pubs/banerjee.pdf2002

[[title::Adapting the Lesk algorithm for Word Sense Disambiguation to WordNet|]]