2006 DeterminingWordSenseDomUsingAThesaurus


The degree of dominance of a sense of a word is the proportion of occurrences of that sense in text. We propose four new methods to accurately determine word sense dominance using raw text and a published thesaurus. Unlike the McCarthy et al. (2004) system, these methods can be used on relatively small target texts, without the need for a similarly-sense-distributed auxiliary text. We perform an extensive evaluation using artificially generated thesaurus-sense-tagged data. In the process, we create a word–category co-occurrence matrix, which can also be used for unsupervised word sense disambiguation and for estimating the distributional similarity of word senses.

1 Introduction

The occurrences of the senses of a word usually have a skewed distribution in text. Further, the distribution varies with the domain or topic of discussion. For example, the ‘assertion of illegality’ sense of charge is more frequent in the judicial domain, while in the domain of economics the ‘expense/cost’ sense occurs more often. Formally, the degree of dominance of a particular sense of a word (the target word) in a given text (the target text) may be defined as the ratio of the occurrences of that sense to the total occurrences of the target word. The sense with the highest dominance in the target text is called the predominant sense of the target word.
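This definition is straightforward to operationalize: given sense-tagged occurrences of a target word, the dominance of each sense is its share of the total, and the predominant sense is the one with the highest share. A minimal Python sketch, using invented sense-tagged counts for charge (the 7:3 split below is illustrative, not data from the paper):

```python
from collections import Counter

def sense_dominance(tagged_occurrences):
    """Dominance of each sense: its proportion of the target word's occurrences."""
    counts = Counter(tagged_occurrences)
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}

# Hypothetical sense-tagged occurrences of "charge" in a judicial text.
occurrences = ["assertion of illegality"] * 7 + ["expense/cost"] * 3
dominance = sense_dominance(occurrences)

# The predominant sense is the sense with the highest dominance.
predominant = max(dominance, key=dominance.get)
```

Here `dominance` maps ‘assertion of illegality’ to 0.7 and ‘expense/cost’ to 0.3, so the former is the predominant sense in this (toy) target text.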

2 Thesauri

Published thesauri, such as Roget’s and Macquarie, divide the English vocabulary into around a thousand categories. Each category has a list of semantically related words, which we will call category terms, or c-terms for short. Words with multiple meanings may be listed in more than one category. For every word type in the vocabulary of the thesaurus, the index lists the categories that include it as a c-term. Categories roughly correspond to coarse senses of a word (Yarowsky, 1992), and the two terms will be used interchangeably. For example, in the Macquarie Thesaurus, bark is a c-term in the categories ‘animal noises’ and ‘membrane’. These categories represent the coarse senses of bark. Note that published thesauri are structurally quite different from the “thesaurus” automatically generated by Lin (1998), wherein a word has exactly one entry, and its neighbors may be semantically related to it in any of its senses. All future mentions of thesaurus will refer to a published thesaurus.
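The structure described above — categories listing c-terms, plus an index from each word type back to the categories that contain it — can be sketched as a pair of mappings. The ‘animal noises’/‘membrane’ categories follow the bark example; the remaining c-terms are invented for illustration:

```python
# Toy thesaurus: category -> c-terms (only "bark" and its two categories
# come from the text; the other entries are invented).
categories = {
    "animal noises": ["bark", "meow", "neigh"],
    "membrane": ["bark", "skin", "husk"],
}

# The index maps each word type to the categories listing it as a c-term;
# a word's categories stand in for its coarse senses.
index = {}
for category, c_terms in categories.items():
    for word in c_terms:
        index.setdefault(word, []).append(category)
```

An ambiguous word such as bark appears under both of its categories, whereas a monosemous c-term maps to exactly one.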

While other sense inventories such as WordNet exist, use of a published thesaurus has three distinct advantages: (i) coarse senses — it is widely believed that the sense distinctions of WordNet are far too fine-grained (Agirre and Lopez de Lacalle Lekuona (2003) and citations therein); (ii) computational ease — with just around a thousand categories, the word–category matrix has a manageable size; (iii) widespread availability — thesauri are available (or can be created with relatively little effort) in numerous languages, while WordNet is available only for English and a few Romance languages. We use the Macquarie Thesaurus (Bernard, 1986) for our experiments. It consists of 812 categories with around 176,000 c-terms and 98,000 word types. Note, however, that using a sense inventory other than WordNet means that we cannot directly compare performance with McCarthy et al. (2004), as that would require knowing exactly how thesaurus senses map to WordNet. Further, it has been argued that such a mapping across sense inventories is at best difficult and maybe impossible (Kilgarriff and Yallop (2001) and citations therein).
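The word–category matrix mentioned above can be populated in a single pass over a corpus: for each occurrence of a word, increment its cell for every category evoked (via the thesaurus index) by words in the same sentence. The sketch below is a simplified illustration under that assumption — `build_wccm`, the sentence-sized co-occurrence window, and all data are hypothetical, not the paper's exact procedure:

```python
from collections import defaultdict

def build_wccm(sentences, index):
    """Word-category co-occurrence matrix: wccm[w][c] counts how often
    word w occurs in a sentence containing a c-term of category c."""
    wccm = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # Categories evoked by this sentence, via the thesaurus index;
        # repeated evocations are counted once each.
        evoked = [c for w in sentence for c in index.get(w, [])]
        for w in sentence:
            for c in evoked:
                wccm[w][c] += 1
    return wccm

# Toy thesaurus index and corpus (invented for illustration).
index = {"bark": ["animal noises", "membrane"], "dog": ["animal noises"]}
sentences = [["the", "dog", "bark", "loudly"], ["tree", "bark", "peeled"]]
wccm = build_wccm(sentences, index)
```

With 812 categories, each row of this matrix is a short category-signature for a word, which is what keeps the representation computationally manageable compared to a word–word co-occurrence matrix.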


  • Eneko Agirre and O. Lopez de Lacalle Lekuona. (2003). “Clustering WordNet Word Senses.” In: Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP 2003).
  • J.R.L. Bernard, editor. (1986). The Macquarie Thesaurus. Macquarie Library, Sydney, Australia.
  • Lou Burnard. (2000). Reference Guide for the British National Corpus (World Edition). Oxford University Computing Services.
  • Adam Kilgarriff and Colin Yallop. (2001). “What’s in a Thesaurus?” In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pages 1371–1379, Athens, Greece.
  • Claudia Leacock, Martin Chodorow, and George A. Miller. (1998). “Using Corpus Statistics and WordNet Relations for Sense Identification.” Computational Linguistics, 24(1):147–165.
  • Dekang Lin. (1998). “Automatic Retrieval and Clustering of Similar Words.” In: Proceedings of the 17th International Conference on Computational Linguistics (COLING-98), pages 768–773, Montreal, Canada.
  • Diana McCarthy, Rob Koeling, Julie Weeds, and John Carroll. (2004). “Finding Predominant Senses in Untagged Text.” In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004).
  • Saif Mohammad and Graeme Hirst. Submitted. “Distributional Measures as Proxies for Semantic Relatedness.”
  • Patrick Pantel. (2005). “Inducing Ontological Co-occurrence Vectors.” In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), pages 125–132, Ann Arbor, Michigan.
  • Hinrich Schütze and Jan O. Pedersen. (1997). “A Cooccurrence-Based Thesaurus and Two Applications to Information Retrieval.” Information Processing and Management, 33(3):307–318.
  • David Sheskin. (2003). The Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, Boca Raton, Florida.
  • Jean Véronis. (2005). “HyperLex: Lexical Cartography for Information Retrieval.” To appear in Computer Speech and Language, Special Issue on Word Sense Disambiguation.
  • David Yarowsky. (1992). “Word-Sense Disambiguation Using Statistical Models of Roget’s Categories Trained on Large Corpora.” In: Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454–460, Nantes, France.

Saif Mohammad and Graeme Hirst. (2006). “Determining Word Sense Dominance Using a Thesaurus.” In: Proceedings of EACL-2006.