2008 SupersenseTaggerForItalian

From GM-RKB

Subject Headings: SuperSenseTagger.

Notes

Quotes

Abstract

In this paper we present the procedure we followed to develop the Italian SuperSense Tagger. In particular, we adapted the English SuperSense Tagger to the Italian language by exploiting a parallel sense-labeled corpus for training. Like the English tagger, the Italian tagger uses a fixed set of 26 semantic labels, called supersenses, and achieves a slightly lower accuracy due to the lower quality of the Italian training data. Both taggers accomplish the same task of identifying entities and concepts belonging to a common set of ontological types. This parallelism allows us to define effective methodologies for a broad range of cross-language knowledge acquisition tasks.

1. Introduction

One interesting alternative to traditional NER categories is the set of most general, or top-level, categories defined by WordNet. WordNet has been organized according to psycholinguistic theories on the principles governing lexical memory. As an example, several psycholinguistic experiments discussed in (Miller, 1990) suggest correlations between reaction times and the hierarchical structure of the lexicon. Thus WordNet's broadest categories can serve as a principled basis for a set of categories which exhaustively covers, at least as a first rough approximation, all possible concepts occurring in a sentence. An additional advantage of such categories is that, in principle, they should be shared across different languages. Thus, semantic annotations of this kind could be used for multilingual inference in several language tasks, e.g., information retrieval or machine translation.

To this aim, (Ciaramita and Johnson, 2003) developed a SuperSense Tagging (SST) technology for English, demonstrating that reasonably high tagging accuracy can be obtained even in open-domain contexts. This technology has also been adopted for Ontology Learning (Picca et al., May 2007), as the top-level WordNet SuperSenses cover almost any high-level ontological type of interest in ontology design. Section 2 describes the main features of the English SST.

In this paper we investigate the problem of developing a tagger based on WordNet semantic categories for Italian. The basic idea is that, since the WordNet supersenses are inherently multilingual, the SST technology can be adopted for multilingual ontology learning problems. To this aim, we ported the SST technology to Italian by training the supervised learning algorithm underlying the English distribution of the SST on an Italian sense-tagged corpus called MultiSemCor (Bentivogli et al., 2004).

2. The English SuperSense Tagger

... Using the SemCor corpus, a fraction of the Brown corpus annotated with WordNet word senses, an SST has been implemented (Ciaramita and Altun, 2006) which can be used for annotating large collections of English text. The SST implements a Hidden Markov Model, trained with the perceptron algorithm introduced in (Collins, 2002). Perceptron sequence learning provides an excellent trade-off between accuracy and performance, sometimes outperforming more complex models such as CRFs (Nguyen and Guo, 2007).
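As an illustration of perceptron training for a sequence tagger in the spirit of (Collins, 2002), the following is a minimal sketch and not the actual SST implementation. The feature templates, the three-label tag set, and the two toy sentences are hypothetical and chosen only to keep the example small. Weight averaging, which Collins also describes, is omitted for brevity.

```python
# Minimal structured-perceptron sequence tagger (illustrative sketch only).
# Feature templates, tag set, and training data are hypothetical.

def features(words, i, prev_tag, tag):
    """Features for assigning `tag` at position i, given the previous tag."""
    w = words[i]
    return [
        ("word", w.lower(), tag),      # word identity paired with the tag
        ("cap", w[0].isupper(), tag),  # capitalization paired with the tag
        ("trans", prev_tag, tag),      # tag-bigram (transition) feature
    ]

def local_score(words, i, prev_tag, tag, weights):
    """Sum of the weights of the features that fire at position i."""
    return sum(weights.get(f, 0.0) for f in features(words, i, prev_tag, tag))

def viterbi(words, tags, weights):
    """Return the highest-scoring tag sequence under the current weights."""
    if not words:
        return []
    n = len(words)
    # best[i][t] = (score of the best path ending in tag t at i, backpointer)
    best = [{} for _ in range(n)]
    for t in tags:
        best[0][t] = (local_score(words, 0, "<s>", t, weights), None)
    for i in range(1, n):
        for t in tags:
            best[i][t] = max(
                (best[i - 1][p][0] + local_score(words, i, p, t, weights), p)
                for p in tags
            )
    # Follow backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[n - 1][t][0])
    path = [tag]
    for i in range(n - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

def train(data, tags, epochs=10):
    """Perceptron updates: reward gold features, penalize predicted ones."""
    weights = {}
    for _ in range(epochs):
        for words, gold in data:
            pred = viterbi(words, tags, weights)
            if pred == gold:
                continue  # no mistake, no update
            for i in range(len(words)):
                gold_prev = gold[i - 1] if i else "<s>"
                pred_prev = pred[i - 1] if i else "<s>"
                for f in features(words, i, gold_prev, gold[i]):
                    weights[f] = weights.get(f, 0.0) + 1.0
                for f in features(words, i, pred_prev, pred[i]):
                    weights[f] = weights.get(f, 0.0) - 1.0
    return weights

# Hypothetical toy data using supersense-style labels; "O" marks no label.
TAGS = ["O", "noun.location", "noun.person"]
DATA = [
    (["Rome", "is", "a", "city"],
     ["noun.location", "O", "O", "noun.location"]),
    (["Maria", "is", "a", "teacher"],
     ["noun.person", "O", "O", "noun.person"]),
]

weights = train(DATA, TAGS)
for words, gold in DATA:
    print(list(zip(words, viterbi(words, TAGS, weights))))
```

The update touches weights only when the decoded sequence differs from the gold one, so training cost is dominated by Viterbi decoding. This simplicity is one reason perceptron sequence learning is fast in practice.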

5. Conclusion and future work

In this paper we presented a new Italian SuperSense Tagger able to recognize named entities and concepts in texts with reasonably high accuracy, even if much lower than that of its English counterpart. Nevertheless, the achieved precision is high enough for the tagger to be applied in knowledge acquisition tasks.

These results are encouraging and this research deserves further investigation. First of all, we are going to develop automatic techniques based on parallel corpora to build SSTs for other languages, such as German and French, without exploiting any labeled data. Secondly, by combining this tagger with the English one already developed, we offer a new multilingual tool covering a broader spectrum of categories than traditional Named Entity Recognition systems. Because the category set is fully aligned across languages, the tool can be profitably used as a preprocessing step for bilingual dictionary induction, multilingual ontology learning, and so on. Another direction we are following is the development of a new generation of SST which is able to distinguish between concepts and instances of the same type. Finally, we are going to develop a web service able to extract terminology belonging to different supersenses from the analysis of corpora and web pages in multiple languages.

References

  • Luisa Bentivogli, Pamela Forner, and Emanuele Pianta. (2004). Evaluating cross-language annotation transfer in the MultiSemCor corpus. In: COLING ’04: Proceedings of the 20th International Conference on Computational Linguistics, page 364, Morristown, NJ, USA. Association for Computational Linguistics.
  • Massimiliano Ciaramita and Y. Altun. (2006). Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In: Proceedings of EMNLP-06, pages 594–602, Sydney, Australia.
  • Massimiliano Ciaramita and J. Atserias. (2007). POS tagging with a named entity tagger. Intelligenza Artificiale, 4:28–29.
  • Massimiliano Ciaramita and M. Johnson. (2003). Supersense tagging of unknown nouns in WordNet. In: Proceedings of EMNLP-03, pages 168–175, Sapporo, Japan.
  • M. Collins. (2002). Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In: Proceedings of EMNLP-02.
  • C. Fellbaum. (1998). WordNet: An Electronic Lexical Database. MIT Press.
  • Ralph Grishman and Beth Sundheim. (1996). Message Understanding Conference-6: a brief history. In: Proceedings of the 16th Conference on Computational Linguistics, pages 466–471, Morristown, NJ, USA. Association for Computational Linguistics.
  • T. Koo and M. Collins. (2005). Hidden-variable models for discriminative reranking. In: Proceedings of EMNLP-05, Vancouver, Canada.
  • George A. Miller. (1990). Nouns in WordNet: a lexical inheritance system. International Journal of Lexicography, 3(4):245–264.
  • Nam Nguyen and Yunsong Guo. (2007). Comparison of sequence labeling algorithms and extensions. In: Proceedings of ICML 2007, pages 681–688.
  • Davide Picca, Alfio Gliozzo, and Massimiliano Ciaramita. (May 2007). Semantic domains and supersense tagging for domain-specific ontology learning. In: Proceedings of RIAO 2007.
  • S. Sekine. (2004). Named entity: History and future.


Page: 2008 SupersenseTaggerForItalian
Author: Davide Picca, Massimiliano Ciaramita, Alfio Massimiliano Gliozzo
Title: Supersense Tagger for Italian
URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/599 paper.pdf