2006 AnEndToEndSupervTargetWSDSystem


Subject Headings: Word-Sense Disambiguation System, GATE, WEKA.

Notes

Cited By

Quotes

Abstract

We present an extensible supervised Target-Word Sense Disambiguation system that leverages GATE (General Architecture for Text Engineering), NSP (Ngram Statistics Package), and WEKA (Waikato Environment for Knowledge Analysis) to provide an end-to-end solution that integrates feature identification, feature extraction, preprocessing and classification.

Introduction

Word Sense Disambiguation (WSD) is the task of automatically deciding the sense of an ambiguous word based on its surrounding context. The correct sense is usually chosen from a predefined set of senses, known as the sense inventory. In target-word sense disambiguation, the scope is limited to assigning meaning to occurrences of a few predefined target words in a given corpus of text.

Most popular approaches to WSD use supervised machine learning methods to train a classifier on a set of labeled instances of the ambiguous word and create a statistical model. This model is then applied to unlabeled instances of the ambiguous word to decide their correct sense. In such approaches, the ability to run several experiments based on the choice of (i) features and (ii) the classifier along with its parameters is the key factor in determining the configuration that yields the best accuracy for the task under consideration. This is exactly what our system facilitates: an end-to-end interface for running several WSD experiments, with features identified by many existing and one new GATE (Cunningham et al. 2002) component and with classifiers chosen from WEKA (Witten & Frank 2005).
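A minimal sketch of this train-then-classify setup using the WEKA Java API directly; the ARFF file names (line-train.arff, line-test.arff), their feature layout, and the choice of the J48 decision-tree learner are illustrative assumptions rather than details of the system described here.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TargetWordWsdSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF files: each instance holds context features for one
        // occurrence of the target word; the last attribute is the sense label.
        Instances train = new DataSource("line-train.arff").getDataSet();
        Instances test = new DataSource("line-test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Any WEKA classifier could be plugged in here; J48 is one arbitrary choice.
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // 10-fold cross-validation on the labeled data to compare configurations.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new J48(), train, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Assign a sense to each test instance using the trained model.
        for (int i = 0; i < test.numInstances(); i++) {
            Instance inst = test.instance(i);
            int senseIndex = (int) classifier.classifyInstance(inst);
            System.out.println(i + " -> " + test.classAttribute().value(senseIndex));
        }
    }
}

Varying the classifier (or its parameters) and regenerating the ARFF files with a different feature set corresponds to the kind of experiment configurations such an end-to-end interface is meant to iterate over.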

Background and Related Systems

The GATE framework for text engineering and analysis provides several components that identify and annotate features such as Part-of-Speech tags, named entities and syntactic features including noun phrases and verb phrases. It also provides components for applying machine learning methods to text analysis and for integrating WEKA classifiers into GATE. However, because GATE is a general-purpose system by design, certain limitations prevent it from being used as an off-the-shelf, end-to-end WSD system. First, we found the feature identification method provided by the machine learning component in GATE to be restrictive in two ways: (i) in a target-word scenario, it does not allow extraction of features from annotations that do not surround the target word; (ii) it does not support extraction of floating features, where one might be interested in extracting a feature or set of features that lie at a variable distance from the target word. Second, from the machine learning perspective, the framework does not integrate capabilities such as (i) generation of nominal features from string-valued features, (ii) cross-validation experiments and (iii) automation of train-test experiments. Our system builds upon the GATE framework by addressing these limitations.
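As a toy illustration of fixed-window versus floating features (plain Java, not GATE's API; the token array, cue-word set, and helper names are hypothetical):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class FloatingFeatureSketch {
    // Context words within a fixed window of +/- n tokens around the target position.
    static List<String> windowFeatures(String[] tokens, int targetIndex, int n) {
        List<String> features = new ArrayList<>();
        int start = Math.max(0, targetIndex - n);
        int end = Math.min(tokens.length - 1, targetIndex + n);
        for (int i = start; i <= end; i++) {
            if (i != targetIndex) {
                features.add(tokens[i].toLowerCase());
            }
        }
        return features;
    }

    // Floating feature: does any cue word occur anywhere in the sentence,
    // at whatever distance from the target word?
    static boolean hasFloatingCue(String[] tokens, Set<String> cueWords) {
        return Arrays.stream(tokens).map(String::toLowerCase).anyMatch(cueWords::contains);
    }

    public static void main(String[] args) {
        String[] tokens = "He deposited the check at the bank yesterday".split(" ");
        int targetIndex = 6; // position of the ambiguous target word "bank"
        System.out.println(windowFeatures(tokens, targetIndex, 2)); // [at, the, yesterday]
        System.out.println(hasFloatingCue(tokens, Set.of("deposited", "river"))); // true
    }
}

A boolean or string-valued output like this would then have to be turned into a nominal WEKA attribute, which is one of the gaps the system fills on top of the stock GATE machinery.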

Other publicly available supervised target-word sense disambiguation systems include SenseTools (Pedersen 2001), SyntaLex (Mohammad & Pedersen 2004) and the WSD Shell. SenseTools and the WSD Shell (which is built on top of SenseTools) are a set of modular Perl programs that integrate feature identification using the Ngram Statistics Package (NSP) (Banerjee & Pedersen 2003) with machine learning using WEKA. SyntaLex adds Part-of-Speech and syntactic features to the SenseTools system. The only limitation of these systems is the lack of off-the-shelf components for (i) integration with other knowledge sources such as WordNet and (ii) using features such as named entities, coreference resolution and morphological forms. Our system benefits from the extensive set of components that are already available in GATE.

References

  • Banerjee, S., and Pedersen, T. (2003). The Design, Implementation and Use of the Ngram Statistics Package. In: Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics.
  • Cunningham, H.; Maynard, D.; Bontcheva, K.; and Tablan, V. (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 168–175.
  • Mohammad, S., and Pedersen, T. (2004). Complementarity of Lexical and Simple Syntactic Features: The SyntaLex Approach to SENSEVAL-3. In: Proceedings of the Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text (SENSEVAL-3).
  • Pedersen, T. (2001). Machine Learning with Lexical Features: The Duluth Approach to SENSEVAL-2. In: Proceedings of the Second International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2).
  • Witten, I. H., and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann.
  • http://www.d.umn.edu/~tpederse/sensetools.html
  • http://www.d.umn.edu/~tpederse/syntalex.html
  • http://www.d.umn.edu/~tpederse/wsdshell.html


Citation

Serguei V.S. Pakhomov, Ted Pedersen, Mahesh Joshi, Richard Maclin, and Christopher Chute (2006). "An End-to-end Supervised Target-Word Sense Disambiguation System." http://www.d.umn.edu/~tpederse/Pubs/aaai06-mahesh-demo.pdf