Subject Headings: Surface Word Segmentation Task.
- (Grover et al., 2000) ⇒ Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. (2000). "LT TTT - A Flexible Tokenisation Tool." In: Proceedings of LREC-2000.
- (Ge et al., 1999) ⇒ Xianping Ge, Wanda Pratt, and Padhraic Smyth. (1999). "Discovering Chinese Words from Unsegmented Text." In: Proceedings of SIGIR-1999.
- (Sproat et al., 1996) ⇒ Richard Sproat, William A. Gale, Chilin Shih, and Nancy Chang. (1996). "A Stochastic Finite-state Word-Segmentation Algorithm for Chinese." In: Computational Linguistics, 22(3).
- (Grefenstette & Tapanainen, 1994) ⇒ Gregory Grefenstette, and Pasi Tapanainen. (1994). "What is a Word, What is a Sentence? Problems of Tokenization." In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research (COMPLEX 1994).
10.2 Word Segmentation
- The first step in most text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary may occur many times in a text, yet it remains a single word of the language. There is thus a distinction between words of the vocabulary, called word types, and the multiple occurrences of those words in text, called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important, it is usual to refer to both simply as 'words' whenever the context makes the intended interpretation unambiguous.
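The type/token distinction above can be illustrated with a minimal sketch: a toy whitespace-and-punctuation tokenizer (a simplification, not a full tokenizer of the kind the section discusses) that counts tokens versus distinct types.

```python
import re

def tokenize(text):
    # Toy tokenizer: lowercase, then pull out runs of letters, digits,
    # apostrophes, and hyphens. Real tokenizers must handle far more cases.
    return re.findall(r"[a-z0-9'-]+", text.lower())

text = "the cat sat on the mat and the dog sat too"
tokens = tokenize(text)
types_ = set(tokens)

print(len(tokens))  # 11 word tokens
print(len(types_))  # 8 word types ('the' and 'sat' each occur more than once)
```

Here the sentence contains 11 word tokens but only 8 word types, since 'the' occurs three times and 'sat' twice.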
10.2.2 Hyphenated Words
- Hyphenated segments present a case of ambiguity for a tokenizer: sometimes a hyphen is part of a token, e.g. self-assessment, F-16, forty-two, and sometimes it is not, e.g. New York-based. Essentially, segmentation of hyphenated words answers the question 'One word or two?'
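One simple way to approach the 'one word or two?' question is a lexicon lookup: keep the hyphenated form as a single token when it appears in a list of known hyphenated words, and split it otherwise. The sketch below is a naive illustration of that idea with a toy lexicon, not a technique proposed in the cited works.

```python
# Toy lexicon of forms where the hyphen is part of the token.
# A real system would use a large dictionary and further heuristics.
KNOWN_HYPHENATED = {"self-assessment", "forty-two", "f-16"}

def segment_hyphenated(token):
    """Return the token whole if it is a known hyphenated word,
    otherwise split it at the hyphens."""
    if token.lower() in KNOWN_HYPHENATED:
        return [token]           # one word: hyphen is part of the token
    return token.split("-")      # two (or more) words: split at the hyphen

print(segment_hyphenated("self-assessment"))  # ['self-assessment']
print(segment_hyphenated("York-based"))       # ['York', 'based']
```

A lookup of this kind fails precisely on the hard cases (productive compounds like 'New York-based' that no lexicon can enumerate), which is why hyphen handling remains a genuine source of tokenizer ambiguity.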