Subject Headings: Surface Word Segmentation Task.
- (Grover et al., 2000) ⇒ Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. (2000). "LT TTT - A Flexible Tokenisation Tool." In: Proceedings of LREC-2000.
- (Ge et al., 1999) ⇒ Xianping Ge, Wanda Pratt, and Padhraic Smyth. (1999). "Discovering Chinese Words from Unsegmented Text." In: Proceedings of SIGIR-1999.
- (Sproat et al., 1996) ⇒ Richard Sproat, William A. Gale, Chilin Shih, and Nancy Chang. (1996). "A Stochastic Finite-state Word-Segmentation Algorithm for Chinese." In: Computational Linguistics, 22(3).
- (Grefenstette & Tapanainen, 1994) ⇒ Gregory Grefenstette, and Pasi Tapanainen. (1994). "What is a Word, What is a Sentence? Problems of Tokenization." In: Proceedings of the 3rd Conference on Computational Lexicography and Text Research (COMPLEX 1994).
10.2 Word Segmentation
- The first step in most text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary may occur many times in a text, yet it remains a single word of the language. There is thus a distinction between words of the vocabulary, called word types, and the multiple occurrences of those words in text, called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important, it is usual to refer to both simply as 'words' whenever the context makes the intended interpretation unambiguous.
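The type/token distinction above can be illustrated with a minimal sketch: a toy whitespace-and-punctuation tokenizer (a simplification, not a full tokenizer of the kind the section discusses) that counts tokens versus distinct types.

```python
import re

def tokenize(text):
    # Toy tokenizer: lowercase, then pull out runs of letters, digits,
    # apostrophes, and hyphens. Real tokenizers must handle far more cases.
    return re.findall(r"[a-z0-9'-]+", text.lower())

text = "the cat sat on the mat and the dog sat too"
tokens = tokenize(text)
types_ = set(tokens)

print(len(tokens))  # 11 word tokens
print(len(types_))  # 8 word types ('the' and 'sat' each occur more than once)
```

Here the sentence contains 11 word tokens but only 8 word types, since 'the' occurs three times and 'sat' twice.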
10.2.2 Hyphenated Words
- Hyphenated segments present a case of ambiguity for a tokenizer: sometimes a hyphen is part of a token, e.g. self-assessment, F-16, forty-two, and sometimes it is not, e.g. New York-based. Essentially, segmentation of hyphenated words answers the question 'One word or two?'
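One simple way to approach the 'one word or two?' question is a lexicon lookup: keep the hyphenated form as a single token when it appears in a list of known hyphenated words, and split it otherwise. The sketch below is a naive illustration of that idea with a toy lexicon, not a technique proposed in the cited works.

```python
# Toy lexicon of forms where the hyphen is part of the token.
# A real system would use a large dictionary and further heuristics.
KNOWN_HYPHENATED = {"self-assessment", "forty-two", "f-16"}

def segment_hyphenated(token):
    """Return the token whole if it is a known hyphenated word,
    otherwise split it at the hyphens."""
    if token.lower() in KNOWN_HYPHENATED:
        return [token]           # one word: hyphen is part of the token
    return token.split("-")      # two (or more) words: split at the hyphen

print(segment_hyphenated("self-assessment"))  # ['self-assessment']
print(segment_hyphenated("York-based"))       # ['York', 'based']
```

A lookup of this kind fails precisely on the hard cases (productive compounds like 'New York-based' that no lexicon can enumerate), which is why hyphen handling remains a genuine source of tokenizer ambiguity.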