Subject Headings: Surface Word Segmentation Task.


10.2 Word Segmentation

  • The first step in the majority of text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary can occur many times in a text, but it is still a single word of the language. So there is a distinction between words of the vocabulary, or word types, and the multiple occurrences of these words in the text, which are called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important, it is usual to refer to both as 'words' whenever the context makes the interpretation unambiguous.
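The type/token distinction can be made concrete with a short sketch. The `tokenize` function and its regular expression below are illustrative assumptions, not a method taken from the quoted text: it lowercases the input and treats each run of letters or digits as one token.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens using a deliberately naive rule."""
    return re.findall(r"[a-z0-9]+", text.lower())

text = "The cat sat on the mat. The mat was flat."

tokens = tokenize(text)   # every occurrence counts: word tokens
types = set(tokens)       # distinct vocabulary items: word types

print(len(tokens))  # 10 word tokens
print(len(types))   # 7 word types
```

Here 'the' contributes three tokens but only one type, and 'mat' contributes two tokens and one type.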

10.2.2 Hyphenated Words

  • Hyphenated segments present a case of ambiguity for a tokenizer: sometimes a hyphen is part of a token, e.g. self-assessment, F-16, forty-two, and sometimes it is not, e.g. New York-based. Essentially, segmentation of hyphenated words answers the question 'One word or two?'
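One way to approach the 'One word or two?' decision is a lexicon lookup, sketched below. This is a minimal illustration under stated assumptions, not the approach of the quoted text: `MULTIWORD_NAMES` and `tokenize_hyphens` are hypothetical names, and the one-entry lexicon stands in for a much larger resource. A hyphenated segment is kept whole unless the part before the hyphen completes a known multiword name, in which case the hyphen is split off.

```python
import re

# Hypothetical lexicon of multiword names (lowercased word pairs).
MULTIWORD_NAMES = {("new", "york")}

def tokenize_hyphens(text):
    """Keep hyphenated segments whole unless the hyphen follows a multiword name."""
    # Each raw piece is a run of word characters, optionally joined by hyphens.
    raw = re.findall(r"\w+(?:-\w+)*", text)
    tokens = []
    for piece in raw:
        if "-" in piece and tokens:
            left, _, right = piece.partition("-")
            # "New" + "York-based": the left part completes a known name,
            # so the hyphen is not part of a single token.
            if (tokens[-1].lower(), left.lower()) in MULTIWORD_NAMES:
                tokens.extend([left, "-", right])
                continue
        tokens.append(piece)
    return tokens

print(tokenize_hyphens("a New York-based firm"))
print(tokenize_hyphens("self-assessment of the F-16 by forty-two pilots"))
```

The first call splits 'York-based' into 'York', '-', 'based', while the second keeps 'self-assessment', 'F-16' and 'forty-two' as single tokens. The heuristic covers only two-word names and would need a broader lexicon or statistical evidence for real text.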



