2007 Tokenizing


Subject Headings: Surface Word Segmentation Task, Tokenization Algorithm.

Notes

Cited By

Quotes

1.1 Introduction

  • Most text processing applications like POS taggers, parsers, stemmers, keyword extractors, search engines, etc. operate on words and sentences. Texts in their raw form, however, are just sequences of characters without explicit information about word and sentence boundaries. Before any further processing can be done, a text needs to be segmented into words and sentences. This process is called tokenization. Tokenization divides the character sequence into sentences and the sentences into tokens. Not only words are considered tokens, but also numbers, punctuation marks, parentheses, and quotation marks.
  • There is a fundamental difference between the tokenization of alphabetic languages and that of ideographic languages like Chinese. Alphabetic languages usually separate words by blanks, and a tokenizer which simply replaces whitespace with word boundaries and cuts off punctuation marks, parentheses, and quotation marks at both ends of a word is already quite accurate. The only major problem is the disambiguation of periods, which are ambiguous between abbreviation periods and sentence markers; a small illustration of this ambiguity follows this list. Ideographic languages, on the other hand, provide no comparable information about word boundaries, which makes tokenization a much harder task. The tokenization of alphabetic languages and the tokenization of ideographic languages are actually two rather different tasks which require different methods.
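
The period ambiguity mentioned above can be illustrated with a very small Python sketch. The abbreviation list and the decision rule below are illustrative assumptions, not the method of the quoted chapter.

    # Minimal sketch of period disambiguation. The abbreviation list is a
    # hand-picked assumption; real tokenizers use much larger lexicons.
    ABBREVIATIONS = {"etc.", "e.g.", "i.e.", "Dr.", "Mr.", "Prof."}

    def is_sentence_boundary(token):
        # A token-final period is read as a sentence marker unless the whole
        # token is a known abbreviation. The case where an abbreviation ends
        # the sentence (the period plays both roles) is not handled here.
        return token.endswith(".") and token not in ABBREVIATIONS

    # is_sentence_boundary("etc.")  -> False  (abbreviation period)
    # is_sentence_boundary("dog.")  -> True   (sentence marker)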

1.2 Tokenization Problems

  • In alphabetic languages, words are usually surrounded by whitespace and optionally preceded and followed by punctuation marks, parentheses, or quotes. A simple tokenization rule can therefore be stated as follows: split the character sequence at whitespace positions and cut off punctuation marks, parentheses, and quotes at both ends of the fragments to obtain the sequence of tokens. This simple rule is quite accurate because whitespace and punctuation are fairly reliable indicators of word boundaries; a sketch of the rule is given after this list. Yet, there are some problems which need to be solved.
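
The splitting rule above can be rendered directly. The following Python sketch splits at whitespace and clips punctuation, parentheses, and quotes from both ends of each fragment, keeping the clipped characters as tokens of their own (so that, as in the introduction, punctuation marks also count as tokens). The character inventory is an assumption made for illustration.

    import string

    # Characters clipped from both ends of a whitespace-separated fragment.
    # string.punctuation plus typographic quotes is an assumed inventory.
    CLIP_CHARS = set(string.punctuation + "\u201c\u201d\u2018\u2019")

    def simple_tokenize(text):
        tokens = []
        for fragment in text.split():
            left, right = [], []
            while fragment and fragment[0] in CLIP_CHARS:
                left.append(fragment[0])
                fragment = fragment[1:]
            while fragment and fragment[-1] in CLIP_CHARS:
                right.append(fragment[-1])
                fragment = fragment[:-1]
            tokens.extend(left)              # leading quotes/parentheses
            if fragment:
                tokens.append(fragment)      # the word itself
            tokens.extend(reversed(right))   # trailing punctuation/quotes
        return tokens

    # simple_tokenize('He said: "Wait!"')
    # -> ['He', 'said', ':', '"', 'Wait', '!', '"']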

1.6 Tokenization of Ideographic Languages

  • The tokenization of texts from languages with ideographic writing systems (Chinese, Japanese, Korean) is more difficult because there are no blanks to indicate the word boundaries. A Chinese text consists of a sequence of characters which is only interrupted by punctuation. Most characters are words by themselves, but they can also be part of multi-character words. A few characters are affixes appearing exclusively in multi-character words.
  • There are four major tokenization approaches for these languages: rule-based methods, statistical methods based on word n-grams, tagging approaches, and systems which integrate tokenization with POS tagging or parsing.
  • The simple bigram tokenizer described above usually splits unknown words into smaller units. Many systems recognize unknown words by recombining these fragments in a second step, e.g. by means of support vector machines (Asahara et al., 2003). Others integrate the recognition of unknown words into the statistical model. Sun et al. (2002) and Wu et al. (2003) replace the bigram model by a Hidden Markov model with submodels for the recognition of names. A rough sketch of such a bigram segmenter is given after this list.
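
The word-bigram approach mentioned above can be sketched as a dynamic program over cut positions. The lexicon, the bigram_logprob function, and the single-character fallback below are assumptions made for illustration; the cited systems additionally handle unknown words with dedicated submodels or a recombination step.

    def segment(text, lexicon, bigram_logprob, max_word_len=4):
        # Dictionary-based segmentation with a word-bigram model, computed by
        # dynamic programming over cut positions (a Viterbi-style search).
        # `lexicon` is a set of known words; `bigram_logprob(prev, word)` is
        # assumed to return log P(word | prev).
        n = len(text)
        # best[i] = (score, previous cut, last word) for the prefix text[:i]
        best = {0: (0.0, None, "<s>")}
        for i in range(1, n + 1):
            candidates = []
            for j in range(max(0, i - max_word_len), i):
                word = text[j:i]
                # Fallback: any single character may stand alone, so unknown
                # words come out as sequences of one-character fragments,
                # which is exactly the behaviour criticized in the quote above.
                if word not in lexicon and len(word) > 1:
                    continue
                prev_score, _, prev_word = best[j]
                candidates.append((prev_score + bigram_logprob(prev_word, word), j, word))
            best[i] = max(candidates)
        # Follow the back pointers to recover the segmentation.
        words, i = [], n
        while i > 0:
            _, j, word = best[i]
            words.append(word)
            i = j
        return list(reversed(words))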

References


Helmut Schmid (2007). "Tokenizing." http://www.coli.uni-saarland.de/~schulte/Teaching/ESSLLI-06/Referenzen/Tokenisation/schmid-hsk-tok.pdf