Text Tokenization System

A Text Tokenization System is a data processing system that implements a text tokenization algorithm to solve a text tokenization task.

Example(s):
- nltk.tokenize.punkt Tokenizer, e.g. http://text-processing.com/demo/tokenize/
- LT TTT2 System (or LT TTT Tokenization System).
- SentencePiece.
- tiktoken.
- …
Counter-Example(s):
- a Lemmatizer.
- a Morphological Parser.
- a Word Segmentation System.
- an Optical Character Recognition System.
See: DNA Segmentation System.

References

2009

http://en.wiktionary.org/wiki/tokenizer
- (computing) A system that parses an input stream into tokens.
http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/tokenize-97/tokenize.html
- Lexical tokenization requires a tokenizing transducer to break a given text into a sequence of tokens. This is language dependent.
- Input: Raw text in character form.
- Output: The output will be tokenized, one token on a line.
- Currently supported languages: English (eng), Czech (cze), Danish (dan), Finnish (fin), French (fre), Italian (ita), Norwegian (nor, bok), Polish (pol), Russian (rus), Spanish (spa), Swedish (swe).
- Also: http://www.cis.upenn.edu/~cis639/docs/tokenize.html
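The input/output contract described above (raw character text in, one token per line out) can be illustrated with a minimal sketch. This is not the Xerox finite-state tokenizer: it swaps in a plain regular expression, and the names TOKEN_PATTERN, tokenize, and format_one_per_line are hypothetical, chosen only for this illustration. The pattern is a rough English-oriented assumption and ignores the language-dependent rules the source mentions.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not taken from any cited tool):
# a word with optional internal apostrophes or hyphens, or else a single
# non-space, non-word character such as a punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Break raw character text into a sequence of tokens."""
    return TOKEN_PATTERN.findall(text)


def format_one_per_line(tokens: List[str]) -> str:
    """Render tokens in the 'one token on a line' output format."""
    return "\n".join(tokens)


if __name__ == "__main__":
    print(format_one_per_line(tokenize("Don't panic, it's a well-known rule.")))
```

A real system would likely replace the single pattern with language-specific rules for abbreviations, clitics, and numeric expressions.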
http://www.let.rug.nl/~vannoord/Fsa/
- This page describes the FSA Utilities toolbox: a collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers. Manipulations include automata construction from regular expressions, determinization (both for finite-state acceptors and finite-state transducers), minimization, composition, complementation, intersection, Kleene closure, etc. Various visualization tools are available to browse finite-state automata. Interpreters are provided to apply finite automata. Finite automata can also be compiled into stand-alone C programs. FSA6 extends FSA5 by allowing predicates on arcs instead of atomic symbols. If you want to compile FSA yourself, then you need SICStus Prolog, SWI-Prolog or YAP. In addition, there are binaries available for various platforms (created with SICStus Prolog; you don't need SICStus Prolog in order to use these binaries). The toolbox comes with an optional graphical user interface (SICStus Prolog only) and an optional command interpreter. The toolbox can also be applied as a UNIX filter, or as a Prolog library.

2000

(Grover et al., 2000) ⇒ Claire Grover, Colin Matheson, Andrei Mikheev, and Marc Moens. (2000). “LT TTT - A flexible tokenisation tool.” In: Proceedings of LREC-2000.

1989

(Abney, 1989) ⇒ Steven P. Abney. (1989). “Parsing By Chunks.” In: The MIT Parsing Volume, 1988-89. Center for Cognitive Science, MIT.
- QUOTE: ... A typical natural language parser processes text in two stages. A tokenizer/morphological analyzer converts a stream of characters into a stream of words, and the parser proper converts a stream of words into a parsed sentence, or a stream of parsed sentences.
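The two-stage arrangement in the quote can be sketched as a pair of chained generators. This is only an illustration under stated assumptions: token_stream and sentence_stream are hypothetical names, the regular expression is a simplification with no morphological analysis, and the second stage merely groups tokens at sentence-final punctuation as a stand-in for the parser proper, which would build syntactic structure.

```python
import io
import re
from typing import Iterable, Iterator, List

# Simplified token pattern (an assumption): a run of word characters,
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def token_stream(lines: Iterable[str]) -> Iterator[str]:
    """Stage 1: turn a stream of character lines into a stream of tokens."""
    for line in lines:
        yield from TOKEN_PATTERN.findall(line)


def sentence_stream(tokens: Iterable[str]) -> Iterator[List[str]]:
    """Stage 2 stand-in: group tokens into sentences at '.', '!' or '?'.

    A real parser would return parsed structures, not flat token lists.
    """
    sentence: List[str] = []
    for tok in tokens:
        sentence.append(tok)
        if tok in {".", "!", "?"}:
            yield sentence
            sentence = []
    if sentence:  # emit any trailing tokens that lack final punctuation
        yield sentence


if __name__ == "__main__":
    text = io.StringIO("The cat sat.\nIt slept")
    for sent in sentence_stream(token_stream(text)):
        print(sent)
```

Because both stages are generators, the second stage consumes tokens as the first produces them, so the whole input never needs to be held in memory.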
