Text Tokenization System


A Text Tokenization System is a data processing system that implements a tokenization algorithm to solve a tokenization task.



  • http://en.wiktionary.org/wiki/tokenizer
    • (computing) A system that parses an input stream into tokens.
  • http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/tokenize-97/tokenize.html
    • Lexical tokenization requires a tokenizing transducer to break a given text into a sequence of tokens. This is language dependent.
    • Input: Raw text in character form.
    • Output: The output will be tokenized, one token on a line.
    • Currently supported languages: English (eng), Czech (cze), Danish (dan), Finnish (fin), French (fre), Italian (ita), Norwegian (nor, bok), Polish (pol), Russian (rus), Spanish (spa), Swedish (swe).
    • Also: http://www.cis.upenn.edu/~cis639/docs/tokenize.html
  • http://www.let.rug.nl/~vannoord/Fsa/
    • This page describes the FSA Utilities toolbox: a collection of utilities to manipulate regular expressions, finite-state automata and finite-state transducers. Manipulations include automata construction from regular expressions, determinization (both for finite-state acceptors and finite-state transducers), minimization, composition, complementation, intersection, Kleene closure, etc. Various visualization tools are available to browse finite-state automata. Interpreters are provided to apply finite automata. Finite automata can also be compiled into stand-alone C programs. FSA6 extends FSA5 by allowing predicates on arcs instead of atomic symbols. If you want to compile FSA yourself, then you need SICStus Prolog, SWI-Prolog or YAP. In addition, there are binaries available for various platforms (created with SICStus Prolog; you don't need SICStus Prolog in order to use these binaries). The toolbox comes with an optional graphical user interface (SICStus Prolog only) and an optional command interpreter. The toolbox can also be applied as a UNIX filter, or as a Prolog library.
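
The input/output contract quoted above (raw text in character form as input, one token per line as output) can be illustrated with a minimal sketch. It is not the Xerox tokenizing transducer or the FSA Utilities toolbox; it swaps in a regular expression from Python's standard `re` module. The names `TOKEN_PATTERN`, `tokenize`, and `format_tokens` are hypothetical, and the token pattern is an assumption suited to English-like text only, since tokenization is language dependent.

```python
import re

# Hypothetical token pattern (an assumption, not taken from the cited tools).
# Alternatives are tried left to right:
#   1. a number, optionally with "." or "," groups (e.g. 3.50 or 1,000)
#   2. a word, optionally with internal apostrophes or hyphens
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:['-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return the tokens of `text` as a list, in order of appearance."""
    return TOKEN_PATTERN.findall(text)


def format_tokens(text):
    """Render the tokens of `text` one per line."""
    return "\n".join(tokenize(text))


if __name__ == "__main__":
    print(format_tokens("Dr. Smith's state-of-the-art parser costs 3.50 dollars."))
```

Whitespace is discarded and each punctuation mark becomes its own token, so an abbreviation such as "Dr." is split into two tokens. A production tokenizer would typically handle abbreviations and other language-specific cases with dedicated rules or a lexicon.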