1994 WhatIsAWord
- (Grefenstette & Tapanainen, 1994) ⇒ Gregory Grefenstette, and Pasi Tapanainen. (1994). “What is a Word, What is a Sentence? Problems of Tokenization.” In: Proceedings of 3rd Conference on Computational Lexicography and Text Research (COMPLEX 1994).
Subject Headings: Surface Word Segmentation Task, Tokenization Task.
Notes
Quotes
Abstract
Any linguistic treatment of freely occurring text must provide an answer to what is considered as a token. In artificial languages, the definition of what is considered as a token can be precisely and unambiguously defined. Natural languages, on the other hand, display such a rich variety that there are many ways to decide upon what will be considered as a unit for a computational approach to text. Here we will discuss tokenization as a problem for computational lexicography. Our discussion will cover the aspects of what is usually considered preprocessing of text in order to prepare it for some automated treatment. We present the roles of tokenization, methods of tokenizing, grammars for recognizing acronyms, abbreviations, and regular expressions such as numbers and dates. We present the problems encountered and discuss the effects of seemingly innocent choices.
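Editorial illustration (not from the paper): the abstract mentions recognizing abbreviations, acronyms, numbers and dates with regular expressions. A minimal sketch of that idea follows. The pattern set, the abbreviation list, and the names `TOKEN_PATTERN` and `tokenize` are assumptions made for this example and are not the authors' grammars.

```python
import re

# Hypothetical patterns for illustration only; order matters, because the
# first alternative that matches at a position wins.
TOKEN_PATTERN = re.compile(
    r"""
    \d{1,2}/\d{1,2}/\d{2,4}        # dates such as 12/05/1994
  | \d+(?:[.,]\d+)*                # numbers such as 3.14 or 1,000,000
  | (?:[A-Za-z]\.){2,}             # acronyms such as U.S.A.
  | (?:Mr|Mrs|Dr|Prof|etc)\.       # a small, assumed abbreviation list
  | \w+(?:-\w+)*                   # ordinary words, including hyphenated ones
  | [^\w\s]                        # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens found in text, scanning left to right."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenize("Dr. Smith paid 1,000.50 on 12/05/1994 in the U.S.A."))
```

In this sketch the final period of "U.S.A." is absorbed into the acronym token, so the sentence boundary is no longer marked by a separate token. This is one instance of the "seemingly innocent choices" the abstract refers to.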
1 Introduction
The linguistic exploitation of naturally occurring text can be seen as a progression of transformations of the original text.
…
Author | Gregory Grefenstette and Pasi Tapanainen
journal | Proceedings of 3rd Conference on Computational Lexicography and Text Research
title | What is a Word, What is a Sentence? Problems of Tokenization
titleUrl | http://www.personal.psu.edu/xxl13/teaching/sp07/apling597e/resources/Grefenstette Tapanainen 1994.pdf
year | 1994