* ([[1994_WhatIsAWord|Grefenstette & Tapanainen, 1994]]) ⇒ [[author::Gregory Grefenstette]], and [[author::Pasi Tapanainen]]. ([[year::1994]]). "[http://www.personal.psu.edu/xxl13/teaching/sp07/apling597e/resources/Grefenstette_Tapanainen_1994.pdf What is a Word, What is a Sentence? Problems of Tokenization]." In: [[journal::Proceedings of 3rd Conference on Computational Lexicography and Text Research]] (COMPLEX 1994).
<b>Subject Headings:</b> [[Surface Word Segmentation Task]], [[Tokenization Task]].
===Abstract===
Any linguistic treatment of [[Free Text|freely occurring text]] must provide an answer to what is considered as a [[token]]. In [[artificial languages]], the definition of what is considered as a [[token]] can be [[precisely and unambiguously defined]].
[[Natural languages]], on the other hand, display such a rich variety that there are many ways to decide upon what will be considered as a [[unit]] for a [[computational approach]] to [[text]].
Here we will discuss [[Tokenization Task|tokenization]] as a problem for [[computational lexicography]].
Our discussion will cover the aspects of what is usually considered preprocessing of text in order to prepare it for some automated treatment.
[[We]] present the roles of tokenization, methods of tokenizing, grammars for recognizing acronyms, abbreviations, and regular expressions such as numbers and dates.
[[We]] present the problems encountered and discuss the effects of seemingly innocent choices.
=== 1 Introduction===
The linguistic exploitation of naturally occurring text can be seen as a progression of transformations of the original text.
...
__NOTOC__