Written Word Segmentation Task

From GM-RKB

A Written Word Segmentation Task is a text segmentation task that is a word identification task.



References

2022

  • (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation Retrieved:2022-3-21.
    • Word segmentation is the problem of dividing a string of written language into its component words.

      In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.

      However, the equivalent of the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese and Japanese, where sentences but not words are delimited; Thai and Lao, where phrases and sentences but not words are delimited; and Vietnamese, where syllables but not words are delimited.

      In some writing systems, however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.

      The Unicode Consortium has published a Standard Annex on Text Segmentation, [1] exploring the issues of segmentation in multiscript texts.

      Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.

      Word splitting may also refer to the process of hyphenation.
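
Word splitting of concatenated text, as described in the entry above, can be illustrated with a small dictionary-based sketch. This is only one possible approach, not a method named by the cited source: it assumes a known vocabulary is available and uses dynamic programming to pick the segmentation with the fewest words. The function name `split_words` and the sample vocabulary are hypothetical, chosen for illustration.

```python
from typing import List, Optional, Set


def split_words(text: str, vocabulary: Set[str]) -> Optional[List[str]]:
    """Split space-free text into vocabulary words, preferring fewer words.

    Returns None when no segmentation into vocabulary words exists.
    """
    n = len(text)
    # best[i] is a fewest-word segmentation of text[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    # No candidate word can be longer than the longest vocabulary entry.
    max_len = max((len(word) for word in vocabulary), default=0)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prefix = best[start]
            piece = text[start:end]
            if prefix is None or piece not in vocabulary:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary; real systems rely on large lexicons or statistical models.
VOCABULARY = {"the", "ice", "box", "icebox", "is", "in", "pig", "sty", "pigsty"}

print(split_words("theiceboxisinthepigsty", VOCABULARY))
# -> ['the', 'icebox', 'is', 'in', 'the', 'pigsty']
```

Preferring the fewest words makes the sketch choose the closed compound "icebox" over "ice" + "box". That is an arbitrary tie-breaking policy, and it shows why compounds make the task ambiguous even when a dictionary is available.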

2009

  • (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation
    • Word segmentation is the problem of dividing a string of written language into its component words. In English and many other modern languages using some form of the Latin alphabet, dividing text at the space character is a good approximation of word segmentation. (Some examples where the space character alone may not be sufficient include contractions like can't for can not.) However, the equivalent of this character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, and Thai.
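
The space-based approximation and its limits with contractions can be sketched as follows. This is a minimal illustration, not a method prescribed by the cited source: the names `whitespace_tokens`, `expand_contractions`, and the one-entry `CONTRACTIONS` table are hypothetical.

```python
from typing import Dict, List

# Hypothetical, deliberately tiny lookup table; a real system would need a far larger one.
CONTRACTIONS: Dict[str, List[str]] = {"can't": ["can", "not"]}


def whitespace_tokens(text: str) -> List[str]:
    """Approximate word segmentation by splitting on runs of whitespace."""
    return text.split()


def expand_contractions(tokens: List[str]) -> List[str]:
    """Replace known contractions with their component words.

    Lookup is case-insensitive, so an expanded contraction is emitted in lower case.
    """
    expanded: List[str] = []
    for token in tokens:
        expanded.extend(CONTRACTIONS.get(token.lower(), [token]))
    return expanded


tokens = whitespace_tokens("I can't open the ice box")
print(tokens)
# -> ['I', "can't", 'open', 'the', 'ice', 'box']
print(expand_contractions(tokens))
# -> ['I', 'can', 'not', 'open', 'the', 'ice', 'box']
```

Splitting on spaces keeps "can't" as a single token and separates the open compound "ice box" into two tokens. Whether either result counts as correct depends on how the task defines a word.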