Written Word Segmentation Task
A Written Word Segmentation Task is a text segmentation task that is a word identification task.
- Context:
- It can be solved by a Written Word Segmentation System (that implements a Written Word Segmentation Algorithm).
- ...
- Example(s):
- WWST("I'm coming home") ⇒ ([I] ['m] [coming] [home]).
- WWST("I bought a real time operating system") ⇒ ([I] [bought] [a] [real time] [operating system]).
- WWST("日文章魚怎麼說") ⇒ ([日文] [章魚] [怎麼] [說]) (i.e. ~[Japanese] [octopus] [how] [say]).
- WWST("Famous notaries public include ex-attorney generals.") ⇒ ([Famous] [notaries public] [include] [ex-] [attorney generals]).
- WWST("Der Lebensversicherungsgesellschaftsangestellte kam gestern mit seinem Deutscher Schäferhund.") (~The life insurance company employee came yesterday with their German Shepherd) ⇒ ([Der] [Lebensversicherungs] [gesellschafts] [angestellte] [kam] [gestern] [mit] [seinem] [Deutscher Schäferhund]).
- notice that both "life insurance" and "insurance company" may exist in a lexicon, but "... [life] [insurance company] ..." is an incorrect segmentation.
- WWST("The ex-governor general's sisters-in-law saw the wolves' den near Mr. Smith's home in Sault Ste. Marie.") ⇒ ([The] [ex-] [governor general's] [sisters-in-law] [saw] [the] [wolves'] [den] [near] [Mr. Smith's] [home] [in] [Sault Ste. Marie]).
- any Term Mention Segmentation Task.
- any Entity Mention Detection Task.
- WWST("#Imcominghome") ⇒ ([#] [I] [m] [coming] [home]).
- more Word Segmentation Task Examples.
- ...
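The dictionary-driven examples above can be illustrated with a minimal greedy longest-match ("maximum matching") segmenter. This is a sketch under assumptions: the four-entry lexicon is a toy stand-in, and a real Written Word Segmentation System would combine a much larger lexicon with statistical or neural disambiguation (which is also what the [life] [insurance company] pitfall above requires).

```python
def max_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation: at each position, take the
    longest substring found in the lexicon (falling back to a single
    character when nothing matches)."""
    words = []
    i = 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Toy lexicon covering only the Chinese example above (an assumption).
lexicon = {"日文", "章魚", "怎麼", "說"}
print(max_match("日文章魚怎麼說", lexicon))
# → ['日文', '章魚', '怎麼', '說']
```

Greedy matching is fast but commits to the longest entry even when a shorter one would yield the correct overall segmentation, which is exactly the ambiguity the compound-noun examples above highlight.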
- See: Orthographic Word Segmentation Task, Syllabification, Space (Punctuation), Word Divider, Delimiter.
References
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation Retrieved:2022-3-21.
- Word segmentation is the problem of dividing a string of written language into its component words.
In English and many other languages using some form of the Latin alphabet, the space is a good approximation of a word divider (word delimiter), although this concept has limits because of the variability with which languages emically regard collocations and compounds. Many English compound nouns are variably written (for example, ice box = ice-box = icebox; pig sty = pig-sty = pigsty) with a corresponding variation in whether speakers think of them as noun phrases or single nouns; there are trends in how norms are set, such as that open compounds often tend eventually to solidify by widespread convention, but variation remains systemic. In contrast, German compound nouns show less orthographic variation, with solidification being a stronger norm.
However, the equivalent to the word space character is not found in all written scripts, and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese, where sentences but not words are delimited, Thai and Lao, where phrases and sentences but not words are delimited, and Vietnamese, where syllables but not words are delimited.
In some writing systems however, such as the Ge'ez script used for Amharic and Tigrinya among other languages, words are explicitly delimited (at least historically) with a non-whitespace character.
The Unicode Consortium has published a Standard Annex on Text Segmentation, [1] exploring the issues of segmentation in multiscript texts.
Word splitting is the process of parsing concatenated text (i.e. text that contains no spaces or other word separators) to infer where word breaks exist.
Word splitting may also refer to the process of hyphenation.
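The word-splitting problem described above (inferring breaks in text with no separators) can be sketched as a memoized backtracking search over a lexicon. The lexicon here is a toy assumption chosen to echo the compound examples earlier; trying longer words first mirrors the tendency of open compounds to "solidify" into single written words.

```python
from functools import lru_cache

def split_words(text, lexicon):
    """Return one segmentation of concatenated text into lexicon
    words, or None if no complete segmentation exists."""
    @lru_cache(maxsize=None)
    def rec(i):
        if i == len(text):
            return []
        for j in range(len(text), i, -1):  # prefer longer words first
            if text[i:j] in lexicon:
                rest = rec(j)
                if rest is not None:
                    return [text[i:j]] + rest
        return None
    return rec(0)

# Toy lexicon (an assumption for illustration).
lexicon = {"ice", "box", "icebox", "pig", "sty"}
print(split_words("iceboxpigsty", lexicon))
# → ['icebox', 'pig', 'sty']
```

Unlike the greedy approach, backtracking recovers when a longer match leaves an unsegmentable remainder, at the cost of exploring more candidate splits.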
2009
- (Wikipedia, 2009) ⇒ http://en.wikipedia.org/wiki/Text_segmentation#Word_segmentation
- Word segmentation is the problem of dividing a string of written language into its component words. In English and many other modern languages using some form of the Latin alphabet dividing text using the space character is a good approximation to word segmentation. (Some examples where the space character alone may not be sufficient include contractions like can't for can not.) However the equivalent to this character is not found in all written scripts and without it word segmentation is a difficult problem. Languages which do not have a trivial word segmentation process include Chinese, Japanese and Thai.
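The space-as-approximation point, and the contraction caveat, can be seen directly in code. The regex below is a rough illustrative heuristic (an assumption, not a standard tokenizer): it separates punctuation and splits common English contractions at the apostrophe.

```python
import re

sentence = "I'm coming home, but I can't stay."

# Naive whitespace splitting keeps punctuation and contractions
# attached to their neighboring words:
print(sentence.split())
# → ["I'm", 'coming', 'home,', 'but', 'I', "can't", 'stay.']

# A rough heuristic: word characters, apostrophe-led suffixes
# (contraction clitics), or single punctuation marks.
tokens = re.findall(r"\w+|'\w+|[^\w\s]", sentence)
print(tokens)
# → ['I', "'m", 'coming', 'home', ',', 'but', 'I', 'can', "'t", 'stay', '.']
```

Even this small improvement shows why space-delimited languages still need a genuine segmentation step rather than a bare `split()`.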