A word instance is a linguistic utterance of a linguistic word (produced at some place and time).
- AKA: Terminal Word Mention, Surface Word.
- It can be within a linguistic expression instance.
- It can range from being an Atomic Word Instance (e.g. "[Fire]!", a one-word utterance) to being an Embedded Word Instance (e.g. "The [fire] is out.", one within a larger linguistic utterance).
- It can have a Word Sense.
- It cannot be divided into smaller units without changing some of its Intension or Extension. E.g. "tea bag" cannot be rearranged into "bag of tea" without a change in meaning.
- It can be associated with a Word Mention Boundary.
- It can be:
- a Written Word/Written Word Mention (a Grapheme String with a Word Spelling).
- a Spoken Word/Spoken Word Mention (a Phoneme String with a Word Pronunciation).
- a Signed Word with a Word Gesticulation.
- It can range from being an Ambiguous Word Mention to being an Unambiguous Word Mention.
- It can be identified by a Word Segmentation Task.
- It can be mapped to a Word Form (such as a Word Form Record) by a Word Mention Normalization Task.
- It can have a Word Mention Lemma (or really, be mapped to a Lexeme Lemma).
- Example(s):
  - all of the word mentions written in this concept description.
  - "New York" in "[New York]-based Jamaicans are racking up the minutes."
  - any Verb Mention.
  - any Noun Mention.
- Counter-Example(s):
  - a Word Form, which is an uninstantiated Abstract Concept.
  - a Concept Mention, which can be composed of more than one word mention.
  - a Terminal Symbol.
- See: Lemmatisation Task, Word Type, Linguistic Agent, Orthographic Word Mention.
- (Mikheev, 2003) ⇒ Andrei Mikheev. (2003). “Text Segmentation.” In: (Mitkov, 2003).
- QUOTE: The first step in the majority of text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary can occur many times in the text but it is still a single individual word of the language. So there is a distinction between words of the vocabulary, or word types, and multiple occurrences of these words in the text, which are called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important, it is usual to refer to both as 'words' whenever the context unambiguously implies the interpretation.
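- The token/type distinction and the word-boundary notion discussed above can be sketched with a minimal tokenizer. This is a naive regex-based illustration, not a production segmenter; the `tokenize` helper and its `\w+` word pattern are assumptions for the sketch (real segmenters must handle clitics, hyphens, multi-word expressions like "New York", etc.):

```python
import re
from collections import Counter

def tokenize(text):
    """Segment text into word tokens (word mentions), each paired with its
    character-offset boundary (a Word Mention Boundary) in the source text."""
    # \w+ is a deliberately naive word pattern, used here only for illustration.
    return [(m.group(), (m.start(), m.end()))
            for m in re.finditer(r"\w+", text)]

text = "The fire is out. The fire was small."
tokens = tokenize(text)

# Word tokens: every occurrence counts separately.
print([t for t, _ in tokens])  # 8 tokens; "The" and "fire" each appear twice

# Word types: distinct vocabulary items (here, identified by case-folding,
# a crude stand-in for mapping each word mention to a Word Form).
types = Counter(t.lower() for t, _ in tokens)
print(sorted(types))  # 6 types
```

The same span information is what a Word Segmentation Task produces, and the case-folding step is a toy version of a Word Mention Normalization Task mapping each mention to a shared form.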