- AKA: Terminal Word Mention, Surface Word.
- It can occur within a Linguistic Expression Instance.
- It can range from being an Atomic Word Instance (e.g. "[fire]!") to being an Embedded Word Instance (e.g. "The [fire] is out."), within another linguistic utterance.
- It can have a Word Sense.
- It cannot be divided into smaller units without changing some of its Intension or Extension, e.g. "tea bag" ≠ "bag of tea".
- It can be associated with a Word Mention Boundary.
- It can be identified by a Word Segmentation Task.
- It can be mapped to a Word Form (such as a Word Form Record) by a Word Mention Normalization Task.
- It can have a Word Mention Lemma (or really, be mapped to a Lexeme Lemma).
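The properties above (boundaries, segmentation, and normalization to a word form) can be sketched with a minimal regex-based segmenter; this is an illustrative toy, not a production Word Segmentation Task, and the function name is a hypothetical one chosen for this example:

```python
import re

def word_mentions(text):
    """Identify word mentions (tokens) together with their boundaries.

    Each mention is returned as (surface form, start offset, end offset);
    the offsets are the Word Mention Boundary. A real segmenter would also
    handle clitics, hyphenation, and multiword units such as "tea bag".
    """
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+(?:'\w+)?", text)]

for surface, start, end in word_mentions("The fire is out."):
    # lowercasing is a naive stand-in for a Word Mention Normalization Task
    print(surface, (start, end), surface.lower())
```

Mapping each mention to its lemma (e.g. "is" → "be") would additionally require a Lemmatisation Task, which this sketch does not attempt.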
- See: Lemmatisation Task, Word Type, Linguistic Agent, Orthographic Word Mention.
- (Mikheev, 2003) ⇒ Andrei Mikheev. (2003). “Text Segmentation.” In: (Mitkov, 2003).
- QUOTE: The first step in the majority of text processing applications is to segment text into words. The term 'word', however, is ambiguous: a word from a language's vocabulary can occur many times in the text but it is still a single individual word of the language. So there is a distinction between words of the vocabulary, or word types, and multiple occurrences of these words in the text, which are called word tokens. This is why the process of segmenting word tokens in text is called tokenization. Although the distinction between word types and word tokens is important, it is usual to refer to both as 'words' whenever the context unambiguously implies the interpretation.
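The type/token distinction in the quote can be made concrete with a few lines of Python; the sample sentence is invented for illustration:

```python
from collections import Counter

text = "the fire is out and the fire was hot"
tokens = text.split()   # word tokens: every occurrence in the text
types = set(tokens)     # word types: distinct vocabulary items

print(len(tokens))               # 9 tokens
print(len(types))                # 7 types
print(Counter(tokens)["fire"])   # the type "fire" has 2 token occurrences
```

Whitespace splitting stands in for tokenization here; as the quote notes, segmenting word tokens in real text is a harder problem than this.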