n-Gram Tuple

From GM-RKB

An n-Gram Tuple is a tuple of n items (such as characters or words) that represents a subsequence of a string.
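As a rough illustration of the definition, the sketch below slides a window of size n over a sequence and emits each window as a tuple. The function name `ngram_tuples` is hypothetical, and the sketch assumes contiguous windows only, which is one common reading of "subsequence" rather than the only one.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def ngram_tuples(items: Sequence[T], n: int) -> list[tuple[T, ...]]:
    """Return every contiguous window of n items as a tuple.

    `items` may be a string (character n-grams) or a list of tokens
    (word n-grams). A sequence shorter than n yields an empty list.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # There are len(items) - n + 1 windows; range() is empty if that is <= 0.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


if __name__ == "__main__":
    print(ngram_tuples("TEXT", 2))
    print(ngram_tuples(["the", "quick", "brown", "fox"], 3))
```

Passing a string yields character-level tuples, while passing a list of tokens yields word-level tuples, so the same helper covers both uses.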



References

1994

  • (Cavnar & Trenkle, 1994) ⇒ William B. Cavnar, and John M. Trenkle. (1994). “N-gram-based Text Categorization.” In: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
    • QUOTE: An N-gram is an N-character slice of a longer string. Although in the literature the term can include the notion of any co-occurring set of characters in a string (e.g., an N-gram made up of the first and third character of a word), in this paper we use the term for contiguous slices only. Typically, one slices the string into a set of overlapping N-grams. In our system, we use N-grams of several different lengths simultaneously. We also append blanks to the beginning and ending of the string in order to help with matching beginning-of-word and ending-of-word situations. (We will use the underscore character (“_”) to represent blanks.) Thus, the word “TEXT” would be composed of the following N-grams:
      • bi-grams: _T, TE, EX, XT, T_
      • tri-grams: _TE, TEX, EXT, XT_, T_ _
      • quad-grams: _TEX, TEXT, EXT_, XT_ _, T_ _ _
    • In general, a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams, and so on.
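The padding scheme in the quote can be sketched in a few lines of Python. This is an illustration inferred from the "TEXT" example, not the authors' code: it assumes one leading blank and n-1 trailing blanks, which reproduces the listed N-grams and gives k+1 of them for a string of length k. The function name `padded_ngrams` is hypothetical, and the output writes padding underscores adjacently ("T__") where the quote spaces them ("T_ _").

```python
def padded_ngrams(word: str, n: int, pad: str = "_") -> list[str]:
    """Return the contiguous character n-grams of `word` after padding.

    Assumed padding (inferred from the quoted example): one `pad`
    character before the word and n - 1 after it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = pad + word + pad * (n - 1)
    # The padded string has len(word) + n characters,
    # so it contains len(word) + 1 windows of width n.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for n in (2, 3, 4):
        grams = padded_ngrams("TEXT", n)
        print(n, grams, len(grams))
```

For "TEXT" (k = 4), each value of n yields five N-grams, matching the k+1 count stated above.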