k-Skip n-Gram

From GM-RKB
Jump to navigation Jump to search

A k-Skip n-Gram is a subset of an unordered n-gram that can involve a skip operation of size [math]\displaystyle{ k }[/math].



References

2017

2015

  • (Wikipedia, 2015) ⇒ http://en.wikipedia.org/wiki/N-gram#Skip-Gram Retrieved:2015-2-6.
    • In the field of computational linguistics, in particular language modeling, skip-grams[1] are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.[2] They provide one way of overcoming the data sparsity problem found with conventional n-gram analysis.

      Formally, an n-gram is a consecutive subsequence of length n of some sequence of tokens w₁ … wₙ. A k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.

      For example, in the input text:

            the rain in Spain falls mainly on the plain

      the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

            the in, rain Spain, in falls, Spain mainly, mainly the and on plain.

      Recently, Mikolov et al. (2013) have demonstrated that skip-gram language models can be trained so that it is possible to do ″word arithmetic". In their model, for example the expression

            king - man + woman

      evaluates very close to queen.[3]

2013

2006

  • (Guthrie et al., 2006) ⇒ David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. (2006). “A Closer Look at Skip-gram Modelling.” In: Proceedings of the 5th international Conference on Language Resources and Evaluation (LREC-2006).
    • QUOTE: We define k-skip-n-grams for a sentence [math]\displaystyle{ w_1 ... w_m }[/math] to be the set : [math]\displaystyle{ \{ w_{i_1}, w_{i_2}, ... w_{i_n} \mid \Sigma_{j=1}^n i_j - i_{j-1} \lt k \} }[/math] Skip-grams reported for a certain skip distance [math]\displaystyle{ k }[/math] allow a total of [math]\displaystyle{ k }[/math] or less skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).

      Here is an actual sentence example showing 2-skip-bigrams and tri-grams compared to standard bi-grams and trigrams consisting of adjacent words for the sentence:

            “Insurgents killed in ongoing fighting.”

      Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.

      2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

      Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.

      2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.

      In this example, over three times as many 2-skip-tri-grams were produced than adjacent tri-grams and this trend continues the more skips that are allowed. A typical sentence of ten words, for example, will produce 8 trigrams, but 80 4-skip-tri-grams. Sentences that are 20 words long have 18 tri-grams and 230 4-skip-tri-grams (see Table 1).