2006 ACloserLookatSkipGramModelling

From GM-RKB

Subject Headings: Skip-Gram Model.

Notes

Cited By

Quotes

Abstract

Data sparsity is a large problem in natural language processing that refers to the fact that language is a system of rare events, so varied and complex, that even using an extremely large corpus, we can never accurately model all possible strings of words. This paper examines the use of skip-grams (a technique whereby n-grams are still stored to model language, but tokens are allowed to be skipped) to overcome the data sparsity problem. We analyze this by computing all possible skip-grams in a training corpus and measuring how many adjacent (standard) n-grams these cover in test documents. We examine skip-gram modelling using one to four skips with various amounts of training data and test against similar documents as well as documents generated from a machine translation system. In this paper we also determine the amount of extra training data required to achieve skip-gram coverage using standard adjacent tri-grams.

1. Introduction

Recent corpus-based trends in language processing rely on a single premise: that language is its own best model and that sufficient data can be gathered to depict typical (or atypical) language use accurately (Young and Chase, 1998; Church, 1998; Brown, 1990). The chief problem for this central tenet of modern language processing is the data sparsity problem: language is a system of rare events, so varied and complex, that we can never model all possibilities. Language modelling research uses smoothing techniques to model these unseen sequences of words, yet even with 30 years' worth of newswire text, more than one third of all trigrams have not been seen (Allison et al., 2006).

It therefore falls to the linguist to exploit the available data to the maximum extent possible. Various attempts have been made to do this, but they largely consist of defining and manipulating data beyond the words in the text (part-of-speech tags, syntactic categories, etc.) or of using some form of smoothing to estimate the probability of unseen text. This paper, however, posits another approach to obtaining a better model of the training data that relies only on the words used: the idea of skip-grams.

Skip-grams are a technique largely used in the field of speech processing, whereby n-grams are formed (bi-grams, tri-grams, etc.) but, in addition to allowing adjacent sequences of words, tokens are allowed to be “skipped”. While initially applied to phonemes in human speech, the same technique can be applied to words. For example, the sentence “I hit the tennis ball” has three word-level trigrams: “I hit the”, “hit the tennis” and “the tennis ball”. However, one might argue that an equally important trigram implied by the sentence, but not normally captured in that way, is “hit the ball”. Using skip-grams allows the word “tennis” to be skipped, enabling this trigram to be formed. Skip-grams have been used in many different ways in language modelling, but often in conjunction with other modelling techniques or with the goal of decreasing perplexity (Goodman, 2001; Rosenfeld, 1994; Ney et al., 1994; Siu and Ostendorf, 2000).
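To make the tennis example concrete, here is a minimal Python sketch (an illustration of ours, not code from the paper) that lists the adjacent word-level trigrams of the sentence and confirms that allowing one skipped token also yields “hit the ball”:

    from itertools import combinations

    sentence = "I hit the tennis ball".split()

    # Standard (adjacent) trigrams: windows of three consecutive words.
    trigrams = [tuple(sentence[i:i + 3]) for i in range(len(sentence) - 2)]
    print(trigrams)
    # [('I', 'hit', 'the'), ('hit', 'the', 'tennis'), ('the', 'tennis', 'ball')]

    # Trigrams allowing at most one skipped token between the chosen words.
    skip_trigrams = [
        tuple(sentence[i] for i in idx)
        for idx in combinations(range(len(sentence)), 3)
        if idx[-1] - idx[0] - 2 <= 1  # number of tokens skipped inside the span
    ]
    print(('hit', 'the', 'ball') in skip_trigrams)  # True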

The focus of this paper is to quantify the impact skip-gram modelling has on the coverage of trigrams in real text and compare this to coverage obtained by increasing the size of the corpus used to build a traditional language model.

2. Defining skip-grams

We define k-skip-n-grams for a sentence [math]\displaystyle{ w_1 … w_m }[/math] to be the set

[math]\displaystyle{ \{ w_{i_1}, w_{i_2}, \dots, w_{i_n} \mid \sum_{j=1}^{n} i_j - i_{j-1} \lt k \} }[/math]

Skip-grams reported for a certain skip distance [math]\displaystyle{ k }[/math] allow a total of [math]\displaystyle{ k }[/math] or fewer skips to construct the n-gram. As such, “4-skip-n-gram” results include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (typical n-grams formed from adjacent words).
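A minimal Python sketch of this definition (the function name k_skip_ngrams is ours, not from the paper): it enumerates every in-order choice of n tokens and keeps those whose total number of skipped tokens is at most k, so k = 0 recovers the ordinary adjacent n-grams.

    from itertools import combinations


    def k_skip_ngrams(tokens, n, k):
        """Return all k-skip-n-grams of `tokens` as a list of word tuples.

        A candidate is any strictly increasing choice of n positions; it is
        kept when the number of tokens skipped inside its span is at most k.
        With k = 0 this reduces to the ordinary adjacent n-grams.
        """
        result = []
        for idx in combinations(range(len(tokens)), n):
            skipped = (idx[-1] - idx[0]) - (n - 1)  # unchosen tokens inside the span
            if skipped <= k:
                result.append(tuple(tokens[i] for i in idx))
        return result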

Here is an example showing the 2-skip-bi-grams and 2-skip-tri-grams, compared with the standard bi-grams and tri-grams of adjacent words, for the sentence:

“Insurgents killed in ongoing fighting.”

Bi-grams = {insurgents killed, killed in, in ongoing, ongoing fighting}.

2-skip-bi-grams = {insurgents killed, insurgents in, insurgents ongoing, killed in, killed ongoing, killed fighting, in ongoing, in fighting, ongoing fighting}

Tri-grams = {insurgents killed in, killed in ongoing, in ongoing fighting}.

2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing, insurgents killed fighting, insurgents in ongoing, insurgents in fighting, insurgents ongoing fighting, killed in ongoing, killed in fighting, killed ongoing fighting, in ongoing fighting}.
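Running the k_skip_ngrams sketch above over this sentence reproduces the sizes of these sets (the enumeration order may differ from the hand-written listings):

    tokens = "Insurgents killed in ongoing fighting".split()

    print(len(k_skip_ngrams(tokens, 2, 0)))  # 4 adjacent bi-grams
    print(len(k_skip_ngrams(tokens, 2, 2)))  # 9 2-skip-bi-grams
    print(len(k_skip_ngrams(tokens, 3, 0)))  # 3 adjacent tri-grams
    print(len(k_skip_ngrams(tokens, 3, 2)))  # 10 2-skip-tri-grams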

In this example, over three times as many 2-skip-tri-grams were produced as adjacent tri-grams, and this ratio grows as more skips are allowed. A typical sentence of ten words, for example, will produce 8 tri-grams but 80 4-skip-tri-grams, while a 20-word sentence has 18 tri-grams and 230 4-skip-tri-grams (see Table 1).
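The quoted counts for 10- and 20-word sentences can be checked with the same sketch (the counts are those cited from the paper's Table 1; the placeholder tokens below are ours):

    for length in (10, 20):
        toks = ["w%d" % i for i in range(length)]
        print(length,
              len(k_skip_ngrams(toks, 3, 0)),   # adjacent tri-grams
              len(k_skip_ngrams(toks, 3, 4)))   # 4-skip-tri-grams
    # 10 -> 8 tri-grams, 80 4-skip-tri-grams
    # 20 -> 18 tri-grams, 230 4-skip-tri-grams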

References

David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. (2006). “A Closer Look at Skip-gram Modelling.” In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006).