1999 FoundationsOfStatisticalNLP


Subject Headings: Statistical NLP, Word Sense Disambiguation, Collocation, Lexical Acquisition, Markov Model, Part-of-Speech Tagging, Probabilistic Context Free Grammar, Probabilistic Parsing, Machine Translation, Information Retrieval, Text Categorization.


Cited By


Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

Table of Contents

I Preliminaries

1 Introduction
1.1 Rationalist and Empiricist Approaches to Language
1.2 Scientific Content
1.3 The Ambiguity of Language: Why NLP Is Difficult
1.4 Dirty Hands
1.5 Further Reading
1.6 Exercises
2 Mathematical Foundations
2.1 Elementary Probability Theory
2.2 Essential Information Theory
2.3 Further Reading
3 Linguistic Essentials
3.1 Parts of Speech and Morphology
3.2 Phrase Structure
3.3 Semantics and Pragmatics
3.4 Other Areas
3.5 Further Reading
3.6 Exercises
4 Corpus-Based Work
4.1 Getting Set Up
4.2 Looking at Text
4.3 Marked-Up Data
4.4 Further Reading
4.5 Exercises
II Words
5 Collocations
5.1 Frequency
5.2 Mean and Variance
5.3 Hypothesis Testing
5.4 Mutual Information
5.5 The Notion of Collocation
5.6 Further Reading
6 Statistical Inference: n-gram Models over Sparse Data
6.1 Bins: Forming Equivalence Classes
6.2 Statistical Estimators
6.3 Combining Estimators
6.4 Conclusions
6.5 Further Reading
6.6 Exercises
7 Word Sense Disambiguation
7.1 Methodological Preliminaries
7.2 Supervised Disambiguation
7.3 Dictionary-Based Disambiguation
7.4 Unsupervised Disambiguation
7.5 What Is a Word Sense?
7.6 Further Reading
7.7 Exercises
8 Lexical Acquisition
8.1 Evaluation Measures
8.2 Verb Subcategorization
8.3 Attachment Ambiguity
8.4 Selectional Preferences
8.5 Semantic Similarity
8.6 The Role of Lexical Acquisition in Statistical NLP
8.7 Further Reading
III Grammar
9 Markov Models
9.1 Markov Models
9.2 Hidden Markov Models
9.3 The Three Fundamental Questions for HMMs
9.4 HMMs: Implementation, Properties, and Variants
9.5 Further Reading
10 Part-of-Speech Tagging
10.1 The Information Sources in Tagging
10.2 Markov Model Taggers
10.3 Hidden Markov Model Taggers
10.4 Transformation-Based Learning of Tags
10.5 Other Methods, Other Languages
10.6 Tagging Accuracy and Uses of Taggers
10.7 Further Reading
10.8 Exercises
11 Probabilistic Context Free Grammars
11.1 Some Features of PCFGs
11.2 Questions for PCFGs
11.3 The Probability of a String
11.4 Problems with the Inside-Outside Algorithm
11.5 Further Reading
11.6 Exercises
12 Probabilistic Parsing
12.1 Some Concepts
12.2 Some Approaches
12.3 Further Reading
12.4 Exercises
IV Applications and Techniques
13 Statistical Alignment and Machine Translation
13.1 Text Alignment
13.2 Word Alignment
13.3 Statistical Machine Translation
13.4 Further Reading
14 Clustering
14.1 Hierarchical Clustering
14.2 Non-Hierarchical Clustering
14.3 Further Reading
14.4 Exercises
15 Topics in Information Retrieval
15.1 Some Background on Information Retrieval
15.2 The Vector Space Model
15.3 Term Distribution Models
15.4 Latent Semantic Indexing
15.5 Discourse Segmentation
15.6 Further Reading
15.7 Exercises
16 Text Categorization
16.1 Decision Trees
16.2 Maximum Entropy Modeling
16.3 Perceptrons
16.4 k Nearest Neighbor Classification
16.5 Further Reading

3 Linguistic Essentials

3.1 Parts of Speech and Morphology

Linguists group the words of a language into classes (sets) which show similar syntactic behavior, and often a typical semantic type. These word classes are otherwise called syntactic or grammatical categories, but are more commonly still known by the traditional name parts of speech (POS).

Normally the various parts of speech for a word are listed in an online dictionary, otherwise known as a lexicon.

Word categories are systematically related by morphological processes such as the formation of the plural form (dog-s) from the singular form of the noun (dog). Morphology is important in NLP because language is productive: in any given text we will encounter words and word forms that we haven't seen before and that are not in our precompiled dictionary. Many of these new words are morphologically related to known words. So if we understand morphological processes, we can infer a lot about the syntactic and semantic properties of new words.

The major types of morphological processes are inflection, derivation, and compounding. Inflections are the systematic modifications of a root form by means of prefixes and suffixes to indicate grammatical distinctions like singular and plural. Inflection does not change word class or meaning significantly, but varies features such as tense, number, and plurality. All the inflectional forms of a word are often grouped as manifestations of a single lexeme.

Derivation is less systematic. It usually results in a more radical change of syntactic category, and it often involves a change in meaning. An example is the derivation of the adverb widely from the adjective wide (by appending the suffix -ly). Widely in a phrase like it is widely believed means 'among a large well-dispersed group of people,' a shift from the core meaning of wide ('extending over a vast area'). Adverb formation is also less systematic than plural inflection. Some adjectives like old or difficult don't have adverbs: *oldly and *difficultly are not words of English. Here are some other examples of derivations: the suffix -en transforms adjectives into verbs (weak-en, soft-en), the suffix -able transforms verbs into adjectives (understand-able, accept-able), and the suffix -er transforms verbs into nouns (teach-er, lead-er).

Compounding refers to the merging of two or more words into a new word. English has many noun-noun compounds, nouns that are combinations of two other nouns. Examples are tea kettle, disk drive, or college degree. While these are (usually) written as separate words, they are pronounced as a single word, and denote a single semantic concept, which one would normally wish to list in the lexicon. There are also other compounds that involve parts of speech such as adjectives, verbs, and prepositions, such as downmarket, (to) overtake, and mad cow disease.


Hyphenation: Different forms representing the same word.

Do sequences of letters with a hyphen in between count as one word or two? Again, the intuitive answer seems to be sometimes one, sometimes two. This reflects the many sources of hyphens in texts.

One source is typographical. Words have traditionally been broken and hyphens inserted to improve justification of text.

Some things with hyphens are clearly best treated as single words, such as e-mail, co-operate, or A-1-plus (as in A-1-plus commercial paper, a financial rating). Other cases are more arguable, although we usually want to regard them as single words, for example, non-lawyer, pro-Arab, and so-called. The hyphens here might be termed lexical hyphens. They are commonly inserted before or after small word formatives, sometimes for the purpose of splitting up vowel sequences.

The third class of hyphens is ones inserted to help indicate the correct grouping of words. A common copy-editing practice is to hyphenate compound pre-modifiers, as in the example discussed earlier or in examples like these:

  • 4.1a the once-quiet study of superconductivity.
  • 4.1b a tough regime of business-conduct rules
  • 4.1c the aluminum-export ban.
  • 4.1d a text-based medium.

And hyphens occur in other places, where a phrase is seen as in some sense quotative or as expressing a quantity or rate:

  • 4.2a the idea of a child-as-required-yuppie-possession must be motivating them.
  • 4.2b a final "take-it-or-leave-it" offer.
  • 4.2c the 90-cent-an-hour raise.
  • 4.2d the 26-year-old

In these cases, we would probably want to treat the things joined by hyphens as separate words.

Note that this means that we will often have multiple forms, perhaps some treated as one word and others as two, for what is best thought of as a single lexeme (a single dictionary entry with a single meaning).
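The hyphenation cases above can be turned into a simple tokenization heuristic. The sketch below is an illustration, not the book's algorithm: it assumes (as a rough rule of thumb) that two-part hyphenations like so-called are lexical hyphens to be kept as single words, while longer quotative chains like take-it-or-leave-it are split into their parts. The function name and the punctuation-stripping step are my own choices for the example.

```python
def tokenize(text):
    """Whitespace tokenizer with a hyphen heuristic (a sketch, not
    the book's method): two-part hyphenations ('so-called') stay as
    one word; longer chains ('take-it-or-leave-it') are split."""
    tokens = []
    for tok in text.split():
        tok = tok.strip('.,"\'')                  # shed edge punctuation
        parts = tok.split('-')
        if len(parts) <= 2:
            tokens.append(tok)                    # lexical hyphen: one word
        else:
            tokens.extend(p for p in parts if p)  # quotative chain: split
    return tokens

print(tokenize('a final "take-it-or-leave-it" offer by a so-called non-lawyer'))
```

Real tokenizers would of course need further rules (and a lexicon) to handle cases like New York-New Haven, discussed below, where the heuristic fails.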

Word segmentation in other languages

Many languages do not put spaces in between words at all, and so the basic word division algorithm of breaking on whitespace is of no use at all. Such languages include major East Asian languages/scripts such as Chinese, Japanese, and Thai. Ancient Greek was also written by Ancient Greeks without word spaces. Spaces were introduced (together with accent marks, etc.) by those who came afterwards. In such languages, word segmentation is a much more major and challenging task.

While otherwise maintaining word spaces, German writes compound nouns as single words, for example Lebensversicherungsgesellschaftsangestellter 'life insurance company employee.' In many ways this makes linguistic sense, as compounds are a single word, at least phonologically. But for processing purposes one may wish to divide such a compound, or at least to be aware of the internal structure of the word, and this becomes a limited word segmentation task. While not the rule, joining of compounds sometimes also happens in English, especially when they are common and have a specialized meaning. We noted above that one finds both data base and database. As another example, while hard disk is more common, one sometimes finds harddisk in the computer press.
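This limited word segmentation task can be sketched as a recursive lookup against a lexicon. The toy splitter below is an assumption-laden illustration (the tiny lexicon, the greedy longest-prefix order, and the handling of only the linking element 's' are all simplifications; practical splitters use frequency-weighted search over full morphological analyses):

```python
def split_compound(word, lexicon):
    """Recursively split a compound into known lexicon words,
    allowing an optional German linking 's' between parts.
    Returns a list of parts, or None if no analysis is found."""
    word = word.lower()
    if word in lexicon:
        return [word]
    for i in range(len(word) - 1, 2, -1):   # try longest prefix first
        head = word[:i]
        if head in lexicon:
            rest = word[i:]
            if rest.startswith('s'):        # linking 's' (Fugen-s)
                sub = split_compound(rest[1:], lexicon)
                if sub:
                    return [head] + sub
            sub = split_compound(rest, lexicon)
            if sub:
                return [head] + sub
    return None

lexicon = {'leben', 'versicherung', 'gesellschaft', 'angestellter'}
print(split_compound('Lebensversicherungsgesellschaftsangestellter', lexicon))
```

The same routine handles the English harddisk case given a lexicon containing hard and disk.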

Whitespace not indicating a word break

Until now, the problems we have dealt with have mainly involved splitting apart sequences of characters where the word divisions are not shown by whitespace. But the opposite problem of wanting to lump things together also occurs: things are separated by whitespace but we may wish to regard them as a single word. One possible case is the reverse of the German compound problem. If one decides to treat database as one word, one may wish to treat it as one word even when it is written as data base. More common cases are things such as phone numbers, where we may wish to regard 9465 1873 as a single 'word,' or multi-part names such as New York or San Francisco. An especially difficult case is when this problem interacts with hyphenation, as in a phrase like this one:

  • (4.3) the New York-New Haven railroad.

Here the hyphen does not express grouping of just the immediately adjacent graphic words: treating York-New as a semantic unit would be a big mistake.

Other cases are of more linguistic interest. For many purposes, one would want to regard phrasal verbs (make up, work out) as single lexemes (section 3.1.4), but this case is especially tricky since in many cases the particle is separable from the verb (I couldn't work the answer out), and so in general identification of possible phrasal verbs will have to be left to subsequent processing. One might also want to treat as a single lexeme certain other fixed phrases, such as in spite of, in order to, and because of, but typically a tokenizer will regard them as separate words. A partial implementation of this approach occurs in the LOB corpus, where certain pairs of words such as because of are tagged with a single part of speech (here, preposition) by means of so-called ditto tags.


Another question is whether one wants to keep word forms like sit, sits and sat separate or to collapse them. The issues here are similar to those in the discussion of capitalization, but have traditionally been regarded as more linguistically interesting. At first, grouping such forms together and working in terms of lexemes feels as if it is the right thing to do. Doing this is usually referred to in the literature as stemming, in reference to a process that strips off affixes and leaves you with a stem. Alternatively, one may try to find the lemma or lexeme of which one is looking at an inflected form. These latter terms imply disambiguation at the level of lexemes, such as whether a use of lying represents the verb lie-lay 'to prostrate oneself' or lie-lied 'to fib.'
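Suffix stripping of this kind can be sketched in a few lines. The toy stemmer below is not the Porter algorithm or any published stemmer; the suffix list, minimum stem length, and consonant-undoubling rule are illustrative assumptions. Note how it collapses sits and sitting onto sit but misses the irregular form sat, which is exactly the gap that lexeme-level lemmatization addresses.

```python
def simple_stem(word):
    """Toy suffix-stripping stemmer (illustrative, not Porter):
    strip one common inflectional suffix, then undouble a final
    consonant ('sitt' -> 'sit')."""
    for suffix in ('ing', 'es', 'ed', 's'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in 'aeiou':
                stem = stem[:-1]          # undouble final consonant
            return stem
    return word

print([simple_stem(w) for w in ['sit', 'sits', 'sitting', 'sat']])
```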

Extensive empirical research within the Information Retrieval (IR) community has shown that stemming does not help the performance of classic IR systems when performance is measured as an average over queries (Salton 1989; Hull 1996). There are always some queries for which stemming helps a lot, but there are others where performance goes down. This is a somewhat surprising result, especially from the viewpoint of linguistic intuition, and so it is important to understand why that is. There are three main reasons for this.


A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things. Or in the words of Firth (1957: 181): "Collocations of a given word are statements of the habitual or customary places of that word." Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful. Particularly interesting are the subtle and not-easily-explainable patterns of word usage that native speakers all know: why we say a stiff breeze but not ??a stiff wind (while either a strong breeze or a strong wind is okay), or why we speak of broad daylight (but not ?bright daylight or ??narrow darkness).

Collocations are characterized by limited compositionality. We call a natural language expression compositional if the meaning of the expression can be predicted from the meanings of its parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination. In the case of strong tea, strong has acquired the meaning 'rich in some active agent,' which is closely related to, but slightly different from, the basic sense 'having great physical strength.' Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms.

There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase. As these names suggest, the latter three are commonly used when collocations are extracted from technical domains (in a process called terminology extraction). The reader should be warned, though, that the word term has a different meaning in information retrieval, where it refers to both words and phrases, so it subsumes the narrower meaning that we will use in this chapter.
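One of the collocation measures covered in chapter 5, pointwise mutual information, can be sketched directly from counts. The snippet below scores adjacent word pairs by log2(P(w1 w2) / (P(w1) P(w2))); it is a bare-bones illustration on a made-up token list (real use needs large corpora and frequency cutoffs, since PMI notoriously overrates rare pairs):

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens):
    """Score adjacent word pairs by pointwise mutual information:
    log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts.
    Low-frequency pairs get inflated scores; a count cutoff is the
    usual remedy."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in bigrams.items():
        p_pair = c / (n - 1)
        p_indep = (unigrams[w1] / n) * (unigrams[w2] / n)
        scores[(w1, w2)] = log2(p_pair / p_indep)
    return scores

tokens = 'strong tea and strong support for strong tea'.split()
scores = pmi_bigrams(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

On this toy text the hapax pair (support, for) outranks (strong, tea), which is precisely the rare-pair inflation that motivates the hypothesis-testing measures of sections 5.3 and 5.4.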

8 Lexical Acquisition

8.1 Evaluation Measures

8.2 Verb Subcategorization

8.3 Attachment Ambiguity

8.4 Selectional Preferences

8.5 Semantic Similarity

8.5.2 Probabilistic Measures

… (Dis-)similarity measure: information radius (IRad): [math]\displaystyle{ D(p \vert\vert \frac{p+q}{2}) + D(q \vert\vert \frac{p+q}{2}) }[/math]

KL Divergence

We are already familiar with …

Information Radius

The second measure in table 8.9, information radius (or total divergence to the average, as Dagan et al. (1997b) call it), overcomes both these problems. It is symmetric ([math]\displaystyle{ \operatorname{IRad}(p,q) = \operatorname{IRad}(q,p) }[/math]) and there is no problem with infinite values since [math]\displaystyle{ \frac{p_i+q_i}{2} \ne 0 }[/math] if either [math]\displaystyle{ p_i \ne 0 }[/math] or [math]\displaystyle{ q_i \ne 0 }[/math]. The intuitive interpretation of IRad is that it answers the question: How much information is lost if we describe the two words (or random variables in the general case) that correspond to [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] with their average distribution? IRad ranges from 0 for identical distributions to [math]\displaystyle{ 2\log 2 }[/math] for maximally different distributions (see exercise 8.25). As usual we assume [math]\displaystyle{ 0\log 0 = 0 }[/math].
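The IRad formula transcribes directly into code. The sketch below assumes distributions given as aligned probability lists and uses base-2 logarithms, so the maximum [math]\displaystyle{ 2\log 2 }[/math] comes out as 2 bits; the [math]\displaystyle{ 0\log 0 = 0 }[/math] convention is handled by skipping zero-probability terms:

```python
from math import log2

def kl(p, q):
    """KL divergence D(p || q) in bits, with the 0 log 0 = 0 convention."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Information radius (total divergence to the average):
    D(p || (p+q)/2) + D(q || (p+q)/2). Symmetric, never infinite,
    ranging from 0 (identical) to 2 log 2 = 2 bits (disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

print(irad([0.5, 0.5], [0.5, 0.5]))   # identical distributions -> 0.0
print(irad([1.0, 0.0], [0.0, 1.0]))   # disjoint distributions  -> 2.0
```

Note that the averaged distribution m is what keeps irad finite where plain KL divergence would blow up on a zero in q.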

[math]\displaystyle{ L_1 }[/math] Norm (Manhattan Norm)

A third …

8.6 The Role of Lexical Acquisition in Statistical NLP




Christopher D. Manning, and Hinrich Schütze. (1999). "Foundations of Statistical Natural Language Processing." The MIT Press. http://books.google.com/books?id=YiFDxbEX3SUC