2007 FrameworkAndResourcesForNatLangEval

Jump to: navigation, search

Subject Headings: Natural Language Processing, Parsing, Algorithm Evaluation, Linguistic Resource


Cited By

~1 http://scholar.google.com/scholar?cites=16184430088452087458



Because of the wide variety of contemporary practices used in the automatic syntactic parsing of natural languages, it has become necessary to analyze and evaluate the strengths and weaknesses of different approaches. This research is all the more necessary because there are currently no genre- and domain-independent parsers that are able to analyze unrestricted text with 100% preciseness (I use this term to refer to the correctness of analyses assigned by a parser). All these factors create a need for methods and resources that can be used to evaluate and compare parsing systems. This research describes: (1) A theoretical analysis of current achievements in parsing and parser evaluation. (2) A framework (called FEPa) that can be used to carry out practical parser evaluations and comparisons. (3) A set of new evaluation resources: FiEval is a Finnish treebank under construction, and MGTS and RobSet are parser evaluation resources in English. (4) The results of experiments in which the developed evaluation framework and the two resources for English were used for evaluating a set of selected parsers.

1.2 Syntax and Parsing

Table 1-1. Types of knowledge of language. The levels most relevant to this work are highlighted (Allen 1995).
  • Type of knowledge Function
  • Phonetics and phonology How words are related to the sounds that realize them
  • Morphology How words are constructed from basic units
  • Syntax How words can be put together to form sentences
  • Semantics Meaning of words and sentences
  • Pragmatics How sentences are used in different situations
  • Discourse How preceding sentences affect the interpretation of a succeeding sentence
  • World knowledge General knowledge about the world that language users possess

Morphology is the study of word formation. The morphological processes of a natural language create completely new words or word forms from a root form. Syntax is the linguistic study that describes how a language user combines words to form phrases and sentences. Semantics is the study of the meanings created by words and phrases. It is the purpose of natural language parsers to describe the syntax of the input sentences, usually without any reference to semantics (Sikkel 1997). Some parsers can also perform a morphological analysis to capture the structure of individual words

2 Preprocessing

Before the syntactic analysis can be performed, input sentences must be preprocessed: Firstly, the units (sentences and words) need to be identified by segmentation. Segmentation methods are introduced in Section 2.1. In the second place, it is necessary to perform part-of-speech (POS) tagging and disambiguation (the process of selecting the correct tag from a set of possible tags) (Section 2.2) and a morphological analysis (Section 2.3). In syntactic analysis (Chapters 3 and 4), the syntactic structures of each input sentence are identified and marked. The labels assigned to words to denote their POS, morphological and syntactic roles are called tags. A tagset is the collection of tags used for a particular task.

3.2 Basic Concepts and Formal Properties of Grammars

Definition 3-1. Symbol, terminal and alphabet.
  • A symbol is a distinguishable character, such as “a”, “b” or “c”.
  • Any permissible sequence of symbols is called a terminal (also referred to as a word).
  • A finite, nonempty set [math]\Sigma[/math] of terminals is called an alphabet.
Definition 3-2. String and sets of strings.
Definition 3-3. Language and sentence.
  • Let [math]\Sigma[/math] be an alphabet.
  • Any subset [math]L[/math] of [math]\Sigma^*[/math] is called a language over alphabet [math]\Sigma[/math].
  • Sequence [math]δ = (α_1 α_2 … α_n)[/math], where [math]α_i \in L ∀ i, 1≤i≤n[/math], [math]n \in [/math] natural numbers, is called a sentence in language [math]L[/math].
Definition 3-4. Grammar, rules.
  • A grammar [math]G[/math] is a description of a language L.
  • A grammar [math]G[/math] consists of a lexicon and rules.
  • A lexicon is a structure defines the terminals in a language.
  • Rules describe how the terminals combine into larger entities.
Definition 3-5. Language generated by a grammar, derivation, grammatical and ungrammatical strings.
  • Let L(G) denote that grammar G generates language L.
  • The process of grammar rule applications is referred to as derivation.
  • L(G) is the set of sentences that can be derived by the grammar G.
  • The sentences that grammar G generates are referred to as grammatical.
  • The sentences that are not generated by G are referred to as ungrammatical.
Definition 3-6. Grammar formalism and grammatical theory.
  • A grammar formalism is a language used for expressing grammars.
  • A grammatical theory is the set of statements expressed in a grammar formalism.
  • The recognition problem is connected to the question if a given string is in a given language.
  • The parsing problem is concerned with the kinds of structures that are assigned to a given string.
Definition 3-7. Recognition problem and parsing problem (Ristad 2003, Nivre 2005).
  • The recognition problem (RP) is characterized by the question “Is a given string in a given language or not?”.
  • The parsing problem answers the question: “What structural descriptions are assigned by a given grammar to a given string?”
  • The parsing problem is tied to the corresponding recognition problem; only strings in L(G) are assigned an analysis in the parsing process.
  • Most parsing algorithms in fact solve the recognition and parsing problem simultaneously.
Definition 3-8. Fixed language and universal recognition problem (Ristad 1986).
  • A language may be characterized in terms of all the grammars that generate it.
  • This is referred to as the fixed language RP.
  • It can be stated as follows: “Is the string S in a language L?”
  • Another way to characterize a language is by a particular grammar that generates it.
  • This is referred to as the universal RP (URP), and can be stated as “Is the string S in a given grammar G?”
  • The URP is connected to a specified grammar, and thus, is more closely connected to the parsing problem
Definition 3-9. Context-free phrase structure grammar.

A context-free phrase structure grammar is a 4-tuple G = (N,∑,P,S), where

  • 1. [math]N[/math] is a finite set of nonterminal symbols.
  • 2. ∑ is a finite set of terminal symbols ; ∑ ∩ N = Ø.
  • 3. [math]P[/math] = {αβ|αN,β∈(Σ ∪N)*} is a finite set of production rules.
  • 4. S ∈ N is a distinguished symbol called the sentence symbol.

For example, the rule S → NP VP states that a sentence is a type of phrase that can be composed of a noun phrase (NP) and a verb phrase (VP). The rules of a CFPSG state how to replace a nonterminal without any reference to the context in which a nonterminal is located (Charniak & McDermott 1985).

A phrase structure rule (PS rule) specifies two relations: immediate dominance (ID) between a mother (α dominates β) and linear precedence (LP) relation among the daughters (the order of the symbols on β) (Gazdar et al. 1985)

  • Context-sensitive grammars (CSGs) and context-free grammars (CFGs) are the two most relevant classes of grammars for parsing (Geman & Johnson 2002, Winograd 1983).

3.3.1 Probabilistic, lexicalized and transition network models

In probabilistic grammar formalisms a probability is associated with each rule. For example, probabilistic context-free grammars (PCFGs) can be characterized as CFPSGs that assign to each production the probability of its use. A PCFG (Booth & Thompson 1973) is a 4-tuple G = (N, Σ, P, S) like a CFPSG (Definition 3-9), except that each rule in P is associated with a probability in which [math]α[/math] will be expanded to β.

Definition 3-10. Productions in probabilistic context-free phrase structure grammar.
  • P = {a ®b a Î N,b Î(ΣÈN)*}.

2. There is a probability function p: P→[0,1] such that for each a ÎN,Σ p(a ®b a )= 1.

Definition 3-11. Production rules in a lexicalized context-free phrase structure grammar.
  • P = {a ®b a Î N,b Î(ΣÈN)* a (ΣÈN)*}, where a is a distinguished word in the that is referred to as the anchor.

A straightforward way of constructing a lexicalized grammar is to take a CFPSG and make many copies of each rule – one for every possible anchor word in each phrase (Schabes et al. 1988, Nederhof & Satta 1994).

3.3.2 Unification grammars

Unification grammars (UGs), which are also known as constraint-based grammars, were specifically developed to overcome the problems that CFPSGs encounter when representing fine-grained grammatical information such as agreement and subcategorization (Carpenter 1989). This section introduces three of the best-known UGs, namely, Generalized Phrase Structure Grammar (GPSG) (Gazdar et al. 1985). “Head-driven Phrase Structure Grammar (HPSG) (Pollard & Sag 1987, 1994) and Lexical Functional Grammar (LFG) (Kaplan & Bresnan 1982). UGs are able to model more complex linguistic phenomena than CFGs. These formalisms describe linguistic objects by using feature structures consisting of features and associated values that encode several levels of linguistic information (e.g. morphological, syntactic, semantic) in a uniform way.

Definition 3-12. Head-driven Phrase Structure Grammar (Pollard & Sag 1987).

A Head-driven Phrase Structure Grammar is defined as P1 Ù ...Ù Pn+m Ù (L1 Ú ...Ú Lp Ú R1 Ú ...Ú Rq ) where P1…Pn are universal principles common to all languages, Pn+1…Pn+m are language-specific principles, L1…Lp are the lexical signs of a language, and R1…Rq are its grammar rules.

3.3.3 Tree-adjoining grammars

Tree-adjoining Grammar (TAG), which was first introduced by Joshi et al. (1975), manipulates trees as elementary objects. Different versions of TAGs have been implemented in English (XTAG Research Group 1998), Chinese (Bikel & Chiang 2000), French (Abeillé 1988), German (Neumann 2003), Arabic (Habash & Rambow 2004), Korean (Han et al. 2000) and Hindi (CDAG 2006).

Definition 3-13. Tree-adjoining Grammar.

A Tree-adjoining Grammar is a 5-tuple TAG = (N, Σ, I, A, S) where

  1. . N is a finite set of nonterminal symbols; N ÇS = Æ;
  2. . Σ is a finite set of terminal symbols;
  3. . I is a finite set of finite initial trees;
  4. . A is a finite set of finite auxiliary trees, and
  5. . S ÎN is a distinguished symbol called the sentence symbol.

3.3.4 Dependency grammars

Dependency grammars (DGs) identify the relations that connect words to each other. A fundamental notion in DGs is the relation between a head and a dependent.

Definition 3-14. Dependency relation (adapted from Robison 1970).

Let E be a finite set of sentence elements.

  • DR is a binary dependency relation that ranges over the elements E; D R Í E ´ E . Let e1, e2, xÎE.
  • Let <e1, e2> in DR denote that e1 is dependent on e2. e2 is referred to as the head.
  • If DR holds for <e1, x>…<ek, x>, all eiÎ E, i Î{1...k} are dependent on x.

A head may have any number of dependents, which may be either modifiers or complements. The grammar rules specify the dependents that each head can take (e.g. adjectives depend on nouns and not on verbs). I refer to the DG formalisms, as defined by Hays (1964) and Gaifman (1965), as classical DGs (CDGs). The properties of CDGs can be formally defined as follows:

Definition 3-15. Classical dependency grammar (adapted from Hayes (1964) and Gaifman (1965)).

A dependency grammar is a 4-tuple DG = (C, Σ, R, F) where

  1. . C is a finite set of lexical categories.
  2. . Σ is a finite set of terminal symbols.
  3. . F is an assignment function F : Σ ®C .
  4. . R is a finite set of rules over the lexical categories C that define dependency relations DR. The rules are of the form:
    1. . () x c1,...,*,...,ck denotes that c1…ckÎC are dependent on xÎC . I.e. * indicates the position of head xÎC in the sequence c1…ckÎC,
    2. . x(*) denotes that xÎC is a leaf node, and
    3. . *(x) denotes that xÎC is the sentence root node.

3.3.6 Combinatory Categorial Grammar

In Categorial Grammars (CatG), which originate from the work of Bar-Hillel (1953), the lexicon carries the main responsibility for defining syntactic knowledge (Steedman 2000). CatGs define a finite set of primitive categories (such as N, NP, VP, S) that combine by means of function application rules to create more complex structures. Unlike other grammar formalisms, CatGs do not define a set of rules for combining words. It is rather the definitions in the lexicon that determine how words can combine with each other.

3.4.4 Long-distance dependencies

Long-distance dependencies (LDDs) are problematic for any theory of grammar because one needs to use non-local information when one analyzes the structures that contain them – linguistic structures cannot always be interpreted locally in the places where they are encountered. Table 3-4 illustrates how LDDs can arise out of phenomena such as extraction and coordination.

Table 3-4. Examples of long-distance dependencies.

  • Cause Type Example
  • Extraction Wh-relative clause "This is the player [who1 the coach despises […]1]"
  • Extraction Tough-movement "The coach1 is hard to please […]1"
  • Coordination Sentential gapping "Harry played1 football and Robbie […]1 tennis."
  • Coordination Right-node raising "Harry passed […]1 and Robbie headed the ball1."

4. Syntactic Analysis – Parsing Algorithms

The parsing algorithm is the component of a parser that applies the grammar to the input sentences to construct parse trees. Parsing algorithms are usually not designed for individual grammars, but for classes of grammars. This chapter provides an overview on fundamental search strategies (Section 4.1), parsing algorithms (Section 4.2) and approaches to parsing (viz. probabilistic, lexicalized and finite-state approaches) (Section 4.3). It also analyzes how these are applied in parsing specific grammar formalisms (Section 4.4). Section 4.5 considers the computational properties of the formalisms introduced in Chapter 3. Section 4.6 concludes the discussion on grammar formalisms (Chapter 3) and their computational properties (Chapter 4).

4.1 Introduction

One may consider the problem of how to define search algorithms from several different points of view (Hellwig 2002, Pulman 1993). One may therefore describe parsing algorithms in terms of the direction in which the structure is built (whether top-down (Yngve 1959) or bottom-up (Glennie 1960)), or the way in which the search is executed (either breadth-first or depth-first), or in terms of the direction in which the input words are processed (from left-to-right or right-to-left).

4.2 Parsing Algorithms

This chapter introduces the most fundamental parsing algorithms that were originally developed for parsing CFPSGs (Section 4.2.1) and two commonly used techniques for increasing speed, namely supertagging and CF filtering (Section 4.2.2).

4.2.1 Fundamental algorithms

A common strategy for developing a parsing algorithm for classes of grammars that are more powerful than CFGs is to generalize a CFG algorithm to the more powerful class (Van Noord 1994).

4.3 Probabilistic, Lexicalized and Finite-state Parsing

While non-probabilistic algorithms (such as those described in Section 4.2.1) regard parsing as the recursive application of predetermined rewrite rules, probabilistic parsing (Section 4.3.1) approaches the problem by means of automatically discovering the disambiguation criteria for the decisions made during the parsing process. Because many of the grammar formalisms applied in parsing are lexicalized, lexicalized parsing (Section 4.3.2) has become increasingly important. Probabilistic and lexicalized parsing are often combined. Finite-state machines have been used for many NLP tasks such as segmentation and morphological analysis. Section 4.3.2 introduces a discussion on how FSMs can be applied in parsing.

4.3.1 Probabilistic parsing

In addition to defining the probabilities for parses and thus finding out the most probable analyses, probabilistic information can be used for speeding up the parsing process by ordering the search space (Magerman 1994, Collins 2003). The goal here is to identify the best parse more quickly while simultaneously not undermining the quality of the produced results. Figure 4-2 shows the idea of probabilistic parsing.

Figure 4-2. The components of a probabilistic parser. The sample consists of sentences in language L. It usually consists of annotated sentences obtained from a treebank. The probabilistic model defines possible analyses for sentences in language L. The parsing algorithm defines the analyses for the sentences of a given text in L, relative to the parsing model.

There are three phases in the development of a probabilistic parsing model. In the parametrization phase, the method for defining the probabilities of analyses has to be decided (Hockenmaier 2003). In the training phase, the probability distributions have to be instantiated with a sample. Finally, a method for measuring the quality of a particular model has to be chosen. The model evaluation is typically carried out in the following way. The sample is split into three sets: a training set, a development set, and final test set. The parameters of the model are estimated on the training set, tuned on the development set and tested on the final test set which contains unseen data.

A probabilistic parsing algorithm assigns probabilities p(π|s) to each parse π of the sentence s and

4.4 Examples of Grammar Formalism-specific Algorithms

This section introduces techniques that are applied in parsing specific grammar formalisms, and it offers examples of parsers for these formalisms.

4.4.1 Parsing PCFGs

Head-driven statistical parsing (Collins 1996) extends the basic PCFG model by lexicalization. Collins has extended this original framework by introducing what are generally known as the Collins Models 1, 2 and 3 (Collins (1997, 2003)). Model 1 is the baseline generative model based on (Collins 1996). Model 2 makes a distinction between complements and adjuncts and has parameters that correspond directly to the probability distributions over subcategorization frames for head-words. Model 3 adds a probabilistic treatment of a type of LDD, whmovement.

  • Another well-known example of the use of a lexicalized PCFG is the parser by Charniak (1996, 1997, 2000). Like Collins, Charniak uses the Markov process for rule generation. This approach is based on an ME model. Charniak’s model uses a richer set of features than Collins’s models.

5.1 Problems in Parsing and Their Solutions

At the time of writing (2007), no domain- and genre-independent natural language parser capable of producing error-free parses for running text has ever been devised. The following factors make it difficult to parse unrestricted text (Leech et al. 1996, Briscoe 1998):

  1. . Ambiguity is an inherent property of all natural languages, and it affects parsers on the level of lexicon and grammar. Linguistic expressions taken out of context are ambiguous and incomplete. Parsing a sentence often thus results in more than one analysis for the input sentence.
  2. . Sentences in free text are often long and contain several clauses and phrases. This causes the number of possible analyses to grow exponentially.
  3. . Because parsers have no world knowledge, they have to rely solely on whatever information they can derive from linguistic rules.
  4. . It is necessary for a grammar with a broad coverage50 to be
Example 5-1.

Globally ambiguous sentence A can be interpreted semantically in at least four different ways. The most probable reading is that a female avoids balls flying overhead by ducking. The other interpretations involve odd scenarios featuring a bird being constructed and a woman flying. Although the locally ambiguous sentence B is not ambiguous as a whole, if the three last words are examined in isolation, one can come up with the interpretation that Liverpool sold Everton – even though the sentence as a whole does not make such proposition. Sentence C is a garden path sentence.

  • A) Flying balls made her duck.
  • B) The company that bought Liverpool sold Everton.
  • C) The referee who whistles tunes the whistle.

Structural ambiguity has multiple causes (Jurafsky & Martin 2000). Coordination ambiguity is caused by a situation in which different sets of phrases can be conjoined. The phrase “accurate shots and crosses”, for example, can be interpreted so that “accurate” modifies either “shots” or “shots and crosses”. In attachment ambiguity, a constituent can be attached to more than one place in a sentence. Resolving attachment ambiguities correctly requires the use of several sources of information. In the sentence “The coach saw the player with the telescope”, we have an example of a common type of attachment ambiguity, namely prepositional phrase (PP) attachment ambiguity (Collins 1999).51 This sentence may be analyzed in at least two possible ways – the PP “with the telescope” modifies either “coach” or “saw”, and this leads to the analyses shown in Figure 5-1.


6. Analysis of Existing Resources and Schemes

Evaluation of the correctness of a parser’s output is generally done by comparing the system output to correct human-constructed structures. These gold standard parses are obtained from a linguistic resource. Section 6.1 analyzes existing linguistic resources and their suitability for parser evaluation. Linguistic annotation (hereafter referred to as annotation) refers to the notations applied to language data that describes its information content. The annotation in a treebank, for example, includes at least POS tags and syntactic tags. An annotation scheme refers to the specification of a set of practices used for annotation in a particular linguistic resource. An encoding scheme defines the way in which the annotated data is represented. I will both introduce the annotation and encoding schemes used in existing linguistic resources and analyze their suitability for parser

6.1 Evaluation Resources

The most commonly used linguistic resources for parser evaluation are treebanks, which are collections of syntactically annotated sentences. These syntactically annotated corpora consist of sentences which have been assigned parse trees with at least syntactic and morphosyntactic annotation. Treebanks are described in Section 6.1.1. Test suites are collections of annotated test items that are organized in terms of specific linguistic phenomena (Section 6.1.2). One may describe them as treebanks that have been tailored for evaluation purposes because all the sentences in them have been annotated with syntactic information that elucidates the syntactic phenomena they contain. Section 6.1.3 describes a corpus of ungrammatical sentences. I conclude the findings in Section 6.1.4.

10.3.4 Conclusion

The six parsers were able to cover, on average, 82.4% of the sentences. The coverage was, unsurprisingly, highest on the newspaper genre. The lowest average coverage was achieved on the religion and legislation genres. The difficulties in parsing the religious texts are attributable at least in part to the length of the sentences in the sub-corpus (on average 27.1 words per sentence), which was the highest over all the genres. The legislation genre consists of transcribed speech, which may be the main reason for the lower-than-average performance on that data. Contrary to my expectation, the biomedical genre, with its specialist terminology, was not the most difficult genre for the parsers.



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2007 FrameworkAndResourcesForNatLangEvalTuomo KakkonenFramework and Resources for Natural Language EvaluationAcademic Dissertationhttp://arxiv.org/ftp/arxiv/papers/0712/0712.3705.pdf2007