- (Marcu, 1997) ⇒ Daniel Marcu. (1997). “The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts.” PhD Thesis. University of Toronto.
- (Marcu, 2000) ⇒ Daniel Marcu. (2000). “The Theory and Practice of Discourse Parsing and Summarization.” MIT Press. ISBN:0262133725
This thesis is an inquiry into the nature of the high-level, rhetorical structure of unrestricted natural language texts, computational means to enable its derivation, and two applications (in automatic summarization and natural language generation) that follow from the ability to build such structures automatically.
The thesis proposes a first-order formalization of the high-level, rhetorical structure of text. The formalization assumes that text can be sequenced into elementary units; that discourse relations hold between textual units of various sizes; that some textual units are more important to the writer's purpose than others; and that trees are a good approximation of the abstract structure of text. The formalization also introduces a linguistically motivated compositionality criterion, which is shown to hold for the text structures that are valid. The thesis proposes, analyzes theoretically, and compares empirically four algorithms for determining the valid text structures of a sequence of units among which some rhetorical relations hold. Two algorithms apply model-theoretic techniques; the other two apply proof-theoretic techniques.
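The four assumptions listed above (elementary units, relations between spans of various sizes, differing importance, and a tree shape) can be illustrated with a small data structure. The following is a minimal sketch, not the thesis's own formalization: the class name `TextTree`, the relation labels, and the requirement that every internal node has at least one nucleus child are assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TextTree:
    """Hypothetical node of a text structure tree (illustrative only)."""
    status: str                      # "nucleus" or "satellite": relative importance
    relation: Optional[str] = None   # relation among the children; None for a leaf
    unit: Optional[int] = None       # 1-based index of an elementary unit (leaves only)
    children: List["TextTree"] = field(default_factory=list)


def span(tree: TextTree) -> Tuple[int, int]:
    """Return the (first, last) elementary unit covered by this node."""
    if not tree.children:
        return (tree.unit, tree.unit)
    return (span(tree.children[0])[0], span(tree.children[-1])[1])


def is_well_formed(tree: TextTree) -> bool:
    """Check that leaves carry a unit, children cover adjacent spans,
    and (an assumption of this sketch) each internal node has a nucleus."""
    if not tree.children:
        return tree.unit is not None
    if not all(is_well_formed(child) for child in tree.children):
        return False
    if not any(child.status == "nucleus" for child in tree.children):
        return False
    spans = [span(child) for child in tree.children]
    return all(a[1] + 1 == b[0] for a, b in zip(spans, spans[1:]))


# A tree over three elementary units; the relation labels are placeholders.
example = TextTree(
    status="nucleus",
    relation="elaboration",
    children=[
        TextTree(status="nucleus", relation="justification", children=[
            TextTree(status="satellite", unit=1),
            TextTree(status="nucleus", unit=2),
        ]),
        TextTree(status="satellite", unit=3),
    ],
)
```

Here `span(example)` covers units 1 through 3, and `is_well_formed` rejects a node whose children skip a unit, which reflects the idea that a relation holds between adjacent spans of the unit sequence.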
The formalization and the algorithms mentioned so far correspond to the theoretical facet of the thesis. An exploratory corpus analysis of cue phrases provides the means for applying the formalization to unrestricted natural language texts. A set of empirically motivated algorithms were designed in order to determine the elementary textual units of a text, to hypothesize rhetorical relations that hold among these units, and eventually, to derive the discourse structure of that text. The process that finds the discourse structure of unrestricted natural language texts is called rhetorical parsing.
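To make the role of cue phrases concrete, here is a toy sketch of the first two steps: splitting text into units at cue phrases and hypothesizing a relation for each unit that begins with one. It is not the thesis's algorithm. The cue-phrase table, the relation labels, and the function names `segment` and `hypothesize` are all assumptions made for illustration; an empirically derived set of cue phrases and rules would be far richer.

```python
import re
from typing import List, Tuple

# Hypothetical cue-phrase table; the phrases and labels are illustrative only.
CUE_PHRASES = {"although": "CONCESSION", "because": "CAUSE", "but": "CONTRAST"}

_CUE_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, CUE_PHRASES)) + r")\b", re.IGNORECASE
)


def segment(text: str) -> List[str]:
    """Split text into units, starting a new unit at each cue phrase."""
    cuts = [m.start() for m in _CUE_RE.finditer(text) if m.start() > 0]
    bounds = [0] + cuts + [len(text)]
    pieces = (text[a:b].strip() for a, b in zip(bounds, bounds[1:]))
    return [piece for piece in pieces if piece]


def hypothesize(units: List[str]) -> List[Tuple[int, str]]:
    """Pair each unit that starts with a cue phrase with a relation label."""
    hypotheses = []
    for index, unit in enumerate(units):
        match = _CUE_RE.match(unit)
        if match:
            hypotheses.append((index, CUE_PHRASES[match.group(1).lower()]))
    return hypotheses


sentence = "Although the orbit is distant, the probe arrived because the launch was precise."
units = segment(sentence)
relations = hypothesize(units)
```

The sketch ignores punctuation-based boundaries and ambiguous cue phrases, both of which a real rhetorical parser would have to handle.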
The thesis explores two possible applications of the text theory that it proposes. The first application concerns a discourse-based summarization system, which is shown to significantly outperform both a baseline algorithm and a commercial system. An empirical psycholinguistic experiment not only provides an objective evaluation of the summarization system, but also confirms the adequacy of using the text theory proposed here in order to determine the most important units in a text. The second application concerns a set of text planning algorithms that can be used by natural language generation systems in order to construct text plans in the cases in which the high-level communicative goal is to map an entire knowledge pool into text.
2.2 A formalization of text structures from first principles
2.2.1 The essential features of text structures
If we examine carefully the claims that current theories make with respect to the structure of text and discourse, we will find significant commonalities. Essentially, all these theories acknowledge that the elementary textual units are non-overlapping spans of text; that there exist rhetorical, coherence, and cohesive relations between textual units of various sizes; that some textual units play a more important role in text than others; and that the abstract structure of most texts is a tree-like structure. I now discuss each of these features in turn.
The elementary units of complex text structures are non-overlapping spans of text.
For example, if we take clause-like spans to be the elementary units of text, the text fragment in 2.1 can be broken into 6 units, as shown below. Each elementary unit appears on its own line.
- With its distant orbit
- – 50 percent farther from the sun than Earth –
- and slim atmospheric blanket,
- Mars experiences frigid weather conditions.
- Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles,
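The non-overlap property of elementary units can be checked mechanically once each unit is represented as a character span. The sketch below assumes half-open `(start, end)` offsets and uses a shortened version of the Mars sentence; the function names `locate_units` and `non_overlapping` are hypothetical and not taken from the thesis.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character offsets: [start, end)


def locate_units(text: str, units: List[str]) -> List[Span]:
    """Find each unit in the text, in order, and return its character span."""
    spans = []
    cursor = 0
    for unit in units:
        start = text.find(unit, cursor)
        if start == -1:
            raise ValueError(f"unit not found in order: {unit!r}")
        end = start + len(unit)
        spans.append((start, end))
        cursor = end  # the next unit must begin at or after this point
    return spans


def non_overlapping(spans: List[Span]) -> bool:
    """Return True if no two spans share a character."""
    ordered = sorted(spans)
    return all(
        prev_end <= next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )


# Shortened example text (the parenthetical unit is omitted for brevity).
TEXT = ("With its distant orbit and slim atmospheric blanket, "
        "Mars experiences frigid weather conditions.")
UNITS = [
    "With its distant orbit",
    "and slim atmospheric blanket,",
    "Mars experiences frigid weather conditions.",
]
spans = locate_units(TEXT, UNITS)
```

Because `locate_units` searches from the end of the previous unit, the spans it returns never overlap; `non_overlapping` is useful for validating spans that come from another source, such as a hand-made annotation.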
|1997 TheRhetoricalParsingSummAndGenOfNL||Daniel Marcu||The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts||ftp://ftp.cs.toronto.edu/pub/gh/Marcu-PhDthesis.pdf||1997|