1999 RepresentingTextChunks

From GM-RKB

Subject Headings: Text Chunk.

Notes

Cited By

~132 http://scholar.google.com/scholar?cites=12459599473331341621

Quotes

Abstract

  • Dividing sentences in chunks of words is a useful preprocessing step for parsing, information extraction and information retrieval. (Ramshaw and Marcus, 1995) have introduced a "convenient" data representation for chunking by converting it to a tagging task. In this paper we will examine seven different data representations for the problem of recognizing noun phrase chunks. We will show that the data representation choice has a minor influence on chunking performance. However, equipped with the most suitable data representation, our memory-based learning chunker was able to improve the best published chunking results for a standard data set.

2 Methods and experiments

  • In this section we present and explain the data representation formats and the machine learning algorithm that we have used. In the final part we describe the feature representation used in our experiments.

2.1 Data representation

  • We have compared four complete and three partial data representation formats for the baseNP recognition task presented in (Ramshaw and Marcus, 1995). The four complete formats all use an I tag for words that are inside a baseNP and an O tag for words that are outside a baseNP. They differ in their treatment of chunk-initial and chunk-final words:
    • IOB1 The first word inside a baseNP immediately following another baseNP receives a B tag (Ramshaw and Marcus, 1995).
    • IOB2 All baseNP-initial words receive a B tag (Ratnaparkhi, 1998).
    • IOE1 The final word inside a baseNP immediately preceding another baseNP receives an E tag.
    • IOE2 All baseNP-final words receive an E tag.
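The four complete formats above can be sketched as a small encoder. The `encode` helper below and its chunk-id input convention are assumptions for illustration, not code from the paper; it tags the paper's example sentence from Table 1.

```python
def encode(chunks, fmt):
    """Encode per-token chunk membership into a complete tagging format.

    `chunks` lists one entry per token: a chunk id (consecutive tokens
    sharing an id form one baseNP) or None for words outside any chunk.
    `fmt` is one of "IOB1", "IOB2", "IOE1", "IOE2".
    """
    tags = []
    n = len(chunks)
    for i, c in enumerate(chunks):
        if c is None:
            tags.append("O")
            continue
        first = i == 0 or chunks[i - 1] != c            # chunk-initial word
        last = i == n - 1 or chunks[i + 1] != c         # chunk-final word
        # chunk-initial word immediately following another baseNP
        after = first and i > 0 and chunks[i - 1] is not None
        # chunk-final word immediately preceding another baseNP
        before = last and i < n - 1 and chunks[i + 1] is not None
        if fmt == "IOB1":
            tags.append("B" if after else "I")
        elif fmt == "IOB2":
            tags.append("B" if first else "I")
        elif fmt == "IOE1":
            tags.append("E" if before else "I")
        elif fmt == "IOE2":
            tags.append("E" if last else "I")
    return tags

# "In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce ."
EXAMPLE = [None, 1, 1, None, 2, 2, 3, None, 4, None, None, None, 5, 5, 6, 6, None]
```

For instance, `encode(EXAMPLE, "IOB2")` reproduces the IOB2 row of Table 1.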
  • We wanted to compare these data representation formats with a standard bracket representation. We have chosen to divide the bracketing experiments into two parts: one for recognizing opening brackets and one for recognizing closing brackets. Additionally, we have worked with another partial representation which seemed promising: a tagging representation that disregards boundaries between adjacent chunks. These boundaries can be recovered by combining this format with one of the bracketing formats. Our three partial representations are:
    • [ All baseNP-initial words receive an [ tag, other words receive a . tag.
    • ] All baseNP-final words receive a ] tag, other words receive a . tag.
    • IO Words inside a baseNP receive an I tag, others receive an O tag.
  • These partial representations can be combined in three pairs which encode the complete baseNP structure of the data:
    • [ + ] A word sequence is regarded as a baseNP if the first word has received an [ tag, the final word has received a ] tag and these are the only brackets that have been assigned to words in the sequence.
    • [ + IO In the IO format, tags of words that have received an I tag and an [ tag are changed into B tags. The result is interpreted as the IOB2 format.
    • IO + ] In the IO format, tags of words that have received an I tag and a ] tag are changed into E tags. The result is interpreted as the IOE2 format.
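The [ + ] combination rule can be sketched as follows. The `match_brackets` helper is a hypothetical reconstruction of the stated rule, not code from the paper: a token span counts as a baseNP when it starts at an [ tag, ends at the first ] tag at or after it, and contains no other bracket.

```python
def match_brackets(opens, closes):
    """Combine the '[' and ']' partial tag sequences into chunk spans.

    `opens` and `closes` are tag lists over the same tokens, containing
    '[' / '.' and ']' / '.' respectively. Returns (start, end) token
    index pairs, inclusive on both ends.
    """
    spans = []
    for i, tag in enumerate(opens):
        if tag != "[":
            continue
        for j in range(i, len(closes)):
            if j > i and opens[j] == "[":
                break  # another chunk opens first: no well-formed span
            if closes[j] == "]":
                spans.append((i, j))
                break
    return spans
```

Applied to the bracket rows of Table 1, this recovers the six baseNPs of the example sentence (early trading, Hong Kong, Monday, gold, $ 366.50, an ounce).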
  • Table 1: The chunk tag sequences for the example sentence "In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce ." in seven different tagging formats. The I tag has been used for words inside a baseNP, O for words outside a baseNP, B and [ for baseNP-initial words, and E and ] for baseNP-final words.
Word In early trading in Hong Kong Monday , gold was quoted at $ 366.50 an ounce .
IOB1 O  I     I       O  I    I    B      O I    O   O      O  I I      B  I     O
IOB2 O  B     I       O  B    I    B      O B    O   O      O  B I      B  I     O
IOE1 O  I     I       O  I    E    I      O I    O   O      O  I E      I  I     O
IOE2 O  I     E       O  I    E    E      O E    O   O      O  I E      I  E     O
IO   O  I     I       O  I    I    I      O I    O   O      O  I I      I  I     O
[    .  [     .       .  [    .    [      . [    .   .      .  [ .      [  .     .
]    .  .     ]       .  .    ]    ]      . ]    .   .      .  . ]      .  ]     .

References


1999. Erik Tjong Kim Sang, and Jorn Veenstra. "Representing Text Chunks." http://xxx.lanl.gov/PS_cache/cs/pdf/9907/9907006v1.pdf