2005 FlexibleTextSegWithStructMultiClass


Subject Headings: Multilabel Classification Task, Text Segmentation Task, Sequence Segmentation Statistical Models, Supervised Sequence Segmentation Task, Tokenization Task, Complex Relation Mention Recognition Task.


Cited by

~38 http://scholar.google.com/scholar?cites=700181366727296922



  • Many language processing tasks can be reduced to breaking the text into segments with prescribed properties. Such tasks include sentence splitting, tokenization, named-entity extraction, and chunking. We present a new model of text segmentation based on ideas from multilabel classification. Using this model, we can naturally represent segmentation problems involving overlapping and non-contiguous segments. We evaluate the model on entity extraction and noun-phrase chunking and show that it is more accurate for overlapping and non-contiguous segments, but it still performs well on simpler data sets for which sequential tagging has been the best method.


  • Text segmentation is a basic task in language processing, with applications such as tokenization, sentence splitting, named-entity extraction, and chunking. Many parsers, translation systems, and extraction systems rely on such segmentations to accurately process the data. Depending on the application, segments may be tokens, phrases, or sentences. However, in this paper we primarily focus on segmenting sentences into tokens.
  • The most common approach to text segmentation is to use finite-state sequence tagging models, in which each atomic text element (character or token) is labeled with a tag representing its role in a segmentation. Models of that form include hidden Markov models (Rabiner, 1989; Bikel et al., 1999) as well as discriminative tagging models based on maximum entropy classification (Ratnaparkhi, 1996; McCallum et al., 2000), conditional random fields (Lafferty et al., 2001; Sha and Pereira, 2003), and large-margin techniques (Kudo and Matsumoto, 2001; Taskar et al., 2003). Tagging models are the best previous methods for text segmentation. However, their purely sequential form limits their ability to naturally handle overlapping or noncontiguous segments.
  • We present here an alternative view of segmentation as structured multilabel classification. In this view, a segmentation of a text is a set of segments, each of which is defined by the set of text positions that belong to the segment. Thus, a particular segment may not be a set of consecutive positions in the text, and segments may overlap. Given a text [math]x = x_1 \cdots x_n[/math], the set of possible segments, which corresponds to the set of possible classification labels, is [math]\mathrm{seg}(x) = \{O, I\}^n[/math]; for [math]y \in \mathrm{seg}(x)[/math], [math]y_i = I[/math] iff [math]x_i[/math] belongs to the segment. Then, our segmentation task is to determine which labels are correct segments in a given text. We have thus a structured multilabel classification problem: each instance, a text, may have multiple structured labels, representing each of its segments. These labels are structured in that they do not come from a predefined set, but instead are built from sets of choices associated to the elements of arbitrarily long instances.
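The label representation quoted above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`to_label`, `from_label`, `is_contiguous`, `overlaps`), the 0-based positions, and the example phrase are all assumptions made for illustration. It only shows how a segment, viewed as a set of text positions, maps to a label in {O, I}^n, and why overlapping or non-contiguous segments pose no problem in this encoding.

```python
from typing import FrozenSet, List

# A segment is the set of (0-based) text positions that belong to it.
Segment = FrozenSet[int]


def to_label(segment: Segment, n: int) -> str:
    """Encode a segment as y in {O, I}^n, where y_i = I iff position i is in the segment."""
    return "".join("I" if i in segment else "O" for i in range(n))


def from_label(label: str) -> Segment:
    """Decode a label string over {O, I} back into a set of positions."""
    return frozenset(i for i, tag in enumerate(label) if tag == "I")


def is_contiguous(segment: Segment) -> bool:
    """True if the segment covers one unbroken run of positions."""
    return bool(segment) and max(segment) - min(segment) + 1 == len(segment)


def overlaps(a: Segment, b: Segment) -> bool:
    """True if two segments share at least one position."""
    return bool(a & b)


# Hypothetical example text (not taken from the paper).
tokens: List[str] = ["left", "and", "right", "ventricles"]
n = len(tokens)

left_ventricles: Segment = frozenset({0, 3})   # non-contiguous segment
right_ventricles: Segment = frozenset({2, 3})  # shares "ventricles" with the first

# A segmentation is a set of segments, i.e. a set of labels for one text.
segmentation = {left_ventricles, right_ventricles}

for seg in sorted(segmentation, key=sorted):
    words = " ".join(tokens[i] for i in sorted(seg))
    print(to_label(seg, n), words, "contiguous" if is_contiguous(seg) else "non-contiguous")
```

In this example the two segments overlap on the last token, so a single tag per token (as in a purely sequential tagger) could not assign both readings at once, whereas two separate labels over {O, I}^n represent them directly.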



2005 FlexibleTextSegWithStructMultiClass
Author: Ryan T. McDonald, Koby Crammer, Fernando Pereira
Title: Flexible Text Segmentation with Structured Multilabel Classification
Journal: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing
Title URL: http://delivery.acm.org/10.1145/1230000/1220699/p987-mcdonald.pdf?key1=1220699&key2=6474237921&coll=DL&dl=ACM&CFID=9392163&CFTOKEN=24366078
Year: 2005