2000 Advances in Domain Independent Linear Text Segmentation

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Abstract

This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.

1 Introduction

Even moderately long documents typically address several topics or different aspects of the same topic. The aim of linear text segmentation is to discover the topic boundaries. The uses of this procedure include information retrieval (Hearst and Plaunt, 1993; Hearst, 1994; Yaari, 1997; Reynar, 1999), summarization (Reynar, 1998), text understanding, anaphora resolution (Kozima, 1993), language modelling (Morris and Hirst, 1991; Beeferman et al., 1997b) and improving document navigation for the visually disabled (Choi, 2000).

This paper focuses on domain independent methods for segmenting written text. We present a new algorithm that builds on previous work by Reynar (Reynar, 1998; Reynar, 1994). The primary distinction of our method is the use of a ranking scheme and the cosine similarity measure (van Rijsbergen, 1979) in formulating the similarity matrix. We propose that the similarity values of short text segments are statistically insignificant. Thus, one can only rely on their order, or rank, for clustering.
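The similarity matrix described above pairs every sentence with every other sentence via the cosine measure over word-frequency vectors. A minimal sketch of that construction (the function names and the toy sentences are illustrative, not from the paper):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two word-frequency vectors (Counters)."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def similarity_matrix(sentences):
    """Pairwise cosine similarity between tokenised sentences."""
    vecs = [Counter(s) for s in sentences]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

# Toy example: two sentences on one topic, one on another.
sents = [["cat", "sat", "mat"], ["cat", "mat"], ["stock", "market", "fell"]]
sim = similarity_matrix(sents)
```

On short segments such as these, the paper argues that the raw values in `sim` are too noisy to trust directly; only their relative order is used.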

2 Background

Existing work falls into one of two categories, lexical cohesion methods and multi-source methods (Yaari, 1997). The former stem from the work of Halliday and Hasan (Halliday and Hasan, 1976). They proposed that text segments with similar vocabulary are likely to be part of a coherent topic segment.

Implementations of this idea use word stem repetition (Youmans, 1991; Reynar, 1994; Ponte and Croft, 1997), context vectors (Hearst, 1994; Yaari, 1997; Kaufmann, 1999; Eichmann et al., 1999), entity repetition (Kan et al., 1998), semantic similarity (Morris and Hirst, 1991; Kozima, 1993), a word distance model (Beeferman et al., 1997a) and a word frequency model (Reynar, 1999) to detect cohesion. Methods for finding the topic boundaries include sliding window (Hearst, 1994), lexical chains (Morris, 1988; Kan et al., 1998), dynamic programming (Ponte and Croft, 1997; Heinonen, 1998), agglomerative clustering (Yaari, 1997) and divisive clustering (Reynar, 1994). Lexical cohesion methods are typically used for segmenting written text in a collection to improve information retrieval (Hearst, 1994; Reynar, 1998).

Multi-source methods combine lexical cohesion with other indicators of topic shift such as cue phrases, prosodic features, reference, syntax and lexical attraction (Beeferman et al., 1997a) using decision trees (Mike et al., 1994; Kurohashi and Nagao, 1994; Litman and Passonneau, 1995) and probabilistic models (Beeferman et al., 1997b; Hajime et al., 1998; Reynar, 1998). Work in this area is largely motivated by the topic detection and tracking (TDT) initiative (Allan et al., 1998). The focus is on the segmentation of transcribed spoken text and broadcast news stories where the presentation format and regular cues can be exploited to improve accuracy.

5 Conclusions and future work

A segmentation algorithm has two key elements, a clustering strategy and a similarity measure. Our results show divisive clustering (R98) is more precise than sliding window (H94) and lexical chains (K98) for locating topic boundaries.
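The divisive strategy favoured here can be sketched as greedy boundary insertion: starting from one segment, repeatedly add the cut that maximises within-segment similarity. This is a simplified stand-in for the paper's density criterion, not its exact algorithm; `split_segments` and the toy matrix are hypothetical:

```python
def split_segments(sim, num_segs):
    """Greedy divisive clustering: repeatedly insert the boundary that
    maximises mean within-segment similarity (simplified illustration)."""
    n = len(sim)
    bounds = [0, n]  # segment boundaries, including both ends

    def score(cand):
        tot = cnt = 0.0
        for a, b in zip(cand, cand[1:]):
            for i in range(a, b):
                for j in range(a, b):
                    tot += sim[i][j]
                    cnt += 1
        return tot / cnt

    while len(bounds) - 1 < num_segs:
        best = None
        for cut in range(1, n):
            if cut in bounds:
                continue
            cand = sorted(bounds + [cut])
            s = score(cand)
            if best is None or s > best[0]:
                best = (s, cand)
        bounds = best[1]
    return bounds

# Two clearly separated "topics": sentences {0,1} and {2,3}.
sim = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.9],
       [0.1, 0.1, 0.9, 1.0]]
bounds = split_segments(sim, 2)
```

On this block-diagonal matrix the greedy split recovers the boundary between the two topics at position 2.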

Four similarity measures were examined. The cosine coefficient (R98(S,cos)) and dot density measure (R98(S,dot)) yield similar results. Our spread activation based semantic measure (R98(S,sa)) improved accuracy. This confirms that although Kozima's approach (Kozima, 1993) is computationally expensive, it does produce more precise segmentation.

The most significant improvement was due to our ranking scheme which linearises the cosine coefficient. Our experiments demonstrate that given insufficient data, the qualitative behaviour of the cosine measure is indeed more reliable than the actual values.
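The linearisation the conclusion credits can be sketched as a rank transform: each cell of the similarity matrix is replaced by the proportion of its neighbours, in a small window, that hold a lower value. This is a minimal illustration assuming a 3x3 mask; the function name and mask size are illustrative choices, not the paper's parameters:

```python
def rank_transform(sim, mask=3):
    """Replace each similarity value by the fraction of its neighbours
    (within a mask x mask window) that have a strictly lower value."""
    n = len(sim)
    r = mask // 2
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = total = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue  # skip the cell itself
                    ii, jj = i + di, j + dj
                    if 0 <= ii < n and 0 <= jj < n:
                        total += 1
                        if sim[ii][jj] < sim[i][j]:
                            lower += 1
            ranked[i][j] = lower / total if total else 0.0
    return ranked

ranked = rank_transform([[1.0, 0.2],
                         [0.2, 1.0]])
```

Because only comparisons against neighbours matter, the transform discards the unreliable magnitudes of the cosine values while preserving their qualitative ordering.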

Although our evaluation scheme is sufficient for this comparative study, further research requires a large scale, task independent benchmark. It would be interesting to compare C99 with the multi-source method described in (Beeferman et al., 1999) using the TDT corpus. We would also like to develop a linear time and multi-source version of the algorithm.

References


Freddy Y. Y. Choi. (2000). "Advances in Domain Independent Linear Text Segmentation."