1996 NounPhraseAnalysisInUnrTextForIR

(Evans & Zhai, 1996) ⇒ David A. Evans, Chengxiang Zhai. (1996). “Noun-Phrase Analysis in Unrestricted Text for Information Retrieval.” In: Proceedings of the 34th annual meeting on Association for Computational Linguistics.

Subject Headings: Noun Phrase Parsing.

Quotes

Abstract

Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.

1.3 Our Work

In particular, we can choose to treat some phrasal structures as atomic units and others as additional information about (or representations of) content.

3. Methodology

After preprocessing, the system works in two stages--parsing and generation. In the parsing stage, each simplex noun phrase in the corpus is parsed. In the generation stage, the structured noun phrase is used to generate candidates for all four kinds of small compounds, which are further tested for occurrence (validity) in the corpus.
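The generate-then-validate step described above can be illustrated with a minimal Python sketch. It is not the paper's procedure: the four kinds of small compounds are not detailed in this excerpt, so the sketch assumes a much simpler candidate set (all ordered word pairs, adjacent or not) built from a flat word list rather than a parsed structure, and it treats a candidate as valid if it occurs as a contiguous word pair somewhere in the corpus. The names `generate_candidates` and `attested` are hypothetical.

```python
from itertools import combinations

def generate_candidates(words):
    """Return all ordered word pairs, adjacent or not.

    Assumption: this stands in for the paper's four kinds of small
    compounds, which are not specified in the quoted text.
    """
    return [" ".join(pair) for pair in combinations(words, 2)]

def attested(candidates, corpus_phrases):
    """Keep candidates that occur as a contiguous word pair in the corpus."""
    seen = set()
    for phrase in corpus_phrases:
        tokens = phrase.split()
        for i in range(len(tokens) - 1):
            seen.add(tokens[i] + " " + tokens[i + 1])
    return [c for c in candidates if c in seen]

# Toy corpus, invented for illustration only.
corpus = ["high performance", "performance computer design", "fast computer"]
candidates = generate_candidates(["high", "performance", "computer"])
valid = attested(candidates, corpus)
print(candidates)
print(valid)
```

In this toy run the discontinuous candidate "high computer" is generated but dropped, because it never occurs in the corpus.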
Parsing of simplex noun phrases is done in multiple phases.
The detection of lexical atoms, like the parsing of simplex noun phrases, is also done in multiple phases. At each phase, only two adjacent units are considered. So, initially, only two-word lexical atoms can be detected. But, once a pair is determined to be a lexical atom, it will behave exactly like a single word in subsequent processing, so, in later phases, atoms with more than two words can be detected.
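The phase-by-phase behaviour described above can be sketched in Python. This is only an illustration: the quoted text does not give the criterion for accepting a pair as a lexical atom, so the sketch assumes a stand-in rule (accept the most frequent adjacent pair in each phase if it occurs at least `min_count` times). The function names, the threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter

def merge_pair(units, pair):
    """Replace each adjacent occurrence of `pair` with one joined unit."""
    merged, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and (units[i], units[i + 1]) == pair:
            merged.append(units[i] + "_" + units[i + 1])
            i += 2
        else:
            merged.append(units[i])
            i += 1
    return merged

def detect_lexical_atoms(phrases, min_count=2, max_phases=5):
    """Detect atoms in phases; each phase looks only at adjacent units.

    Assumption: a pair is accepted when it is the most frequent adjacent
    pair and occurs at least `min_count` times (not the paper's criterion).
    """
    phrases = [list(p) for p in phrases]
    atoms = []
    for _ in range(max_phases):
        pair_counts = Counter(
            (p[i], p[i + 1]) for p in phrases for i in range(len(p) - 1)
        )
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < min_count:
            break
        atoms.append("_".join(best_pair))
        # An accepted pair now behaves like a single word in later phases.
        phrases = [merge_pair(p, best_pair) for p in phrases]
    return atoms, phrases

# Toy corpus, invented for illustration only.
corpus = [
    ["operating", "system", "kernel"],
    ["operating", "system", "kernel"],
    ["operating", "system"],
    ["new", "kernel"],
]
atoms, segmented = detect_lexical_atoms(corpus)
print(atoms)
print(segmented)
```

The first phase can only find the two-word unit; the three-word unit becomes detectable in the second phase because the merged pair is then treated as one word.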
The idea of association-based parsing is that by grouping words together (based on association) many times, we will eventually discover the most restrictive (and informative) structure of a noun phrase. For example, if we have evidence from the corpus that "high performance" is a more reliable association and "general purpose" a less reliable one, then the noun phrase "general purpose high performance computer" (an actual example from the CACM corpus) would undergo the following grouping process:
- general purpose high performance computer =>
  - general purpose [high=performance] computer =>
    - [general=purpose] [high=performance] computer =>
      - [general=purpose] [high=performance=computer]
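The grouping process above can be sketched as a greedy bottom-up procedure in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the association scores in `SCORES` are invented, the scoring function simply looks up the surface strings of two adjacent units, and the loop keeps merging until a single unit remains. All names (`association_parse`, `render`, `SCORES`) are hypothetical.

```python
def text(unit):
    """Flatten a unit (a word or a nested pair) to its surface string."""
    return unit if isinstance(unit, str) else " ".join(text(u) for u in unit)

def render(unit):
    """Show grouped units in bracket notation, e.g. [high=performance]."""
    if isinstance(unit, str):
        return unit
    return "[" + "=".join(render(u) for u in unit) + "]"

# Invented association scores; unlisted pairs default to 0.0.
SCORES = {
    ("high", "performance"): 0.9,
    ("general", "purpose"): 0.6,
    ("high performance", "computer"): 0.5,
}

def score(left, right):
    return SCORES.get((text(left), text(right)), 0.0)

def association_parse(words):
    """Repeatedly join the adjacent pair of units with the highest score."""
    units = list(words)
    steps = [" ".join(render(u) for u in units)]
    while len(units) > 1:
        best = max(range(len(units) - 1),
                   key=lambda i: score(units[i], units[i + 1]))
        units[best:best + 2] = [(units[best], units[best + 1])]
        steps.append(" ".join(render(u) for u in units))
    return units[0], steps

tree, steps = association_parse(
    ["general", "purpose", "high", "performance", "computer"]
)
for step in steps:
    print(step)
```

Because "high performance" has the highest assumed score, it is grouped first, then "general purpose", and only then is "computer" attached to the grouped modifier.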

6. Conclusions

The notion of association-based parsing dates at least from (Marcus, 1980) and has been explored again recently by a number of researchers. The method we have developed differs from previous work in that it uses linguistic heuristics and locality scoring along with corpus statistics to generate phrase associations.
The experiment contrasting the PES with baseline processing in a commercial IR system demonstrates a direct, positive effect of the use of lexical atoms, subphrases, and other phrase associations across simplex NPs. We believe the use of NP-substructure analysis can lead to more effective information management, including more precise IR, text summarization, and concept clustering. Our future work will explore such applications of the techniques we have described in this paper.


Author: ChengXiang Zhai, David A. Evans
Title: Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
Title URL: http://dx.doi.org/10.3115/981863.981866
DOI: 10.3115/981863.981866
