1996 NounPhraseAnalysisInUnrTextForIR


Subject Headings: Noun Phrase Parsing.



  • Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe a hybrid approach to the extraction of meaningful (continuous or discontinuous) subcompounds from complex noun phrases using both corpus statistics and linguistic heuristics. Results of experiments show that indexing based on such extracted subcompounds improves both recall and precision in an information retrieval system. The noun-phrase analysis techniques are also potentially useful for book indexing and automatic thesaurus extraction.

1.3 Our Work

  • In particular, we can choose to treat some phrasal structures as atomic units and others as additional information about (or representations of) content.

3. Methodology

  • After preprocessing, the system works in two stages: parsing and generation. In the parsing stage, each simplex noun phrase in the corpus is parsed. In the generation stage, the structured noun phrase is used to generate candidates for all four kinds of small compounds, which are further tested for occurrence (validity) in the corpus.
  • Parsing of simplex noun phrases is done in multiple phases.
  • The detection of lexical atoms, like the parsing of simplex noun phrases, is also done in multiple phases. At each phase, only two adjacent units are considered. So, initially, only two-word lexical atoms can be detected. But, once a pair is determined to be a lexical atom, it will behave exactly like a single word in subsequent processing, so, in later phases, atoms with more than two words can be detected.
  • The idea of association-based parsing is that by grouping words together (based on association) many times, we will eventually discover the most restrictive (and informative) structure of a noun phrase. For example, if we have evidence from the corpus that "high performance" is a more reliable association and "general purpose" a less reliable one, then the noun phrase "general purpose high performance computer" (an actual example from the CACM corpus) would undergo the following grouping process:
    • general purpose high performance computer =>
      • general purpose [high=performance] computer =>
        • [general=purpose] [high=performance] computer =>
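
The grouping steps listed above can be illustrated with a minimal sketch. It is not the paper's implementation: the association scores, the threshold, and the names `group_once` and `group_phrase` are hypothetical, and the sketch uses only a lookup table of pair scores, whereas the paper also applies linguistic heuristics and locality scoring. At each phase only adjacent units are compared, the strongest pair is merged, and the merged pair then behaves like a single unit.

```python
from typing import Dict, List, Tuple

# Hypothetical association scores for adjacent units; unlisted pairs score 0.0.
Scores = Dict[Tuple[str, str], float]


def group_once(units: List[str], scores: Scores, threshold: float) -> List[str]:
    """Merge the best-scoring adjacent pair, or return the units unchanged."""
    best_i, best_score = -1, threshold
    for i in range(len(units) - 1):
        s = scores.get((units[i], units[i + 1]), 0.0)
        if s > best_score:  # must strictly exceed the threshold; first maximum wins ties
            best_i, best_score = i, s
    if best_i < 0:
        return units
    # The merged pair is rendered as one bracketed unit and treated like a word.
    merged = f"[{units[best_i]}={units[best_i + 1]}]"
    return units[:best_i] + [merged] + units[best_i + 2:]


def group_phrase(words: List[str], scores: Scores,
                 threshold: float = 0.5) -> List[List[str]]:
    """Repeat single-pair grouping until no pair qualifies; return every state."""
    trace = [list(words)]
    while len(trace[-1]) > 1:
        nxt = group_once(trace[-1], scores, threshold)
        if nxt == trace[-1]:
            break
        trace.append(nxt)
    return trace


if __name__ == "__main__":
    # Assumed scores: "high performance" is the more reliable association.
    demo_scores = {("high", "performance"): 0.9, ("general", "purpose"): 0.6}
    phrase = "general purpose high performance computer".split()
    for state in group_phrase(phrase, demo_scores):
        print(" ".join(state))
```

With these assumed scores the sketch stops after the three states shown in the list, because no score is supplied for pairs that involve an already merged unit. Any further grouping would depend on how such units are scored, which this sketch leaves open.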

6. Conclusions

  • The notion of association-based parsing dates at least from (Marcus, 1980) and has been explored again recently by a number of researchers. The method we have developed differs from previous work in that it uses linguistic heuristics and locality scoring along with corpus statistics to generate phrase associations.
  • The experiment contrasting the PES with baseline processing in a commercial IR system demonstrates a direct, positive effect of the use of lexical atoms, subphrases, and other phrase associations across simplex NPs. We believe the use of NP-substructure analysis can lead to more effective information management, including more precise IR, text summarization, and concept clustering. Our future work will explore such applications of the techniques we have described in this paper.

Author: David A. Evans and ChengXiang Zhai
Title: Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
Title URL: http://dx.doi.org/10.3115/981863.981866
DOI: 10.3115/981863.981866
Year: 1996