2003 IntegratingInfExtrAndAutoHlinking

From GM-RKB

Subject Headings: Information Extraction Task, Automatic Hyperlinking Task, Finite-State Automata, Unification-based Formalism, SProUT System.

Notes

  • It does not use a Machine Learning Approach.
  • http://sprout.dfki.de/
    • SProUT (Shallow Processing with Unification and Typed Feature Structures) is a platform for development of multilingual shallow text processing and information extraction systems.
    • It consists of several reusable Unicode-capable online linguistic processing components for basic linguistic operations ranging from tokenization to coreference matching. Since typed feature structures (TFS) are used as a uniform data structure for representing the input and output by each of these processing resources, they can be flexibly combined into a pipeline that produces several streams of linguistically annotated structures, which serve as an input for the shallow grammar interpreter, applied at the next stage.
    • The grammar formalism in SProUT, called XTDL, is a blend of very efficient finite-state techniques and unification-based formalisms, which are known to guarantee transparency and expressiveness. A grammar in SProUT consists of pattern/action rules, where the left-hand side (LHS) of a rule is a regular expression over TFSs with functional operators and coreferences, representing the recognition pattern, and the right-hand side (RHS) of a rule is a TFS specification of the output structure.
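
The pattern/action idea described above can be illustrated with a minimal Python sketch. This is not XTDL and not SProUT's implementation: feature structures are modeled as plain nested dicts without types, the LHS is a fixed sequence of patterns rather than a full regular expression, and matching is one-way (a simplification of unification). The names `Var`, `match`, `instantiate`, and `apply_rule` are hypothetical and chosen only for this illustration.

```python
from typing import Any, Dict, List, Optional

FS = Dict[str, Any]  # assumption: an untyped feature structure as a nested dict


class Var:
    """A coreference tag that links a value in the LHS to a slot in the RHS."""

    def __init__(self, name: str) -> None:
        self.name = name


def match(pattern: Any, value: Any, bindings: Dict[str, Any]) -> bool:
    """One-way match of a pattern against a token FS, recording coreferences."""
    if isinstance(pattern, Var):
        if pattern.name in bindings:
            # A repeated coreference must agree with its earlier binding.
            return bindings[pattern.name] == value
        bindings[pattern.name] = value
        return True
    if isinstance(pattern, dict):
        if not isinstance(value, dict):
            return False
        # Every feature required by the pattern must be present and match.
        return all(
            key in value and match(sub, value[key], bindings)
            for key, sub in pattern.items()
        )
    return pattern == value


def instantiate(template: Any, bindings: Dict[str, Any]) -> Any:
    """Build the output structure by filling coreferences with bound values."""
    if isinstance(template, Var):
        return bindings[template.name]
    if isinstance(template, dict):
        return {key: instantiate(sub, bindings) for key, sub in template.items()}
    return template


def apply_rule(lhs: List[FS], rhs: FS, tokens: List[FS]) -> Optional[FS]:
    """Match the LHS sequence at the start of tokens; return the RHS or None."""
    if len(tokens) < len(lhs):
        return None
    bindings: Dict[str, Any] = {}
    for pattern, token in zip(lhs, tokens):
        if not match(pattern, token, bindings):
            return None
    return instantiate(rhs, bindings)


# Hypothetical rule: the word "Hotel" followed by a capitalized token.
lhs = [{"surface": "Hotel"}, {"type": "capitalized", "surface": Var("name")}]
rhs = {"type": "hotel", "name": Var("name")}
tokens = [
    {"surface": "Hotel", "type": "capitalized"},
    {"surface": "Adlon", "type": "capitalized"},
]
result = apply_rule(lhs, rhs, tokens)
```

In this sketch the coreference `name` carries the matched surface string from the recognition pattern into the output structure, which is the role coreferences play between the LHS and RHS in the description above.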

Cited By

  • ~6 …

Quotes

Abstract

This paper presents a novel information system integrating advanced information extraction technology and automatic hyper-linking. Extracted entities are mapped into a domain ontology that relates concepts to a selection of hyperlinks. For information extraction, we use SProUT, a generic platform for the development and use of multilingual text processing components. By combining finite-state and unification-based formalisms, the grammar formalism used in SProUT offers both processing efficiency and a high degree of declarativeness. The ExtraLink demo system showcases the extraction of relevant concepts from German texts in the tourism domain, offering the direct connection to associated web documents on demand.

1. Introduction

The utilization of language technology for the creation of hyperlinks has a long history (e.g., Allen et al., 1993). Information extraction (IE) is a technology that can be applied to identifying both sources and targets of new hyperlinks. IE systems are becoming commercially viable in supporting diverse information discovery and management tasks. Similarly, automatic hyperlinking is a maturing technology designed to interrelate pieces of information, using ontologies to define the relationships. With ExtraLink, we present a novel information system that integrates both technologies in order to reach an improved level of informativeness and comfort. Extraction and link generation occur completely in the background. Entities identified by the IE system are mapped into a domain ontology that relates concepts to a structured selection of predefined hyperlinks, which can be directly visualized on demand using a standard web browser. This way, the user can, while reading a text, immediately link up textual information to the Internet or to any other document base without accessing a search engine.

The quality of the link targets is much higher than with standard search engines since, first of all, only domain-specific interpretations are sought, and second, the ontology provides additional structure, including related information.
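
The mapping from extracted entities to ontology concepts and their predefined hyperlinks can be sketched roughly as follows. This is an illustrative assumption, not the ExtraLink implementation: the ontology contents, the `example.org` URLs, and the names `ONTOLOGY`, `links_for`, and `hyperlink_entities` are all hypothetical. Walking up to parent concepts stands in for the "related information" the ontology is said to provide.

```python
from typing import Dict, List, Optional, TypedDict


class Concept(TypedDict):
    parent: Optional[str]  # more general concept, if any
    links: List[str]       # predefined hyperlinks for this concept


# Hypothetical tourism ontology; concepts and URLs are placeholders.
ONTOLOGY: Dict[str, Concept] = {
    "accommodation": {"parent": None, "links": ["https://example.org/accommodation"]},
    "hotel": {"parent": "accommodation", "links": ["https://example.org/hotels"]},
    "museum": {"parent": None, "links": ["https://example.org/museums"]},
}


def links_for(concept: str) -> List[str]:
    """Collect links for a concept, then for each of its ancestor concepts."""
    links: List[str] = []
    current: Optional[str] = concept
    while current is not None:
        node = ONTOLOGY.get(current)
        if node is None:
            break  # unknown concept: no domain-specific links
        links.extend(node["links"])
        current = node["parent"]
    return links


def hyperlink_entities(entities: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each extracted entity's text to the links of its ontology concept."""
    return {entity["text"]: links_for(entity["type"]) for entity in entities}


# Entities as an IE system might return them (hypothetical format).
extracted = [
    {"text": "Hotel Adlon", "type": "hotel"},
    {"text": "Pergamonmuseum", "type": "museum"},
]
linked = hyperlink_entities(extracted)
```

Because only concepts present in the ontology yield links, an entity of an unknown type receives an empty list, which mirrors the restriction to domain-specific interpretations mentioned above.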

ExtraLink uses SProUT as its IE system, a generic multilingual shallow analysis platform that currently provides linguistic processing resources for English, German, Italian, French, Spanish, Czech, Polish, Japanese, and Chinese (Becker et al., 2002). SProUT is used for tokenization, morphological analysis, and named entity recognition in free texts. In Sections 2 to 4, we describe innovative features of SProUT. Section 5 gives details about the ExtraLink demonstrator.

References


  • Stephan Busemann, Witold Drozdzynski, Hans-Ulrich Krieger, Jakub Piskorski, Ulrich Schäfer, Hans Uszkoreit, and Feiyu Xu (2003). "Integrating Information Extraction and Automatic Hyperlinking." In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics. http://acl.ldc.upenn.edu/P/P03/P03-2019.pdf doi:10.3115/1075178.1075195