2007 UsingANatLangUndSysToGenSemWebCont

From GM-RKB

Subject Headings: OntoSem System; information extraction; natural language processing; OWL; RDF; Semantic Web.

Notes

Cited By

Quotes

Abstract

  • We describe our research on automatically generating rich semantic annotations of text and making them available on the Semantic Web. In particular, we discuss the challenges involved in adapting the OntoSem natural language processing system for this purpose. OntoSem, an implementation of the theory of ontological semantics under continuous development for over 15 years, uses a specially constructed NLP-oriented ontology and an ontological-semantic lexicon to translate English text into a custom ontology-motivated knowledge representation language, the language of text meaning representations (TMRs). OntoSem concentrates on a variety of ambiguity resolution tasks as well as processing unexpected input and reference. To adapt OntoSem’s representation to the Semantic Web, we developed a translation system, OntoSem2OWL, from the TMR language into the Semantic Web language OWL. We then used OntoSem and OntoSem2OWL to support SemNews, an experimental Web service that monitors RSS news sources, processes the summaries of the news stories, and publishes a structured representation of the meaning of the text in each news story.
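The abstract describes a three-stage pipeline: monitor RSS feeds, analyze each story summary with OntoSem, and publish the resulting meaning representation on the Semantic Web via OntoSem2OWL. The following Python sketch is a hypothetical illustration of that flow, not the paper's implementation: it assumes the feedparser and rdflib libraries, an example.org namespace, and an analyze_summary() stub that stands in for the OntoSem plus OntoSem2OWL chain, which the paper does not expose as a callable API.

    # Hypothetical sketch of a SemNews-style pipeline: poll an RSS feed,
    # run each story summary through a placeholder semantic analyzer,
    # and publish the resulting statements as RDF.
    import feedparser
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    NEWS = Namespace("http://example.org/semnews#")   # assumed namespace

    def analyze_summary(story, text):
        """Placeholder for OntoSem analysis + OntoSem2OWL translation.

        Returns (subject, predicate, object) triples; a real system would
        emit OWL individuals derived from the text meaning representation.
        """
        return [(story, NEWS.mentions, Literal(text[:40]))]

    def process_feed(feed_url):
        graph = Graph()
        graph.bind("news", NEWS)
        for entry in feedparser.parse(feed_url).entries:
            story = URIRef(entry.link)
            graph.add((story, RDF.type, NEWS.NewsStory))
            graph.add((story, NEWS.title, Literal(entry.title)))
            for triple in analyze_summary(story, getattr(entry, "summary", "")):
                graph.add(triple)
        return graph.serialize(format="xml")  # RDF/XML for publication

    if __name__ == "__main__":
        print(process_feed("http://example.org/feed.rss"))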

Related Work

  • The general problem of automatically generating and adding semantic annotations to text has been the focus of research for many years. Most of the work has not used the Semantic Web languages for encoding these annotations. We briefly describe some of the work here and point out some similarities and differences with our own.
  • Gildea and Jurafsky (2002) created a stochastic system that labels case roles of predicates with either abstract (e.g., AGENT, THEME) or domain-specific (e.g., MESSAGE, TOPIC) roles. The system was trained on 50,000 words of hand-annotated text produced by the FrameNet project (Baker, Fillmore & Lowe, 1998). When tasked to segment constituents and identify their semantic roles (with fillers being un-disambiguated textual strings, not machine-tractable instances of ontological concepts, as in OntoSem), the system scored in the 60s in precision and recall. Limitations of the system include its reliance on hand-annotated data and on prior knowledge of the predicate frame type (i.e., it lacks the capacity to disambiguate productively). Semantics in this project is limited to case roles.
  • The Interlingual Annotation of Multilingual Text Corpora project (Farwell et al., 2004) had as its goal the creation of a syntactic and semantic annotation representation methodology, and tested it out on seven languages (English, Spanish, French, Arabic, Japanese, Korean, and Hindi). The semantic representation, however, is restricted to those aspects of syntax and semantics that developers believe can be consistently handled well by hand annotators for many languages. The current stage of development includes only syntax and a limited semantics — essentially, thematic roles.
  • In the ACE project, annotators carry out manual semantic annotation of texts in English, Chinese, and Arabic to create training and test data for research task evaluations. The downside of this effort is that the inventory of semantic entities, relations, and events is very small, and, therefore, the resulting semantic representations are coarse-grained; for example, there are only five event types. The project description promises more fine-grained descriptors and relations among events in the future. Another response to the clear insufficiency of syntax-only tagging is offered by the developers of PropBank, the Penn Treebank semantic extension (Kingsbury et al., 2002), who report:
    • It was agreed that the highest priority, and the most feasible type of semantic annotation, is coreference and predicate argument structure for verbs, participial modifiers and nominalizations, and this is what is included in PropBank.
  • Recently, there has been interest in applying information extraction techniques to text to produce annotations for the Semantic Web. However, few systems capable of deeper semantic analysis have been applied to Semantic Web-related tasks. Information extraction tools work best when the types of objects that need to be identified are clearly defined; for example, the objective in MUC (Grishman & Sundheim, 1996) was to find the various named entities in text. Using OntoSem, we aim not only to provide such information, but also to convert the text meaning representation of natural language sentences into Semantic Web representations.
  • A project closely related to our work was an effort to map the Mikrokosmos knowledge base to OWL (Beltran-Ferruz, Gonzalez-Caler & Gervas, 2004a; 2004b). Mikrokosmos (Beale, Nirenburg & Mahesh, 1995) is a precursor to OntoSem and was developed with the intent of using it as an interlingua in machine translation-related work. This project developed some basic mapping functions that can create the class hierarchy and specify the properties and their respective domains and ranges. In our system, we describe how facets and numeric attribute ranges can be handled and, more importantly, we describe a technique for translating sentences from their text meaning representation to the corresponding OWL representation (see the sketch after this list), thereby providing semantically marked-up natural language text for use by other agents. Another translation effort involving Mikrokosmos produced the Omega Ontology (Philpot, Hovy & Pantel, 2005) by merging the content of Mikrokosmos with WordNet and additional information sources.
  • Dameron, Rubin and Musen (2005) describe an approach to representing the Foundational Model of Anatomy (FMA) in OWL. FMA is a large ontology of the human anatomy and is represented in a frame-based knowledge representation language. Some of the challenges faced were the lack of equivalent OWL representations for some frame-based constructs, as well as scalability and computational issues with the current reasoners.
  • Schlangen, Stede and Bontas (2004) describe a system that combines a natural language processing system with Semantic Web technologies to support the content-based storage and retrieval of medical pathology reports. The language component was augmented with background knowledge consisting of a domain ontology represented in OWL. The result supported the extraction of domain-specific information from natural language reports, which was then mapped back into a Semantic Web representation.
  • TAP (Guha & McCool, 2003) is an open source project led by Stanford University and IBM Research aimed at populating the Semantic Web with information by providing tools that make the Web a giant distributed database. TAP provides a set of protocols and conventions that create a coherent whole out of independently produced bits of information, and a simple API to navigate the graph. Local, independently managed knowledge bases can be aggregated to form selected centers of knowledge useful for particular applications.
  • Krueger, Nilsson, Oates and Finin (2004) developed an application that learned to extract information from talk announcements from training data using an algorithm based on Stalker (Muslea, Minton & Knoblock, 2001). The extracted information was then encoded as markup in the Semantic Web language DAML+OIL, a precursor to OWL. The results were used as part of the ITTALKS system (Cost et al., 2002).
  • The Haystack Project has developed a system (Hogue & Karger, 2005) that enables users to train their browsers to extract Semantic Web content from HTML documents on the Web. Users provide examples of semantic content by highlighting them in their browser and then describing their meaning. Generalized wrappers are then constructed to extract information and encode the results in RDF. The goal is to let individual users generate Semantic Web content of interest to them from text on Web pages. More recently, the project has developed a Firefox plug-in, Solvent, that can be used to write screen scrapers to produce RDF data from Web pages.
  • The On-to-Knowledge project (Fensel, Harmelen & Akkermans, 2000) provides an ontology-based system for knowledge management. It uses the Ontology Inference Layer (OIL), which supports description logics (DL) and frame-based systems over the Web. OWL itself is an extension derived from OIL and DAML. The OntoExtract and OntoWrapper sub-systems in On-to-Knowledge were responsible for processing unstructured and structured text. These systems were used to automatically extract ontologies and express them in Semantic Web representations. At the heart of OntoExtract is a natural language processing system that processes text to perform lexical and semantic analysis. Finally, concepts found in free text are represented as an ontology.
  • The Cyc project has developed a very large knowledge base of common sense facts and reasoning capabilities. Recent efforts (Witbrock et al., 2004) include the development of tools for automatically annotating documents and exporting the knowledge in OWL. The authors also highlight the difficulties in exporting an expressive representation like CycL into OWL due to lack of equivalent constructs. Finally, we mention the KIM platform (Kiryakov, Popov, Terziev, Manov & Ognyanoff, 2004) for automatic semantic annotation, indexing, and retrieval of documents. This system uses the GATE (Cunningham, 2002) language engineering system backed by structured ontologies in OWL to produce annotations.
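As noted in the Mikrokosmos-to-OWL item above, this line of work translates frame structure (class hierarchy, properties with domains and ranges, facets, numeric attribute ranges) into OWL. The sketch below is a hypothetical illustration of that kind of mapping using rdflib; the TEACH/AGENT/DURATION frame, the example.org namespace, and the minValue/maxValue annotation properties are invented for illustration and are not taken from OntoSem2OWL itself.

    # Hypothetical sketch of a frame-to-OWL mapping: a concept frame with an
    # IS-A parent, a relation slot, and a numeric attribute range becomes an
    # OWL class with property restrictions.
    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, RDFS, XSD

    ONT = Namespace("http://example.org/ontosem#")  # assumed namespace

    frame = {                      # toy frame in the style described above
        "name": "TEACH",
        "is-a": "COMMUNICATIVE-EVENT",
        "slots": {
            "AGENT": {"sem": "HUMAN"},              # relation with a 'sem' facet
            "DURATION": {"range": (0.5, 8.0)},      # numeric attribute range
        },
    }

    g = Graph()
    g.bind("ont", ONT)

    cls = ONT[frame["name"]]
    g.add((cls, RDF.type, OWL.Class))
    g.add((cls, RDFS.subClassOf, ONT[frame["is-a"]]))   # IS-A -> rdfs:subClassOf

    for slot, filler in frame["slots"].items():
        prop = ONT[slot]
        if "sem" in filler:
            # Relation slot: object property constrained by an allValuesFrom restriction.
            g.add((prop, RDF.type, OWL.ObjectProperty))
            restriction = BNode()
            g.add((restriction, RDF.type, OWL.Restriction))
            g.add((restriction, OWL.onProperty, prop))
            g.add((restriction, OWL.allValuesFrom, ONT[filler["sem"]]))
            g.add((cls, RDFS.subClassOf, restriction))
        else:
            # Numeric attribute: datatype property; the range facet is kept as
            # simple annotations, since OWL as of 2007 had no direct construct
            # for numeric intervals.
            low, high = filler["range"]
            g.add((prop, RDF.type, OWL.DatatypeProperty))
            g.add((prop, RDFS.domain, cls))
            g.add((prop, RDFS.range, XSD.float))
            g.add((prop, ONT.minValue, Literal(low)))
            g.add((prop, ONT.maxValue, Literal(high)))

    print(g.serialize(format="turtle"))

In this sketch a relation slot with a sem facet becomes an owl:allValuesFrom restriction on the class, while a numeric range is kept as plain annotation triples; both choices are illustrative assumptions rather than the paper's stated mapping rules.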

OntoSem

  • Ontological Semantics (OntoSem) is a theory of meaning in natural language text (Nirenburg & Raskin, 2001). The OntoSem environment is a rich and extensive tool for extracting and representing meaning in a language-independent way. The OntoSem system is used for a number of applications such as machine translation, question answering, information extraction, and language generation. It is supported by a constructed world model, that is, a structured model of the classes of objects, properties, relations, and constraints that might be described in text, encoded as a rich ontology. The ontology is represented as a directed acyclic graph using IS-A relations. It contains about 8,000 concepts, with an average of 16 properties per concept. At the topmost level the concepts are: OBJECT, EVENT, and PROPERTY.
  • The OntoSem ontology is expressed in a frame-based representation in which each frame corresponds to a concept. The concepts are defined using a collection of slots and can be linked to one another using IS-A relations (see the sketch below).
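To make the frame-and-slot picture above concrete, here is a minimal, hypothetical Python sketch of the structure described: concepts as frames with named slots, connected into a directed acyclic graph by IS-A links. All concept and slot names below, other than the top-level OBJECT, EVENT, and PROPERTY, are illustrative and not drawn from the actual 8,000-concept ontology.

    # Minimal sketch of a frame-based ontology: each concept is a frame with
    # named slots, and frames form a DAG via IS-A links.
    from dataclasses import dataclass, field

    @dataclass
    class Concept:
        name: str
        is_a: list = field(default_factory=list)   # parent concepts (DAG edges)
        slots: dict = field(default_factory=dict)  # property name -> filler/constraint

    ontology = {}

    def add_concept(name, is_a=(), **slots):
        ontology[name] = Concept(name, list(is_a), dict(slots))

    # Top-level concepts named in the text, plus an illustrative subtree.
    add_concept("OBJECT")
    add_concept("EVENT")
    add_concept("PROPERTY")
    add_concept("COMMUNICATIVE-EVENT", is_a=["EVENT"], AGENT="HUMAN")
    add_concept("TEACH", is_a=["COMMUNICATIVE-EVENT"], THEME="KNOWLEDGE")

    def ancestors(name):
        """All concepts reachable by following IS-A links upward."""
        seen, stack = set(), list(ontology[name].is_a)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(ontology[parent].is_a)
        return seen

    print(ancestors("TEACH"))   # {'COMMUNICATIVE-EVENT', 'EVENT'}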

References


Akshay Java, Sergei Nirenburg, Marjorie McShane, Timothy Finin, Jesse English, and Anupam Joshi. (2007). "Using a Natural Language Understanding System to Generate Semantic Web Content." http://ilit.umbc.edu/MargePub/AkshayArticle.pdf