2009 LearningLightweightOntologiesFromTxt

From GM-RKB

Subject Headings:

Notes

Quotes

Abstract

  • The ability to provide abstractions of documents in the form of important concepts and their relations is a key asset, not only for bootstrapping the Semantic Web, but also for relieving us from the pressure of information overload. At present, the only viable solution for arriving at these abstractions is manual curation. In this research, ontology learning techniques are used to automatically discover terms, concepts and relations from text in documents.
  • Ontology learning techniques rely on extensive background knowledge, ranging from unstructured data such as text corpora, to structured data such as a semantic lexicon. Manually-curated background knowledge is a scarce resource for many domains and languages, and the effort and cost required to keep the resource up to date are often high. More importantly, the size and coverage of manually-curated background knowledge are often inadequate to meet the requirements of most ontology learning techniques. This thesis investigates the use of the Web as the sole source of dynamic background knowledge across all phases of ontology learning for constructing term clouds (i.e. visual depictions of terms) and lightweight ontologies from documents. To appreciate the significance of term clouds and lightweight ontologies, a system for ontology-assisted document skimming and scanning is developed.
  • This thesis presents a novel ontology learning approach that is devoid of any manually-curated resources, and is applicable across a wide range of domains (the current focus is medicine, technology and economics). More specifically, this research proposes and develops a set of novel techniques that take advantage of Web data to address the following problems: (1) the absence of integrated techniques for cleaning noisy data; (2) the inability of current term extraction techniques to systematically explicate, diversify and consolidate their evidence; (3) the inability of current corpus construction techniques to automatically create very large, high-quality text corpora using a small number of seed terms; and (4) the difficulty of locating and preparing features for clustering and extracting relations.
  • This dissertation is organised as a series of published papers that contribute to a complete and coherent theme. The work on the individual techniques of the proposed ontology learning approach has resulted in a total of nineteen published articles: two book chapters, four journal articles, and thirteen refereed conference papers. The proposed approach consists of several major contributions to each task in ontology learning. These include (1) a technique for simultaneously correcting noise such as spelling errors, unexpanded abbreviations and improper casing in text; (2) a novel probabilistic measure for recognising multi-word phrases; (3) a probabilistic framework for recognising domain-relevant terms using formal word distribution models; (4) a novel technique for constructing very large, high-quality text corpora using only a small number of seed terms; and (5) novel techniques for clustering terms and discovering coarse-grained semantic relations using featureless similarity measures and dynamic Web data. In addition, a comprehensive review is included to provide background on ontology learning and recent advances in this area. The implementation details of the proposed techniques are provided at the end, together with a description of how the system is used to automatically discover term clouds and lightweight ontologies for document skimming and scanning.

4. Text Processing

4.1 Introduction

  • Automatic term recognition is the process of extracting stable lexical units from text and filtering them for the purpose of identifying terms which characterise certain domains of interest. This process involves the determination of unithood and termhood. Unithood, which is the focus of this paper, is concerned with whether or not word sequences can form stable lexical units. In particular, stable noun phrases are considered as likelier terms and are favoured in term recognition. …
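
As a rough illustration of what a unithood score might look like, the sketch below scores an adjacent word pair by pointwise mutual information (PMI) over a tokenised text. This is a generic association measure swapped in for illustration only; it is not the probabilistic measure proposed in the thesis, and the function name `pmi_unithood` and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_unithood(tokens, w1, w2):
    """Return the PMI (base 2) of the adjacent pair (w1, w2) in tokens.

    Assumption: PMI is used here as a stand-in unithood score; the
    thesis proposes its own measure, which is not reproduced here.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    joint = bigrams[(w1, w2)] / n_bigrams
    if joint == 0:
        # The pair never occurs adjacently, so there is no evidence of a unit.
        return float("-inf")

    p1 = unigrams[w1] / n_unigrams
    p2 = unigrams[w2] / n_unigrams
    return math.log2(joint / (p1 * p2))


# Toy input (hypothetical): a higher score suggests a more stable unit.
tokens = "the gene expression data show gene expression in the cell".split()
stable = pmi_unithood(tokens, "gene", "expression")
loose = pmi_unithood(tokens, "the", "gene")
```

On this toy input the pair "gene expression" scores higher than "the gene". A termhood step would then filter such candidate units for domain relevance.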


Reference: 2009 LearningLightweightOntologiesFromTxt
Author: Wilson Wong
Title: Learning Lightweight Ontologies from Text across Different Domains using the Web as Background Knowledge
Title URL: http://explorer.csse.uwa.edu.au/reference/browse paper load.php?id=233282062
Year: 2009