2013 OntoLearnReloadedAGraphBasedAlg

From GM-RKB

Subject Headings: Ontolearn; Ontology Learning System.

Notes

Cited By

Quotes

Abstract

In 2004 we published in this journal an article describing OntoLearn, one of the first systems to automatically induce a taxonomy from documents and Web sites. Since then, OntoLearn has continued to be an active area of research in our group and has become a reference work within the community. In this paper we describe our next-generation taxonomy learning methodology, which we name OntoLearn Reloaded. Unlike many taxonomy learning approaches in the literature, our novel algorithm learns both concepts and relations entirely from scratch via the automated extraction of terms, definitions, and hypernyms. This results in a very dense, cyclic, and potentially disconnected hypernym graph. The algorithm then induces a taxonomy from this graph via optimal branching and a novel weighting policy. Our experiments show that we obtain high-quality results, both when building brand-new taxonomies and when reconstructing sub-hierarchies of existing taxonomies.

1. Introduction

Ontologies have proven useful for different applications, such as heterogeneous data integration, information search and retrieval, question answering, and, in general, for fostering interoperability between systems. Ontologies can be classified into three main types (Sowa 2000), namely: i) formal ontologies, that is, conceptualizations whose categories are distinguished by axioms and formal definitions, stated in logic to support complex inferences and computations; ii) prototype-based ontologies, which are based on typical instances or prototypes rather than axioms and definitions in logic; iii) lexicalized (or terminological) ontologies, which are specified by subtype-supertype relations and describe concepts by labels or synonyms rather than by prototypical instances.

Here we focus on lexicalized ontologies because, in order to enable natural language applications such as semantically enhanced information retrieval and question answering, we need a clear connection between our formal representation of the domain and the language used to express domain meanings within text. And, in turn, this connection can be established by producing full-fledged lexicalized ontologies for the domain of interest. Manually constructing ontologies is a very demanding task, however, requiring a large amount of time and effort, even when principled solutions are used (De Nicola, Missikoff, and Navigli 2009). A quite recent challenge, referred to as ontology learning, consists of automatically or semi-automatically creating a lexicalized ontology using textual data from corpora or the Web (Gomez-Perez and Manzano-Mancho 2003; Biemann 2005; Maedche and Staab 2009; Petasis et al. 2011). As a result of ontology learning, the heavy requirements of manual ontology construction can be drastically reduced.

In this paper we deal with the problem of learning a taxonomy (i.e., the backbone of an ontology) entirely from scratch. Very few systems in the literature address this task. OntoLearn (Navigli and Velardi 2004) was one of the earliest contributions in this area. In OntoLearn, taxonomy learning was accomplished in four steps: terminology extraction, derivation of term sub-trees via string inclusion, disambiguation of domain terms using a novel Word Sense Disambiguation algorithm, and combination of the sub-trees into a taxonomy. The use of a static, general-purpose repository of semantic knowledge, namely, WordNet (Miller et al. 1990; Fellbaum 1998), prevented the system from learning taxonomies in technical domains, however.

In this paper we present OntoLearn Reloaded, a graph-based algorithm for learning a taxonomy from the ground up. OntoLearn Reloaded preserves the initial step of our 2004 pioneering work (Navigli and Velardi 2004), that is, automated terminology extraction from a domain corpus, but it drops the requirement for WordNet (thereby avoiding dependence on the English language). It also drops the term compositionality assumption that previously required the use of a Word Sense Disambiguation algorithm, namely, SSI (Navigli and Velardi 2005), to structure the taxonomy. Instead, we now exploit textual definitions, extracted from a corpus and the Web in an iterative fashion, to automatically create a highly dense, cyclic, potentially disconnected hypernym graph. An optimal branching algorithm is then used to induce a full-fledged tree-like taxonomy. Further graph-based processing augments the taxonomy with additional hypernyms, thus producing a Directed Acyclic Graph (DAG).

Our system provides a considerable advancement over the state of the art in taxonomy learning:

  • First, except for the manual selection of just a few upper nodes, this is the first algorithm that has been experimentally shown to build a new taxonomy from scratch (i.e., both concepts and hypernym relations) for arbitrary domains, including very technical ones for which gold-standard taxonomies do not exist.
  • Second, we tackle the problem with no simplifying assumptions: We cope with issues such as term ambiguity, complexity of hypernymy patterns, and multiple hypernyms.
  • Third, we propose a novel algorithm to extract an optimal branching from the resulting hypernym graph, which, after some recovery steps, becomes our final taxonomy. Taxonomy induction is the main theoretical contribution of the paper.
  • Fourth, the evaluation is not limited, as it is in most papers, to the number of retrieved hypernymy relations that are found in a reference taxonomy. Instead, we also analyze the extracted taxonomy in its entirety; furthermore, we acquire two “brand new” taxonomies in the domains of Artificial Intelligence and Finance.
  • Finally, our taxonomy-building workflow is fully implemented, and the software components are either freely available from our Web site [1] or reproducible.

In this paper we extend our recent work on the topic (Navigli, Velardi, and Faralli 2011) as follows: i) we describe in full detail the taxonomy induction algorithm; ii) we enhance our methodology with a final step aimed at creating a DAG, rather than a strict tree-like taxonomical structure; iii) we perform a large-scale multi-faceted evaluation of the taxonomy learning algorithm on six domains; and iv) we contribute a novel methodology for evaluating an automatically learned taxonomy against a reference gold standard.

In Section 2 we illustrate the related work. We then describe our taxonomy-induction algorithm in Section 3. In Section 4 we present our experiments and discuss the results. Evaluation is both qualitative (on new Artificial Intelligence and Finance taxonomies) and quantitative (on WordNet and MeSH sub-hierarchies). Section 5 is dedicated to concluding remarks.

2. Related Work

Two main approaches are used to learn an ontology from text: rule-based and distributional approaches. Rule-based approaches use predefined rules or heuristic patterns to extract terms and relations. These approaches are typically based on lexico-syntactic patterns, first introduced by Hearst (1992). Instances of relations are harvested from text by applying patterns aimed at capturing a certain type of relation (e.g., X is a kind of Y). Such lexico-syntactic patterns can be defined manually (Berland and Charniak 1999; Kozareva, Riloff, and Hovy 2008) or obtained by means of bootstrapping techniques (Girju, Badulescu, and Moldovan 2006; Pantel and Pennacchiotti 2006). In the latter case, a number of term pairs in the wanted relation are manually picked and the relation is sought within text corpora or the Web. Other rule-based approaches learn a taxonomy by applying heuristics to collaborative resources such as Wikipedia (Suchanek, Kasneci, and Weikum 2008; Ponzetto and Strube 2011), sometimes with the aid of computational lexicons such as WordNet (Ponzetto and Navigli 2009).
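To make the mechanics concrete, here is a minimal, self-contained sketch that matches two Hearst-style patterns with plain regular expressions. The patterns, the crude one-or-two-token limit on terms, and the example sentences are our own illustrative assumptions, not the implementation of any system cited here (real extractors typically match over part-of-speech-tagged noun-phrase chunks):

  import re

  # Two Hearst-style lexico-syntactic patterns over raw, lowercased text.
  # Each entry: (compiled pattern, hypernym group index, hyponym group index).
  # Terms are crudely limited to one or two tokens here; a real system
  # would match over noun-phrase chunks instead.
  PATTERNS = [
      (re.compile(r"(\w+(?: \w+)?) such as (\w+(?: \w+)?)"), 1, 2),
      (re.compile(r"(\w+(?: \w+)?) is a kind of (\w+(?: \w+)?)"), 2, 1),
  ]

  def extract_isa_pairs(sentence):
      """Return (hyponym, hypernym) pairs matched in a sentence."""
      pairs = []
      for pattern, hyper, hypo in PATTERNS:
          for match in pattern.finditer(sentence.lower()):
              pairs.append((match.group(hypo), match.group(hyper)))
      return pairs

  print(extract_isa_pairs("Classifiers such as decision trees handle noise."))
  # -> [('decision trees', 'classifiers')]
  print(extract_isa_pairs("Backpropagation is a kind of learning algorithm."))
  # -> [('backpropagation', 'learning algorithm')]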

Distributional approaches, instead, model ontology learning as a clustering or classification task, and draw primarily on the notions of distributional similarity (Pado and Lapata 2007; Cohen and Widdows 2009), clustering of formalized statements (Poon and Domingos 2010), or hierarchical random graphs (Fountain and Lapata 2012). Such approaches are based on the assumption that paradigmatically related concepts [1] appear in similar contexts. Their main advantage is that they are able to discover relations that do not explicitly appear in the text. They are typically less accurate, however, and the choice of feature types, notion of context, and similarity metric varies considerably depending on the specific approach used.
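The underlying intuition can be made concrete with a toy sketch: represent each term by the bag of words surrounding its mentions and compare terms with cosine similarity. The corpus, the window size, and the similarity measure below are arbitrary choices made for illustration, not those of any specific system cited above:

  from collections import Counter
  from math import sqrt

  def context_vector(term, sentences, window=3):
      """Bag-of-words vector of tokens co-occurring with `term`
      within +/- `window` positions (a toy notion of context)."""
      vec = Counter()
      for sentence in sentences:
          tokens = sentence.lower().split()
          for i, token in enumerate(tokens):
              if token == term:
                  lo, hi = max(0, i - window), i + window + 1
                  vec.update(t for t in tokens[lo:hi] if t != term)
      return vec

  def cosine(u, v):
      """Cosine similarity between two sparse count vectors."""
      dot = sum(u[w] * v[w] for w in u if w in v)
      norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
      return dot / norm if norm else 0.0

  # Toy corpus: terms of the same kind tend to share contexts.
  corpus = [
      "the parser analyzes the sentence structure",
      "the tagger analyzes the sentence tokens",
      "the investor bought shares on the market",
  ]
  parser = context_vector("parser", corpus)
  print(cosine(parser, context_vector("tagger", corpus)))    # high (~1.0)
  print(cosine(parser, context_vector("investor", corpus)))  # low  (~0.4)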

Recently, Yang and Callan (2009) presented a semi-supervised taxonomy induction framework that integrates contextual, co-occurrence, and syntactic-dependency features, lexico-syntactic patterns, and other features to learn an ontology metric, calculated in terms of the semantic distance for each pair of terms in a taxonomy. Terms are incrementally clustered on the basis of their ontology metric scores. The authors assume that the set of ontological concepts C is known; taxonomy learning is therefore limited to finding relations between given pairs in C. In the experiments, they only use the word senses within a particular WordNet sub-hierarchy so as to avoid any lexical ambiguity. Their best experiment obtains a precision of 0.85 and a recall of 0.32 in replicating is-a links on 12 focused WordNet sub-hierarchies, such as People, Building, Place, Milk, and Meal.

Snow, Jurafsky, and Ng (2006) propose the incremental construction of taxonomies using a probabilistic model. In their work they combine evidence from multiple supervised classifiers trained on very large training data sets of hyponymy and cousin relations. Given the body of evidence obtained from all the relevant word pairs in a lexico-syntactic relation, the taxonomy learning task is defined probabilistically as the problem of finding the taxonomy that maximizes the probability of having that evidence (a supervised logistic regression model is used for this). Rather than learning a new taxonomy from scratch, however, this approach aims at attaching new concepts under the appropriate nodes of an existing taxonomy (i.e., WordNet). The approach is evaluated by manually assessing the quality of the single hypernymy edges connecting leaf concepts to existing ones in WordNet, with no evaluation of a full-fledged structured taxonomy and no restriction to a specific domain. A related, weakly supervised approach aimed at categorizing named entities, and attaching them to WordNet leaves, was proposed by Pasca (2004). Other approaches use formal concept analysis (Cimiano, Hotho, and Staab 2005), probabilistic and information-theoretic measures to learn taxonomies from a folksonomy (Tang et al. 2009), and Markov logic networks and syntactic parsing applied to domain text (Poon and Domingos 2010).

The work closest to ours is that presented by Kozareva and Hovy (2010). Starting from an initial given set of root concepts and basic level terms, the authors first use Hearst-like lexico-syntactic patterns iteratively to harvest new terms from the Web, obtaining a set of hyponym-hypernym relations. Next, in order to induce taxonomic relations between intermediate concepts, the Web is searched again with surface patterns. Finally, nodes of the resulting graph are removed if their out-degree is below a threshold, and edges are pruned by removing cycles and selecting the longest path in the case of multiple paths between concept pairs. Kozareva and Hovy's method has some limitations, which we discuss later in this paper. Here we note that, in evaluating their methodology, the authors discard any retrieved nodes not belonging to a WordNet sub-hierarchy (they experiment on Plants, Vehicles, and Animals); thus the evaluation reduces to Yang and Callan's (2009) setting of finding relations between a pre-assigned set of nodes.
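Their pruning stage can be approximated in a few lines over a toy hypernym graph. This is a simplified reading of the procedure just described, not their code; the node names, the degree threshold, and the pruning order are our own assumptions:

  import networkx as nx

  # Toy harvested hypernym graph: an edge (x, y) reads "x is-a y".
  G = nx.DiGraph([
      ("dog", "canine"), ("canine", "mammal"), ("dog", "mammal"),
      ("mammal", "animal"), ("canine", "animal"),
      ("mammal", "dog"),       # a spurious cycle, as harvested from the Web
      ("mammal", "category"),  # a dead-end node with no outgoing edges
  ])
  ROOT = "animal"

  # 1) Remove non-root nodes whose out-degree falls below a threshold (1 here).
  G.remove_nodes_from([n for n in list(G) if n != ROOT and G.out_degree(n) < 1])

  # 2) Break cycles, removing one edge of each cycle found.
  while not nx.is_directed_acyclic_graph(G):
      cycle = nx.find_cycle(G)
      G.remove_edge(*cycle[-1][:2])

  # 3) Where several paths connect two nodes, keep the longest: drop a direct
  #    edge (x, y) whenever a longer x -> y path survives without it.
  for x, y in list(G.edges()):
      G.remove_edge(x, y)
      if not nx.has_path(G, x, y):
          G.add_edge(x, y)  # the direct edge was the only path: restore it

  print(sorted(G.edges()))
  # -> [('canine', 'mammal'), ('dog', 'canine'), ('mammal', 'animal')]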

In practice, none of the algorithms described in the literature has actually been applied to the task of creating a new taxonomy for an arbitrary domain of interest truly from scratch. Instead, what is typically measured is the ability of a system to reproduce, as far as possible, the relations of an already existing taxonomy (a common test is WordNet or the Open Directory Project [2]), when given the set of domain concepts. Evaluating against a gold standard is, indeed, a reasonable validation methodology. However, the claim to be “automatically building” a taxonomy also needs to be demonstrated on new domains for which no a priori knowledge is available. In an unknown domain, taxonomy induction requires the solution of several further problems, such as identifying domain-appropriate concepts, extracting appropriate hypernym relations, and detecting lexical ambiguity, whereas some of these problems can be ignored when evaluating against a gold standard (we will return to this issue in detail in Section 4). In fact, the predecessor of OntoLearn Reloaded, that is, OntoLearn (Navigli and Velardi 2004), suffers from a similar problem, in that it relies on the WordNet taxonomy to establish paradigmatic connections between concepts.

  1. Because we are concerned with lexical taxonomies, in this paper we use the words concepts and terms interchangeably.
  2. http://www.dmoz.org/.

3. The Taxonomy Learning Workflow

OntoLearn Reloaded starts from an initially empty directed graph and a corpus for the domain of interest (e.g., an archive of artificial intelligence papers). We also assume that a small set of upper terms (entity, abstraction, etc.), which we take as the end points of our algorithm, has been manually defined (e.g., from a general-purpose taxonomy like WordNet) or is available for the domain [4].

  4. Although very few domain taxonomies are available, upper (core) concepts have been defined in several domains, such as Medicine, Art, and Economy.

Our taxonomy-learning workflow, summarized in Figure 1, consists of five steps:

1. Initial Terminology Extraction (Section 3.1): The first step applies a term extraction algorithm to the input domain corpus in order to produce an initial domain terminology as output.

2. Definition & Hypernym Extraction (Section 3.2): Candidate definition sentences are then sought for the extracted domain terminology. For each term t, a domain-independent classifier is used to select well-formed definitions from the candidate sentences and extract the corresponding hypernyms of t.

3. Domain Filtering (Section 3.3): A domain filtering technique is applied to filter out those definitions that do not pertain to the domain of interest. The resulting domain definitions are used to populate the directed graph with hypernymy relations connecting t to the extracted hypernym h. Steps (2) and (3) are then iterated on the newly acquired hypernyms, until a termination condition occurs.

4. Graph Pruning (Section 3.4): As a result of the iterative phase we obtain a dense hypernym graph that potentially contains cycles and multiple hypernyms for most nodes. In this step we combine a novel weighting strategy with the Chu-Liu/Edmonds algorithm (Chu and Liu 1965; Edmonds 1967) to produce an optimal branching (i.e., a tree-like taxonomy) of the initial noisy graph (an illustrative code sketch of this step follows the list).

5. Edge Recovery (Section 3.5): Finally, we optionally apply a recovery strategy to reattach some of the hypernym edges deleted during the previous step, so as to produce a full-fledged taxonomy in the form of a DAG.
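To give a concrete feel for step 4, the sketch below runs the Chu-Liu/Edmonds algorithm, as implemented in networkx, on a toy hypernym graph. Edges are oriented hypernym → term so that a maximum-weight branching leaves each term with at most one hypernym; the graph and its weights are invented for illustration and do not reproduce the paper's actual weighting policy:

  import networkx as nx

  # Toy noisy hypernym graph. An edge (h, t, w) reads "h is a hypernym of t
  # with confidence w"; all nodes and weights here are invented.
  G = nx.DiGraph()
  G.add_weighted_edges_from([
      ("classifier", "neural network", 0.9),
      ("graph", "neural network", 0.3),   # spurious, out-of-domain hypernym
      ("algorithm", "classifier", 0.8),
      ("structure", "graph", 0.7),
      ("procedure", "algorithm", 0.6),
  ])

  # Chu-Liu/Edmonds: a maximum-weight branching is a forest in which every
  # node has at most one incoming edge, i.e., each term keeps at most one
  # hypernym and all cycles are broken.
  taxonomy = nx.maximum_branching(G)
  print(sorted(taxonomy.edges()))
  # -> the weak 'graph -> neural network' edge is dropped; the rest survives

In OntoLearn Reloaded the branching is computed over the graph produced by steps 1-3 using the paper's weighting policy, and step 5 then reattaches selected pruned edges to turn the resulting tree into a DAG.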

References


Stefano Faralli, Roberto Navigli, and Paola Velardi. (2013). “OntoLearn Reloaded: A Graph-Based Algorithm for Taxonomy Induction.” In: Computational Linguistics. doi:10.1162/COLI_a_00146