2014 KnowledgebasedGraphDocumentMode

From GM-RKB

Subject Headings: Entity Mention Linking, DBpedia.

Notes

Cited By

Quotes

Author Keywords

Abstract

We propose a graph-based semantic model for representing document content. Our method relies on the use of a semantic network, namely the DBpedia knowledge base, for acquiring fine-grained information about entities and their semantic relations, thus resulting in a knowledge-rich document model. We demonstrate the benefits of these semantic representations in two tasks: entity ranking and computing document semantic similarity. To this end, we couple DBpedia's structure with an information-theoretic measure of concept association, based on its explicit semantic relations, and compute semantic similarity using a Graph Edit Distance based measure, which finds the optimal matching between the documents' entities using the Hungarian method. Experimental results show that our general model outperforms baselines built on top of traditional methods, and achieves a performance close to that of highly specialized methods that have been tuned to these specific tasks.

1. INTRODUCTION

Recent years have seen a great deal of work on developing wide-coverage semantic technologies and methods embedding semantic models within a wide spectrum of applications, crucially including end-user applications like, for instance, question answering [16, 47], document search [14] and web-search results clustering [34]. Complementary to this trend, many research efforts have concentrated on the automatic acquisition of machine-readable knowledge on a large scale by mining large repositories of textual data such as the Web [2, 9] (inter alia), and exploiting collaboratively constructed resources [40, 36, 21, 25]. As a result, recent years have seen a remarkable renaissance of knowledge-rich approaches for many different Natural Language Processing (NLP) and Information Retrieval (IR) tasks [27].

But while recent research trends indicate that semantic information and knowledge-rich approaches can be used effectively for high-end IR and NLP tasks, much still remains to be done in order to effectively exploit these rich models and further advance the state of the art in these fields. Most approaches that draw upon document representations, in fact, rely solely on morpho-syntactic information by means of `flat' meaning representations like vector space models [44]. Although more sophisticated models have been proposed, including conceptual [17] and grounded [8] vector spaces, these still do not exploit the relational knowledge and network structure encoded within wide-coverage knowledge bases such as YAGO [25] or DBpedia [6].

In this paper, we aim at overcoming these issues by means of a knowledge-rich method to represent documents in the Web of Linked Data. Key to our approach is the combination of a fine-grained relation vocabulary with information-theoretic measures of concept associativity to produce a graph-based interpretation of texts leveraging large amounts of structured knowledge, i.e., disambiguated entities and explicit semantic relations, encoded within DBpedia. Our contributions are as follows:

  • We propose a graph-based document model and present a method to produce structured representations of texts that combine disambiguated entities with fine-grained semantic relations;

  • We present a variety of information-theoretic measures to weight different semantic relations within an ontology, and automatically quantify their degree of relevance with respect to the concepts they connect. Edges in the semantic graphs are thus weighted so as to capture the degree of associativity between concepts, as well as their different levels of specificity;

  • We demonstrate the benefits of our model in two tasks, namely entity ranking and computing document similarity. We show that our approach not only outperforms standard baselines relying on traditional, i.e., `flat', document representations, but also produces results close to those of highly specialized methods that have been particularly tuned to the respective tasks;

  • We develop a new measure, based on graph edit distance techniques, in order to compute document similarity using our semantic graphs. Our approach views computing semantic distances within an ontology as a concept matching problem, and uses the Hungarian method for solving this combinatorial optimization problem.
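As a concrete illustration of this matching step, the sketch below casts document distance as an optimal assignment between two documents' entity sets and solves it with the Hungarian method (via SciPy's `linear_sum_assignment`). The entity names and the trivial 0/1 entity distance are illustrative stand-ins, not the paper's actual weighted DBpedia path measure.

```python
# Hypothetical sketch: document similarity as an optimal assignment between
# two entity sets, solved with the Hungarian method. The cost function is a
# placeholder; the real model would use knowledge-based distances (e.g.
# weighted paths between entities in DBpedia).
import numpy as np
from scipy.optimize import linear_sum_assignment

def entity_distance(e1, e2):
    # Placeholder 0/1 distance, for illustration only.
    return 0.0 if e1 == e2 else 1.0

def document_distance(entities_a, entities_b):
    """Minimum-cost matching between two documents' entity sets."""
    # Pad the cost matrix so unmatched entities incur the maximal cost of 1.
    n = max(len(entities_a), len(entities_b))
    cost = np.ones((n, n))
    for i, ea in enumerate(entities_a):
        for j, eb in enumerate(entities_b):
            cost[i, j] = entity_distance(ea, eb)
    rows, cols = linear_sum_assignment(cost)  # Hungarian method, O(n^3)
    return cost[rows, cols].sum() / n         # normalized cost in [0, 1]

doc_a = ["Barack_Obama", "United_States", "NATO"]
doc_b = ["Barack_Obama", "European_Union"]
print(document_distance(doc_a, doc_b))  # one of three entities matches at zero cost
```

In practice the padding cost and the normalization would be design choices of their own; the point here is only that the matching itself reduces to a standard assignment problem.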

As a result of this, we are able to provide a complete framework where documents are semantified by linking them to a reference knowledge base, and subgraphs of the knowledge resource are used for complex language understanding tasks like entity ranking and document semantic similarity. Results on entity ranking show that our weighting scheme helps us better estimate entity relatedness when compared with using simple, unweighted paths. Moreover, by complementing large amounts of knowledge with structured text representations we are able to achieve robust performance on the task of computing document semantic similarity, thus competing with `flat' approaches based on either word or conceptual vector spaces, while at the same time providing a general, de facto parameter-free model.
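One simple way to realize such an information-theoretic edge weight, shown here as a hedged sketch rather than the paper's exact formula, is to score a DBpedia edge by the information content of its predicate-object combination, so that rare, specific relations outweigh generic ones such as `rdf:type owl:Thing`. The triple counts below are toy data.

```python
# Illustrative sketch (not the authors' exact measure): weight each edge by
# the information content -log P(predicate, object), estimated from triple
# frequencies, so specific relations score higher than generic ones.
import math
from collections import Counter

# Toy triple store: (subject, predicate, object)
triples = [
    ("Barack_Obama", "dbo:party", "Democratic_Party"),
    ("Joe_Biden", "dbo:party", "Democratic_Party"),
    ("Barack_Obama", "rdf:type", "owl:Thing"),
    ("Joe_Biden", "rdf:type", "owl:Thing"),
    ("NATO", "rdf:type", "owl:Thing"),
    ("Barack_Obama", "dbo:birthPlace", "Honolulu"),
]

pred_obj_counts = Counter((p, o) for _, p, o in triples)
total = len(triples)

def edge_weight(predicate, obj):
    """Information content of a predicate-object pair: rarer means heavier."""
    p = pred_obj_counts[(predicate, obj)] / total
    return -math.log(p)

# A specific relation carries more information than a generic type assertion.
assert edge_weight("dbo:birthPlace", "Honolulu") > edge_weight("rdf:type", "owl:Thing")
```

Estimated over the full DBpedia graph rather than six toy triples, this kind of weight penalizes near-ubiquitous edges and promotes the discriminative ones, which is the behavior the weighting scheme above is after.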

5. RELATED WORK

Recent years have seen a great deal of work on computing semantic similarity [49]. This is arguably because semantic similarity provides a valuable model of semantic compatibility that is widely applicable to a variety of complex tasks, including both pre-processing tasks like Word Sense Disambiguation [38] and coreference resolution [39], and high-end applications such as information retrieval [14] or multi-document summarization [33], to name a few.

Most of the previous work on semantic similarity has concentrated on computing pairwise similarity of words, although recent efforts concentrated on the broader task of text similarity [4], as also shown by community efforts such as the shared tasks on Semantic Textual Similarity [1]. Overall, the best results in these evaluation campaigns have been obtained by supervised models combining large feature sets [3, 45], although questions remain on whether this approach can be easily ported to domains for which no labeled data exists. In contrast, in this work we presented an unsupervised model that requires virtually no parameter tuning and exploits the implicit supervision provided by very large amounts of structured knowledge encoded in DBpedia.

This work is, to the best of our knowledge, the first to exploit a wide-coverage ontology (i.e., other than small-scale semantic lexicons like WordNet) within a general-purpose algorithm for computing semantic similarity based on a graph-based similarity measure. Our method effectively uses large amounts of structured knowledge and can be used in principle with other such resources like, e.g., YAGO [25], provided they contain explicit semantic relations. Seminal work on representing natural language as semantic networks focused on queries [7]. Recently, graph-based representations from DBpedia have been explored by [28] for labeling topics, as obtained from a topic model, rather than providing structured representations of arbitrary texts. In addition, they limit graph construction to a small set of manually selected DBpedia relations. The work closest to ours is that of [42], who use graph-based representations of snippets for Web search results clustering. Their method also builds a document-based semantic graph from Wikipedia concepts, as obtained from the output of an entity disambiguator. However, similarly to [43], they do not exploit explicit semantic relations between entities (which we show to be beneficial for both entity ranking and semantic similarity).

Previous work in computing semantic distances on linked data relied on disambiguated input [37], a requirement which is very hard to satisfy for most applications working with natural language text. In contrast, our approach relies on automatic entity linking techniques, which allow us to link entity mentions in text to well-defined entities within an ontology. From a general perspective, our work can be viewed as building upon seminal research work in IR that explored the use of controlled vocabularies [30], originally introduced for library systems. The proposed method can thus be seen as an instance of an advanced Knowledge Organization System (KOS) [48, 13], since it relies at its core on a wide-coverage ontology to represent documents. However, as opposed to these approaches, we do not create a controlled vocabulary for a specific document collection, but instead reuse an existing, background ontology which contains general world knowledge. We use this knowledge source to represent the entities found in documents, as opposed to using the documents' headings or metadata. The Jaccard similarity we report in Section 4.3 is, in fact, a baseline method that uses DBpedia as a controlled vocabulary: we build upon this intuition and extend it by using the information encoded within the structure of the DBpedia network.
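The Jaccard baseline mentioned above can be sketched as follows: DBpedia entities act as a controlled vocabulary, and two documents are compared as plain entity sets, ignoring the network structure entirely. The entity sets here are illustrative.

```python
# Sketch of a Jaccard baseline over linked entities: set overlap only,
# no use of the knowledge graph's structure.
def jaccard(entities_a, entities_b):
    a, b = set(entities_a), set(entities_b)
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)

doc_a = {"Barack_Obama", "United_States", "NATO"}
doc_b = {"Barack_Obama", "NATO", "European_Union"}
print(jaccard(doc_a, doc_b))  # 2 shared entities out of 4 distinct -> 0.5
```

This is exactly the kind of structure-blind comparison the graph-based model extends: two documents that mention related but non-identical entities score zero here, whereas paths in DBpedia can still connect them.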

6. CONCLUSIONS

In this paper, we proposed a method for exploiting large amounts of machine-readable knowledge, i.e., entities and semantic relations, encoded within DBpedia, in order to provide a structured, i.e. graph-based, representation of natural language texts. Our results on entity ranking and document semantic similarity indicate that, thanks to an effective weighting of the semantic relations found within the semantic network, as well as a robust concept matching technique, we are able to achieve competitive performance on both these hard NLP tasks, while at the same time providing an unsupervised model which is practically parameter-free, namely one whose only tunable parameter can be fixed based on well-established findings from previous work.

This is the first proposal to exploit a Web-scale ontology to provide structured representations of document content and to compute semantic distances in a knowledge-rich fashion. We build thematically upon previous contributions which showed the beneficial effect of exploiting large amounts of knowledge for enhancing text comparison [46, 35] (inter alia). In our work, we take this line of research one step further by: (a) using a truly ontological resource for content modeling (as opposed, e.g., to semantic lexicons such as WordNet); (b) developing an information-theoretic measure to identify semantically specific, highly informative relations between entities in a large knowledge graph; (c) defining a new method, based on graph edit distance techniques, to quantify degrees of semantic similarity between documents: this views semantic similarity as a concept matching problem and uses the Hungarian method for solving the combinatorial optimization problem.

Our vision, ultimately, is to show how entity linking and disambiguation techniques can enable an open-domain structured representation of documents, and accordingly an even larger Web of Semantic Data, on which semantic technologies (e.g., search) can be enabled. Accordingly, in this initial step we focused primarily on entities, since they make up the bulk of wide-coverage knowledge resources like DBpedia. Clearly, extending this entity-centric model, for instance by means of event-structured graphs [20] or RDF predicates [19], is the next logical step. Besides, as future work we plan to develop methods to jointly perform entity disambiguation and compute semantic similarity. We are also interested in applying our techniques within domains other than newswire data, and investigating domain adaptation techniques for the graph construction phase. Our graphs naturally model fine-grained information about documents: accordingly, we will explore their application to complex, high-end tasks such as aspect-oriented IR, as well as fine-grained document classification and clustering for IR.


References

Michael Schuhmacher, and Simone Paolo Ponzetto. (2014). "Knowledge-based Graph Document Modeling." doi:10.1145/2556195.2556250