2015 ROCKERARefinementOperatorforKey

From GM-RKB
Jump to navigation Jump to search

Subject Headings: Refinement Operator, ROCKER, Key Discovery, Linked Data.

Notes

  • Copyright: Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author’s site if the Material is used in electronic media.

Cited By =

Quotes

Author Keywords

Abstract

The Linked Data principles provide a decentral approach for publishing structured data in the RDF format on the Web. In contrast to structured data published in relational databases where a key is often provided explicitly, finding a set of properties that allows identifying a resource uniquely is a non-trivial task. Still, finding keys is of central importance for manifold applications such as resource deduplication, link discovery, logical data compression and data integration. In this paper, we address this research gap by specifying a refinement operator, dubbed ROCKER, which we prove to be finite, proper and non-redundant. We combine the theoretical characteristics of this operator with two monotonicities of keys to obtain a time-efficient approach for detecting keys, i.e., sets of properties that describe resources uniquely. We then utilize a hash index to compute the discriminability score efficiently. Therewith, we ensure that our approach can scale to very large knowledge bases. Results show that ROCKER yields more accurate results, has a comparable runtime, and consumes less memory w.r.t. existing state-of-the-art techniques.

1. INTRODUCTION

The number of facts published in the Linked Data Web has grown considerably over the last years [3]. In particular, large knowledge bases such as LinkedTCGA [20] and LinkedGeoData [24] encompass more than 20 billion triples each. The architectural principles behind the Linked Data Web are akin to those on the Web. In particular, the decentral data publication process leads to facts on the same reaI—world entities being published across manifold knowledge bases. For example, information on Austin, Teams is distributed across several knowledge bases, including DBpedia [1], LinkedGeoData and GeoNaInes [2]. Given the size of the current Linked Data datasets, providing unique means to characterize resources within existing datasets would facilitate the use of these knowledge bases, for example within the context of entity search, data integration, linked data compression and link discovery [18]. Especially for the link discovery task, being provided with unique descriptions of resources in a knowledge base would allow for the more time-efficient computation of property matchings for link specifications, a task that has been pointed out to be particularly tedious in previous work [4].

In relational databases, keys are commonly either artificial or sets of columns that allow to describe each resource uniquely. Previous works [18, 2, 6] adopt this approach for uniquely describing RDF data and use properties instead of columns. Several problems occur when trying to detect keys for RDF data.

1. Resources from the same datasets might not all have the same properties. For example, in the fragment of DBpedia 3.9 shown in Figure 1, only 50% 0f the resources have a :meshNumber. Thus, while the :mesh- Number is unique, it cannot be used as a key for this dataset.

2. The inverse problem exists for the :graySub j ect, which covers all resources but is not unique as the trigeminal nerve and the lacrimal nerve have the same :graySub- ject. For our toy dataset, only keys of size larger that 1 exist (e.g., {:graySubject, :grayPage}).

3. The key discovery problem is exponential in the number of properties 71 in the knowledge base, as the solution space contains 2" 1 possible sets of keys. Thus, naive solutions to the key discovery problem do not scale.

Moreover, depending on the use case, key discovery approaches have to be able to detect a single key (e.g., to link resources within a knowledge base) or to detect all keys for a resource (e.g., when integrating data across knowledge bases).

In this paper, we address the three problems of key discovery within both settings of key discovery (i.e., finding all keys or alInost—keys within a given threshold) by using a refinement operator dubbed p. This operator is able to detect sets of properties that describe any instance of a given class in a unique manner. By these means, it can generate n—tuples of property values that can be used as keys for resources which instantiate a given class. Our operator relies on a scoring function to compare sets of properties. Based on this comparison, it can efficiently detect single keys, all keys and even predict whether a key can be found in a given dataset. In addition to being finite, non—redundant and proper, our operator also scales well and can thus be used on large knowledge bases. Our contributions are:

  • We provide the first refinement operator for key discovery on RDF knowledge bases.
  • We prove that our operator is finite, non—redundant, proper, but not complete.
  • We utilize the combination of a hash index to compute the discriminability score, i.e. the ability for a set of properties to distinguish their subjects, with two Inonotonicities of keys to prune the refinement tree and thus ensure that our operator scales.
  • We show that our approach succeeds on datasets where current state—of—the—art approaches fail.
  • We evaluate our operator on the OAEI instance matching benchmark datasets as well as on DBpedia classes with large populations. In particular, we measure the overall runtime, the memory consumption and the reduction ratio of our approach. Our results suggest that we outperform the state of the art w.r.t. correctness and memory consumption. Moreover, our results suggest that our approach terminates within an acceptable time frame even on very large datasets.

The rest of this paper is structured as follows: We begin by defining the problem at hand formally. Thereafter, we present our operator and prove its theoretical characteristics. After a discussion of related work, we evaluate our operator on synthetic and real data. We then conclude and present some future work.

Figure 1: Fragment from a knowledge base on human nerves. The fragment was extracted from DBpedia 3.9.

2. PRELIMINARIES

2.1 Keys

2.2 Discriminability

2.3 Properties of a key

3. A REFINEMENT OPERATOR FOR KEY DISCOVERY

4. APPROACH

4.1 Implementation

4.2 Definition of the score function

4.3 Refinement Operator

4.4 Search Strategy

5. RELATED WORK

6. EVALUATION

6.1 Experimental Setup

8. CONCLUSION AND FUTURE WORK =

9. ACKNOWLEDGMENTS

FOOTNOTES

  1. 1 http: //dbpedia.org
  2. http://www.geona.mes.org

References

;

 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2015 ROCKERARefinementOperatorforKeyTommaso Soru
Edgard Marx
Axel-Cyrille Ngonga Ngomo
ROCKER: A Refinement Operator for Key Discovery10.1145/2736277.27416422015