2009 LinkedDatatheStorySoFar


Subject Headings: Linked Data.



The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.

1. Introduction

The World Wide Web has radically altered the way we share knowledge by lowering the barrier to publishing and accessing documents as part of a global information space. Hypertext links allow users to traverse this information space using Web browsers, while search engines index the documents and analyse the structure of links between them to infer potential relevance to users' search queries (Brin & Page, 1998). This functionality has been enabled by the generic, open and extensible nature of the Web (Jacobs & Walsh, 2004), which is also seen as a key feature in the Web's unconstrained growth.

Despite the inarguable benefits the Web provides, until recently the same principles that enabled the Web of documents to flourish have not been applied to data. Traditionally, data published on the Web has been made available as raw dumps in formats such as CSV or XML, or marked up as HTML tables, sacrificing much of its structure and semantics. In the conventional hypertext Web, the nature of the relationship between two linked documents is implicit, as HTML is not sufficiently expressive to enable individual entities described in a particular document to be connected by typed links to related entities.

However, in recent years the Web has evolved from a global information space of linked documents to one where both documents and data are linked. Underpinning this evolution is a set of best practices for publishing and connecting structured data on the Web known as Linked Data. The adoption of the Linked Data best practices has led to the extension of the Web with a global data space connecting data from diverse domains such as people, companies, books, scientific publications, films, music, television and radio programmes, genes, proteins, drugs and clinical trials, online communities, statistical and scientific data, and reviews. This Web of Data enables new types of applications. There are generic Linked Data browsers which allow users to start browsing in one data source and then navigate along links into related data sources. There are Linked Data search engines that crawl the Web of Data by following links between data sources and provide expressive query capabilities over aggregated data, similar to how a local database is queried today. The Web of Data also opens up new possibilities for domain-specific applications. Unlike Web 2.0 mashups which work against a fixed set of data sources, Linked Data applications operate on top of an unbound, global data space. This enables them to deliver more complete answers as new data sources appear on the Web.

The remainder of this article is structured as follows. In Section 2 we provide an overview of the key features of Linked Data. Section 3 describes the activities and outputs of the Linking Open Data project, a community effort to apply the Linked Data principles to data published under open licenses. The state of the art in publishing Linked Data is reviewed in Section 4, while Section 5 gives an overview of Linked Data applications. Section 6 compares Linked Data to other technologies for publishing structured data on the Web, before we discuss ongoing research challenges in Section 7.

Metadata

Linked Data should be published alongside several types of metadata in order to increase its utility for data consumers. To enable clients to assess the quality of published data and to determine whether they want to trust it, data should be accompanied by meta-information about its creator, its creation date, and the creation method (Hartig, 2009). Basic provenance meta-information can be provided using Dublin Core terms or the Semantic Web Publishing vocabulary (Carroll et al., 2005). The Open Provenance Model (Moreau et al., 2008) provides terms for describing data transformation workflows. Zhao et al. (2008) propose a method for providing evidence for RDF links and for tracing how the RDF links change over time.
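As an illustration of basic provenance meta-information, the following minimal sketch emits N-Triples statements for a data set's creator, creation date, and creation method using Dublin Core terms. The function name, the example URIs, and the choice of `dcterms:description` to carry a free-text note about the creation method are assumptions made for this sketch; the article does not prescribe specific properties.

```python
from datetime import date

DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def provenance_triples(dataset_uri, creator_uri, created, method_note):
    """Return N-Triples lines stating who created a data set, when, and how.

    Hypothetical helper: the property used for the creation method
    (dcterms:description) is an assumption of this sketch.
    """
    return [
        f"<{dataset_uri}> <{DCTERMS}creator> <{creator_uri}> .",
        f'<{dataset_uri}> <{DCTERMS}created> "{created.isoformat()}"^^<{XSD_DATE}> .',
        f'<{dataset_uri}> <{DCTERMS}description> "{escape_literal(method_note)}" .',
    ]


# Example with placeholder URIs (not real data sets).
for line in provenance_triples(
    "http://example.org/dataset",
    "http://example.org/people/alice",
    date(2009, 3, 1),
    "Converted from a CSV dump",
):
    print(line)
```

Plain string formatting keeps the sketch dependency-free; a real publisher would more likely generate these statements with an RDF library or the publishing tool itself.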

In order to support clients in choosing the most efficient way to access Web data for the specific task they have to perform, data publishers can provide additional technical metadata about their data set and its interlinkage relationships with other data sets. The Semantic Web Crawling sitemap extension (Cyganiak et al., 2008) allows data publishers to state which alternative means of access (SPARQL endpoint, RDF dumps) are provided besides dereferenceable URIs. The Vocabulary Of Interlinked Datasets (Alexander et al., 2009) defines terms and best practices to categorize and provide statistical meta-information about data sets as well as the linksets connecting them.
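To make the idea of technical metadata concrete, the sketch below builds a small data set description with the Vocabulary Of Interlinked Datasets: the access methods (SPARQL endpoint, RDF dump), a triple count, and one linkset pointing at another data set. The term names (`void:sparqlEndpoint`, `void:dataDump`, `void:triples`, `void:subjectsTarget`, `void:objectsTarget`, `void:linkPredicate`) follow the vocabulary as later published and may differ from the 2009 draft the article cites; the helper names and all example URIs are hypothetical.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
VOID = "http://rdfs.org/ns/void#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def triple(subject, predicate, obj, literal_type=None):
    """Format one N-Triples line; obj is a URI unless literal_type is given."""
    rendered = f'"{obj}"^^<{literal_type}>' if literal_type else f"<{obj}>"
    return f"<{subject}> <{predicate}> {rendered} ."


def void_description(dataset, endpoint, dump, triples, linkset, target, link_predicate):
    """Describe a data set, its access methods, and one linkset to another data set."""
    return [
        triple(dataset, RDF_TYPE, VOID + "Dataset"),
        triple(dataset, VOID + "sparqlEndpoint", endpoint),
        triple(dataset, VOID + "dataDump", dump),
        triple(dataset, VOID + "triples", triples, XSD_INTEGER),
        triple(linkset, RDF_TYPE, VOID + "Linkset"),
        triple(linkset, VOID + "subjectsTarget", dataset),
        triple(linkset, VOID + "objectsTarget", target),
        triple(linkset, VOID + "linkPredicate", link_predicate),
    ]


# Example with placeholder URIs (not real endpoints).
description = void_description(
    dataset="http://example.org/void#books",
    endpoint="http://example.org/sparql",
    dump="http://example.org/dumps/books.nt",
    triples=120000,
    linkset="http://example.org/void#books-to-authors",
    target="http://example.net/void#authors",
    link_predicate="http://www.w3.org/2002/07/owl#sameAs",
)
print("\n".join(description))
```

A client reading such a description could decide to query the endpoint for a selective lookup or fetch the dump for bulk processing, without first crawling individual URIs.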

Publishing Tools

A variety of Linked Data publishing tools has been developed. The tools either serve the content of RDF stores as Linked Data on the Web or provide Linked Data views over non-RDF legacy data sources. The tools shield publishers from dealing with technical details such as content negotiation and ensure that data is published according to the Linked Data community best practices (Sauermann & Cyganiak, 2008; Berrueta & Phipps, 2008; Bizer, Cyganiak & Heath, 2007). All tools support dereferencing URIs into RDF descriptions. In addition, some of the tools also provide SPARQL query access to the served data sets and support the publication of RDF dumps.
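The content negotiation that these tools hide from publishers can be sketched roughly as follows. The example assumes the common 303-redirect pattern, in which a URI identifying a real-world thing redirects to either an RDF or an HTML document depending on the client's Accept header. The `/resource/`, `/data/` and `/page/` path layout, the function names, and the set of recognised media types are assumptions for illustration, not the behaviour of any particular tool.

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle"}
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_representation(accept_header):
    """Return "rdf" or "html" for the best-ranked media type we can serve."""
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in RDF_TYPES:
            return "rdf"
        if media_type in HTML_TYPES:
            return "html"
    return "html"  # assumed default for wildcard-only or missing headers


def dereference(resource_path, accept_header):
    """Return (status, Location) for a URI that identifies a real-world thing."""
    name = resource_path.rsplit("/", 1)[-1]
    if choose_representation(accept_header) == "rdf":
        return 303, f"/data/{name}"
    return 303, f"/page/{name}"


# A Linked Data client asks for RDF; a browser typically prefers HTML.
print(dereference("/resource/Berlin", "application/rdf+xml"))
print(dereference("/resource/Berlin", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
```

Defaulting to HTML when nothing matches is a design choice of this sketch; a stricter server could instead answer 406 Not Acceptable.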

The SIOC project has developed Linked Data wrappers for several popular blogging engines, content management systems and discussion forums such as WordPress, Drupal, and phpBB [ Endnote: http://sioc-project.org/exporters ].

A service that helps publishers to debug their Linked Data site is the Vapour validation service [ Endnote: http://vapour.sourceforge.net/ ]. Vapour verifies that published data complies with the Linked Data principles and community best practices.

5. Linked Data Applications

With significant volumes of Linked Data being published on the Web, numerous efforts are underway to research and build applications that exploit this Web of Data. At present these efforts can be broadly classified into three categories: Linked Data browsers, Linked Data search engines, and domain-specific Linked Data applications. In the following sections we examine each of these categories.

Linked Data Browsers



Author: Tim Berners-Lee, Christian Bizer, Tom Heath
Title: Linked Data - the Story So Far
Year: 2009