- (Bizer et al., 2009) ⇒ Christian Bizer, Tom Heath, and Tim Berners-Lee. (2009). “Linked Data - the Story So Far.” In: Special Issue on Linked Data, International Journal on Semantic Web and Information Systems (IJSWIS), 5(3). doi:10.4018/jswis.2009081901
Subject Headings: Linked Data.
The term Linked Data refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions - the Web of Data. In this article we present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. We describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
The World Wide Web has radically altered the way we share knowledge by lowering the barrier to publishing and accessing documents as part of a global information space. Hypertext links allow users to traverse this information space using Web browsers, while search engines index the documents and analyse the structure of links between them to infer potential relevance to users' search queries (Brin & Page, 1998). This functionality has been enabled by the generic, open and extensible nature of the Web (Jacobs & Walsh, 2004), which is also seen as a key feature in the Web's unconstrained growth.
Despite the inarguable benefits the Web provides, until recently the same principles that enabled the Web of documents to flourish have not been applied to data. Traditionally, data published on the Web has been made available as raw dumps in formats such as CSV or XML, or marked up as HTML tables, sacrificing much of its structure and semantics. In the conventional hypertext Web, the nature of the relationship between two linked documents is implicit, as HTML is not sufficiently expressive to enable individual entities described in a particular document to be connected by typed links to related entities.
However, in recent years the Web has evolved from a global information space of linked documents to one where both documents and data are linked. Underpinning this evolution is a set of best practices for publishing and connecting structured data on the Web known as Linked Data. The adoption of the Linked Data best practices has led to the extension of the Web with a global data space connecting data from diverse domains such as people, companies, books, scientific publications, films, music, television and radio programmes, genes, proteins, drugs and clinical trials, online communities, statistical and scientific data, and reviews. This Web of Data enables new types of applications. There are generic Linked Data browsers which allow users to start browsing in one data source and then navigate along links into related data sources. There are Linked Data search engines that crawl the Web of Data by following links between data sources and provide expressive query capabilities over aggregated data, similar to how a local database is queried today. The Web of Data also opens up new possibilities for domain-specific applications. Unlike Web 2.0 mashups which work against a fixed set of data sources, Linked Data applications operate on top of an unbound, global data space. This enables them to deliver more complete answers as new data sources appear on the Web.
The remainder of this article is structured as follows. In Section 2 we provide an overview of the key features of Linked Data. Section 3 describes the activities and outputs of the Linking Open Data project, a community effort to apply the Linked Data principles to data published under open licenses. The state of the art in publishing Linked Data is reviewed in Section 4, while Section 5 gives an overview of Linked Data applications. Section 6 compares Linked Data to other technologies for publishing structured data on the Web, before we discuss ongoing research challenges in Section 7.
Metadata
Linked Data should be published alongside several types of metadata, in order to increase its utility for data consumers. To enable clients to assess the quality of published data and to determine whether they want to trust it, data should be accompanied by meta-information about its creator, its creation date, and its creation method (Hartig, 2009). Basic provenance meta-information can be provided using Dublin Core terms or the Semantic Web Publishing vocabulary (Carroll et al., 2005). The Open Provenance Model (Moreau et al., 2008) provides terms for describing data transformation workflows. In (Zhao et al., 2008), the authors propose a method for providing evidence for RDF links and for tracing how the RDF links change over time.
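As a hedged illustration of the kind of provenance metadata described above, the sketch below emits Dublin Core statements (dcterms:creator, dcterms:created, dcterms:description) as N-Triples; the dataset URI, creator URI, and values are hypothetical.

```python
# Minimal sketch: attaching Dublin Core provenance statements to a data set.
# All URIs and literal values here are hypothetical examples.

DCTERMS = "http://purl.org/dc/terms/"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"

def provenance_triples(dataset_uri, creator_uri, created, method):
    """Emit N-Triples describing who created a data set, when, and how."""
    return [
        f"<{dataset_uri}> <{DCTERMS}creator> <{creator_uri}> .",
        f'<{dataset_uri}> <{DCTERMS}created> "{created}"^^<{XSD_DATE}> .',
        f'<{dataset_uri}> <{DCTERMS}description> "{method}" .',
    ]

for t in provenance_triples(
        "http://example.org/dataset/books",
        "http://example.org/people/alice",
        "2009-07-01",
        "Generated from a relational database via a declarative mapping"):
    print(t)
```

A consumer dereferencing the dataset URI would retrieve these statements alongside the data itself and could use them as input to a trust heuristic.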
In order to support clients in choosing the most efficient way to access Web data for the specific task they have to perform, data publishers can provide additional technical metadata about their data set and its interlinkage relationships with other data sets. The Semantic Web Crawling sitemap extension (Cyganiak et al., 2008) allows data publishers to state which alternative means of access (SPARQL endpoint, RDF dumps) are provided besides dereferenceable URIs. The Vocabulary of Interlinked Datasets (voiD) (Alexander et al., 2009) defines terms and best practices to categorize and provide statistical meta-information about data sets as well as the linksets connecting them.
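The shape of such a voiD description can be sketched as follows; this is an illustrative emitter only, using the core voiD terms (void:sparqlEndpoint, void:dataDump, void:linkPredicate, void:objectsTarget), with all data set URIs being hypothetical.

```python
# Hedged sketch: a voiD description of one data set and one linkset,
# emitted as N-Triples. Data set and target URIs are hypothetical.

VOID = "http://rdfs.org/ns/void#"

def void_description(dataset, endpoint, dump, target, predicate):
    """Describe a data set's access options and one linkset to another set."""
    linkset = f"{dataset}/links"
    return [
        f"<{dataset}> <{VOID}sparqlEndpoint> <{endpoint}> .",
        f"<{dataset}> <{VOID}dataDump> <{dump}> .",
        f"<{linkset}> <{VOID}linkPredicate> <{predicate}> .",
        f"<{linkset}> <{VOID}objectsTarget> <{target}> .",
    ]

for t in void_description("http://example.org/dataset/books",
                          "http://example.org/sparql",
                          "http://example.org/dumps/books.nt",
                          "http://example.org/dataset/authors",
                          "http://www.w3.org/2002/07/owl#sameAs"):
    print(t)
```

A crawler or query client can read such a description to decide, for instance, to fetch the dump rather than dereference millions of URIs individually.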
A variety of Linked Data publishing tools have been developed. The tools either serve the content of RDF stores as Linked Data on the Web or provide Linked Data views over non-RDF legacy data sources. The tools shield publishers from dealing with technical details such as content negotiation and ensure that data is published according to the Linked Data community best practices (Sauermann & Cyganiak, 2008; Berrueta & Phipps, 2008; Bizer & Cyganiak & Heath, 2007). All tools support dereferencing URIs into RDF descriptions. In addition, some of the tools also provide SPARQL query access to the served data sets and support the publication of RDF dumps.
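The content negotiation these tools handle can be sketched as follows: a request for a resource URI is answered with a 303 redirect to either an RDF or an HTML document, depending on the client's Accept header. The `/resource/`, `/data/`, and `/page/` path convention is illustrative, not mandated by any tool.

```python
# Simplified sketch of Linked Data content negotiation with 303 redirects.
# The URI path scheme used here is a common convention, not a fixed rule.

def negotiate(resource_path, accept_header):
    """Return the (status, location) a Linked Data server might answer with."""
    if "application/rdf+xml" in accept_header:
        # Semantic Web clients are redirected to the RDF description.
        return 303, resource_path.replace("/resource/", "/data/")
    # Browsers typically send text/html, so fall back to the HTML view.
    return 303, resource_path.replace("/resource/", "/page/")

print(negotiate("/resource/Berlin", "application/rdf+xml"))
print(negotiate("/resource/Berlin", "text/html"))
```

The 303 status signals that the requested URI identifies a real-world thing, while the redirect target is a document describing it.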
- D2R Server. D2R Server (Bizer & Cyganiak, 2006) is a tool for publishing non-RDF relational databases as Linked Data on the Web. Using a declarative mapping language, the data publisher defines a mapping between the relational schema of the database and the target RDF vocabulary. Based on the mapping, D2R server publishes a Linked Data view over the database and allows clients to query the database via the SPARQL protocol.
- Virtuoso Universal Server. The OpenLink Virtuoso server (Endnote: http://www.openlinksw.com/dataspace/dav/wiki/Main/VOSRDF) provides for serving RDF data via a Linked Data interface and a SPARQL endpoint. RDF data can either be stored directly in Virtuoso or can be created on the fly from non-RDF relational databases based on a mapping.
- Talis Platform. The Talis Platform (Endnote: http://www.talis.com/platform/) is delivered as Software as a Service accessed over HTTP, and provides native storage for RDF/Linked Data. Access rights permitting, the contents of each Talis Platform store are accessible via a SPARQL endpoint and a series of REST APIs that adhere to the Linked Data principles.
- Pubby. The Pubby server (Cyganiak & Bizer, 2008) can be used as an extension to any RDF store that supports SPARQL. Pubby rewrites URI requests into SPARQL DESCRIBE queries against the underlying RDF store. Besides RDF, Pubby also provides a simple HTML view over the data store and takes care of handling 303 redirects and content negotiation between the two representations.
- Triplify. The Triplify toolkit (Auer et al., 2009) supports developers in extending existing Web applications with Linked Data front-ends. Based on SQL query templates, Triplify serves a Linked Data and a JSON view over the application's database.
- SparqPlug. SparqPlug (Coetzee, Heath and Motta, 2008) is a service that enables the extraction of Linked Data from legacy HTML documents on the Web that do not contain RDF data. The service operates by serialising the HTML DOM as RDF and allowing users to define SPARQL queries that transform elements of this into an RDF graph of their choice.
- OAI2LOD Server. The OAI2LOD Server (Haslhofer & Schandl, 2008) is a Linked Data wrapper for document servers that support the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
- SIOC Exporters. The SIOC project has developed Linked Data wrappers for several popular blogging engines, content management systems and discussion forums such as WordPress, Drupal, and phpBB (Endnote: http://sioc-project.org/exporters).
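The relational-to-RDF mapping performed by tools such as D2R Server and Triplify can be sketched as below. This is a toy illustration only: the table, column, base URI, and use of FOAF property names are hypothetical, and the real tools use declarative mapping languages or SQL query templates rather than hard-wired code.

```python
# Toy sketch of a relational-to-RDF mapping in the spirit of D2R/Triplify.
# Base URI, vocabulary, and column-to-property mapping are hypothetical.

BASE = "http://example.org/resource/"
FOAF = "http://xmlns.com/foaf/0.1/"

def row_to_triples(table, primary_key, row):
    """Map one database row to N-Triples, minting a URI from its primary key."""
    subject = f"<{BASE}{table}/{row[primary_key]}>"
    triples = []
    for column, value in row.items():
        if column == primary_key:
            continue  # the key is encoded in the subject URI, not repeated
        triples.append(f'{subject} <{FOAF}{column}> "{value}" .')
    return triples

print(row_to_triples("person", "id", {"id": 7, "name": "Alice"}))
```

A dereferenceable URI minted this way lets each database row participate in the Web of Data without changing the underlying database.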
A service that helps publishers to debug their Linked Data site is the Vapour validation service (Endnote: http://vapour.sourceforge.net/). Vapour verifies that published data complies with the Linked Data principles and community best practices.
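The URI-rewriting step that a Pubby-style front-end performs over a SPARQL store can be sketched as follows; the mapping from web URIs to dataset URIs is a hypothetical example.

```python
# Sketch of Pubby-style request handling: a dereferenced Web URI is mapped
# back to the data set's URI and turned into a SPARQL DESCRIBE query.
# The base-URI mapping below is a hypothetical example.

def rewrite_to_describe(request_uri, web_base, data_base):
    """Map a served Web URI to the store's URI and build a DESCRIBE query."""
    resource_uri = request_uri.replace(web_base, data_base, 1)
    return f"DESCRIBE <{resource_uri}>"

print(rewrite_to_describe("http://localhost/resource/Berlin",
                          "http://localhost/resource/",
                          "http://dbpedia.org/resource/"))
```

The resulting RDF graph returned by the DESCRIBE query is then serialised as RDF or rendered as HTML, with content negotiation deciding between the two.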
5. Linked Data Applications
With significant volumes of Linked Data being published on the Web, numerous efforts are underway to research and build applications that exploit this Web of Data. At present these efforts can be broadly classified into three categories: Linked Data browsers, Linked Data search engines, and domain-specific Linked Data applications. In the following sections we examine each of these categories.
Linked Data Browsers