2004 ExploringLargeDocumentRepositor

(Stuckenschmidt et al., 2004) ⇒ Heiner Stuckenschmidt, Frank Van Harmelen, Anita De Waard, Tony Scerri, Ravinder Bhogal, Jan Van Buel, Ian Crowlesmith, Christiaan Fluit, Arjohn Kampman, Jeen Broekstra, and Erik van Mulligen. (2004). “Exploring Large Document Repositories with RDF Technology: The DOPE Project.” In: Intelligent Systems, IEEE Journal, 19(3). doi:10.1109/MIS.2004.9

Subject Headings: EMTREE Thesaurus, RDF Query

Notes

Cited By

Quotes

Author Keywords

information retrieval; thesauri; user studies; visualization

Abstract

This thesaurus-based search system uses automatic indexing, RDF-based querying, and concept-based visualization of results to support exploration of large online document repositories. Innovative research institutes rely on the availability of complete and accurate information about new research and development. Information providers such as Elsevier make it their business to provide the required information in a cost-effective way. The Semantic Web will likely contribute significantly to this effort because it facilitates access to an unprecedented quantity of data. The DOPE project (Drug Ontology Project for Elsevier) explores ways to provide access to multiple life-science information sources through a single interface.

1. Introduction

Innovative research institutes rely on the availability of complete and accurate information about new research and development. Information providers such as Elsevier make it their business to provide the required information in a cost-effective way. The Semantic Web will likely contribute significantly to this effort because it facilitates access to an unprecedented quantity of data. The DOPE project (Drug Ontology Project for Elsevier) explores ways to provide access to multiple life-science information sources through a single interface.

With the unremitting growth of scientific information, integrating access to all this information remains an important problem, primarily because the information sources involved are so heterogeneous. Sources might use different syntactic standards (syntactic heterogeneity), organize information in different ways (structural heterogeneity), and even use different terminologies to refer to the same information (semantic heterogeneity). Integrated access hinges on the ability to address these different kinds of heterogeneity.

Also, mental models and keywords for accessing data generally diverge between subject areas and communities; hence, many different ontologies have emerged. An ideal architecture must therefore support the disclosure of distributed and heterogeneous data sources through different ontologies. To serve this need, we’ve developed a thesaurus-based search system that uses automatic indexing, RDF-based querying, and concept-based visualization. We describe here the conversion of an existing proprietary thesaurus to an open standard format, a generic architecture for thesaurus-based information access, an innovative user interface, and results of initial user studies with the resulting DOPE system.

Thesaurus-based information access

Thesauri have proven to be essential for effective information access. They provide controlled vocabularies for indexing information and thereby help to overcome many free-text search problems by relating and grouping relevant terms in a specific domain. Thesauri in the life sciences include MeSH, produced by the US National Library of Medicine (www.nlm.nih.gov/mesh/meshhome.html), and EMTREE, Elsevier’s life science thesaurus (http://www.elsevier.com/homepage/sah/spd/site).

These thesauri provide access to information sources (in particular document repositories) such as PubMed (http://pubmed.org) and EMBASE.com (http://embase.com), but currently no open architecture exists to support using these thesauri for querying other data sources. For example, when we move from centralized, controlled use of EMTREE within EMBASE.com to a distributed setting, we must improve access to the thesaurus with a standardized representation using open data standards that allow for semantic qualifications. RDF (Resource Description Framework) is such a standard.
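To make the idea of a standardized RDF representation concrete, here is a minimal sketch of how a single thesaurus term could be written out as RDF statements in N-Triples syntax. The namespace `http://example.org/emtree#`, the property names (`preferredLabel`, `broader`, `synonym`), the term identifiers, and the sample terms are all assumptions made for illustration; the text does not give the vocabulary actually used in DOPE.

```python
# Hypothetical namespace and property names; not the vocabulary used by DOPE.
EX = "http://example.org/emtree#"

def triple(subject, predicate, obj, literal=False):
    """Format one RDF statement in N-Triples syntax.

    Literal values are assumed to contain no quotes or backslashes,
    so no escaping is done in this sketch.
    """
    o = '"%s"' % obj if literal else "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, o)

def term_to_triples(term_id, label, broader=None, synonyms=()):
    """Describe one preferred term: its label, its parent, its synonyms."""
    s = EX + term_id
    out = [triple(s, EX + "preferredLabel", label, literal=True)]
    if broader:
        out.append(triple(s, EX + "broader", EX + broader))
    for syn in synonyms:
        out.append(triple(s, EX + "synonym", syn, literal=True))
    return out

# Illustrative term; identifiers t1 and t2 are made up.
triples = term_to_triples("t2", "acetylsalicylic acid",
                          broader="t1", synonyms=["aspirin"])
print("\n".join(triples))
```

Because every statement is a plain subject-predicate-object triple, any RDF-aware store or query engine could load such data without knowing the thesaurus's original proprietary layout.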

Elsevier maintains the EMTREE thesaurus as a terminological resource for life science researchers. EMTREE is used to index EMBASE, a human-indexed online database. EMTREE currently contains the following information types.

Facets are broad topic areas that divide the thesaurus into independent hierarchies.
Each facet consists of a hierarchy of preferred terms used as index keywords to describe a resource’s information content. Facet names are not themselves preferred terms, and they cannot be used as index keywords. A term can occur in more than one facet; that is, EMTREE is poly-hierarchical.
Preferred terms are enriched by a set of synonyms — alternative terms that can be used to refer to the corresponding preferred term. A person can use synonyms to index or query information, but they will be normalized to the preferred term internally.
Links, a subclass of the preferred terms, serve as subheadings for other index keywords. They denote a context or aspect for the main term to which they are linked. Two kinds of link terms, drug-links and disease-links, can be used as subheadings for a term denoting a drug or a disease.
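The information types listed above can be sketched as a small data model. This is only an illustration under assumptions: the class and attribute names, the facet name, and the sample term are hypothetical and do not reflect EMTREE's actual storage format. The sketch shows two properties named in the text: a term may belong to several facets, and a synonym is normalized to its preferred term.

```python
from dataclasses import dataclass, field

@dataclass
class PreferredTerm:
    label: str
    facets: set = field(default_factory=set)    # more than one facet allowed (poly-hierarchy)
    synonyms: set = field(default_factory=set)  # alternative names for this term
    is_link: bool = False                       # True for drug-link / disease-link subheadings

class Thesaurus:
    def __init__(self):
        self.terms = {}    # preferred label -> PreferredTerm
        self._lookup = {}  # lowercased name -> preferred label

    def add(self, term):
        self.terms[term.label] = term
        # Both the preferred label and every synonym resolve to the preferred label.
        for name in {term.label} | term.synonyms:
            self._lookup[name.lower()] = term.label

    def normalize(self, word):
        """Return the preferred term for a word, or None if it is unknown."""
        return self._lookup.get(word.lower())

emtree = Thesaurus()
emtree.add(PreferredTerm("acetylsalicylic acid",
                         facets={"chemicals and drugs"},  # hypothetical facet name
                         synonyms={"aspirin"}))
print(emtree.normalize("Aspirin"))  # prints "acetylsalicylic acid"
```

Facet names are kept out of the lookup table, mirroring the rule that they cannot be used as index keywords.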

EMTREE 2003 contains about 45,000 preferred terms and 190,000 synonyms organized in a multilevel hierarchy. The EMTREE thesaurus serves primarily as a normalized vocabulary for matching user requests against documents in the target sources. This project uses natural language technology provided by Collexis (www.collexis.com) [1] to automatically index documents in several different repositories with keywords from EMTREE. A Collexis fingerprint server houses the results and can be queried via a SOAP interface. (A Collexis fingerprint is a very small representation of the characteristic concepts in a piece of source text.)

Natural language frequently refers to the same concept in several ways. The SOAP interface contains an indexing engine that uses EMTREE’s synonym relations to return keywords most likely to be relevant to a given search input string. Also, EMTREE’s hierarchical relations can identify keywords more specific than the target keyword, letting users expand their searches and thus gain much better recall. The results are ordered by relevance.
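The recall gain from hierarchical expansion can be illustrated with a short sketch. It is not the Collexis engine or its SOAP interface: the function names, the `narrower` map, the toy index, and the relevance scores are all assumptions chosen for the example.

```python
def expand(term, narrower):
    """Return the term plus all transitively narrower terms."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:  # a term reachable via several parents is visited once
            seen.add(t)
            stack.extend(narrower.get(t, ()))
    return seen

def search(term, narrower, index):
    """Rank documents indexed under the term or any narrower term.

    index maps keyword -> {document id: relevance score}; a document
    keeps the best score it gets from any matching keyword.
    """
    scores = {}
    for kw in expand(term, narrower):
        for doc, score in index.get(kw, {}).items():
            scores[doc] = max(scores.get(doc, 0.0), score)
    # Highest relevance first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy data, made up for illustration.
narrower = {"analgesic agent": ["acetylsalicylic acid", "paracetamol"]}
index = {
    "analgesic agent": {"d1": 0.4},
    "acetylsalicylic acid": {"d2": 0.9, "d1": 0.2},
    "paracetamol": {"d3": 0.6},
}
results = search("analgesic agent", narrower, index)
print(results)  # prints ['d2', 'd3', 'd1']
```

Without expansion the query would match only `d1`; including the narrower terms also surfaces `d2` and `d3`, which is the recall improvement the text describes.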

Among our challenges was identifying the minimal set of metadata (from each source) to be stored. The user interface assumes that several metadata fields are available for retrieval or display. The DOPE prototype uses indexes of the full content of ScienceDirect (full-text articles) and the last 10 years of Medline. These sources have different sets of metadata, and future DOPE versions will standardize them using the Dublin Core Metadata Initiative (http://dublincore.org). In general, however, DOPE permits easy inclusion of new data sources.
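One way to picture the planned standardization is a per-source mapping from native field names to Dublin Core element names. The sketch below is an assumption-laden illustration: the source field names, the choice of elements, and the sample record are hypothetical, and the text does not say how DOPE would implement the mapping. Only the element names `title`, `creator`, `date`, and `identifier` come from the Dublin Core element set.

```python
# Hypothetical native field names for each source, mapped to Dublin Core elements.
FIELD_MAPS = {
    "sciencedirect": {
        "article_title": "title",
        "authors": "creator",
        "pub_date": "date",
        "doi": "identifier",
    },
    "medline": {
        "TI": "title",
        "AU": "creator",
        "DP": "date",
        "PMID": "identifier",
    },
}

def to_dublin_core(source, record):
    """Keep only the mapped fields of a record, renamed to Dublin Core elements."""
    mapping = FIELD_MAPS[source]
    return {dc: record[src] for src, dc in mapping.items() if src in record}

# Made-up record; the unmapped field "XX" is dropped.
normalized = to_dublin_core("medline", {"TI": "Sample title", "PMID": "123", "XX": 1})
print(normalized)
```

Adding a new data source then amounts to adding one more entry to the mapping table, which fits the text's remark that new sources should be easy to include.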

…

References

1. E.M. Van Mulligen, et al., "Research for Research: Tools for Knowledge Discovery and Visualization," Proc. 2002 AMIA Ann. Symp., Am. Medical Informatics Assoc., 2002, pp. 835-839.
2. Jeen Broekstra, Arjohn Kampman, and Frank Van Harmelen, "Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema," Proceedings of the First International Semantic Web Conference on The Semantic Web, June 09-12, 2002, pp. 54-68.
3. D. Beckett, ed., RDF/XML Syntax Specification (Revised), W3C Recommendation, W3C, 10 Feb. 2004, www.w3.org/TR/rdf-syntax-grammar.
4. RDF Vocabulary Description Language 1.0: RDF Schema, W3C Recommendation, W3C, 10 Feb. 2004, www.w3.org/TR/rdf-schema.
5. J. Broekstra and A. Kampman, "SeRQL: Querying and Transformation with a Second-Generation Language," Technical White Paper, Aduna/Vrije Universiteit Amsterdam, Jan. 2004.
6. C. Fluit, M. Sabou, and F. Van Harmelen, "Ontology-Based Information Visualization," Visualizing the Semantic Web, V. Geroimenko and C. Chen, eds., Springer-Verlag, 2003, pp. 36-48.
7. C. Fluit, M. Sabou, and F. Van Harmelen, "Supporting User Tasks through Visualization of Lightweight Ontologies," Handbook on Ontologies in Information Systems, S. Staab and R. Studer, eds., Springer-Verlag, 2003, pp. 415-432.
8. Heiner Stuckenschmidt, Richard Vdovjak, Geert-Jan Houben, and Jeen Broekstra, "Index Structures and Algorithms for Querying Distributed RDF Repositories," Proceedings of the 13th International Conference on World Wide Web, May 17-20, 2004, New York, NY, USA. doi:10.1145/988672.988758
9. M. Weeber, et al., "Generating Hypotheses by Discovering Implicit Associations in the Literature: A Case Report of a Search for New Potential Therapeutic Uses for Thalidomide," J. Am. Med. Informatics Assoc., vol. 10, no. 3, May-June 2003, pp. 252-259.


	Author	volume	Date Value	title	type	journal	titleUrl	doi	note	year
2004 ExploringLargeDocumentRepositor	Jeen Broekstra Arjohn Kampman Heiner Stuckenschmidt Frank Van Harmelen Tony Scerri Ravinder Bhogal Jan Van Buel Ian Crowlesmith Christiaan Fluit Erik van Mulligen Anita de Waard			Exploring Large Document Repositories with RDF Technology: The DOPE Project				10.1109/MIS.2004.9		2004